Closed: jfc43 closed this issue 1 year ago
Dear Jiefeng,
You are indeed correct that the code uses bi-directional non-contradiction as the criterion for semantic equivalence. I made a few changes to the public release of the code before and after the paper's publication, so I will re-run our experiments to verify our results and check whether these updates introduced any bugs. I am also working on a simplified implementation of semantic entropy, which I hope to release soon; it should make the method easier to use in future experiments.
Best, Lorenz
Hi Lorenz,
Any news on this issue? You closed it, but the reproducibility problem persists for me.
Best, Sebastian
I can roughly reproduce the results for the normalized predictive entropy baseline. However, I fail to reproduce the results for the semantic entropy method; the numbers I get for semantic entropy are actually slightly worse than the normalized predictive entropy baseline.

I also find that the implemented check for whether two answers are equivalent differs from what is described in the paper. The paper says: "The Deberta model then classifies this sequence into one of: entailment, neutral, contradiction. We compute both directions, and the algorithm returns equivalent if and only if both directions were entailment." However, in the code (https://github.com/lorenzkuhn/semantic_uncertainty/blob/main/code/get_semantic_similarities.py#L109-L114), the implemented condition appears to be that two answers are equivalent if neither direction is classified as contradiction. Please check whether this is correct. Thanks!
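For concreteness, here is a minimal sketch of the two criteria under discussion. It assumes an off-the-shelf NLI model such as `microsoft/deberta-large-mnli` (whose label order is contradiction, neutral, entailment); the model choice, function names, and label ids are illustrative assumptions, not the repository's exact code:

```python
# Sketch of the two equivalence criteria discussed above. Assumes an MNLI
# model with label order (contradiction, neutral, entailment); this is an
# illustration, not the repository's actual implementation.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "microsoft/deberta-large-mnli"  # assumed NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

CONTRADICTION, NEUTRAL, ENTAILMENT = 0, 1, 2  # label order for this model


def nli_label(premise: str, hypothesis: str) -> int:
    """Classify the (premise, hypothesis) pair into an MNLI label id."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return int(logits.argmax(dim=-1))


def equivalent_paper(a: str, b: str) -> bool:
    """Paper's criterion: both directions must be classified as entailment."""
    return nli_label(a, b) == ENTAILMENT and nli_label(b, a) == ENTAILMENT


def equivalent_code(a: str, b: str) -> bool:
    """Criterion in the linked code: neither direction is a contradiction."""
    return nli_label(a, b) != CONTRADICTION and nli_label(b, a) != CONTRADICTION
```

Note that bi-directional non-contradiction is a strictly weaker condition than bi-directional entailment: it also merges answer pairs the model labels as neutral, which can produce larger equivalence clusters and hence different semantic entropy estimates.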