lorenzkuhn / semantic_uncertainty

MIT License

Bug doubling the duplicates in the AUROC evaluations #8

Open SebGGruber opened 1 year ago

SebGGruber commented 1 year ago

Hi,

I found a bug in the current evaluation script that doubles the duplicates in the dataset. The offending line is located here and looks like: `result_df = generations_df.merge(similarities_df, on='id').merge(likelihoods_df, on='id')`

The bug happens because merge creates additional rows whenever an id is duplicated in both dataframes. Since generations_df has duplicates but similarities_df does not, the first merge is fine and introduces no new rows. The problem is the second merge, because likelihoods_df also has duplicates: for an id that appears k times in each frame, the merge produces k × k rows instead of k. As a consequence, all evaluation metrics are weighted more heavily toward the already existing duplicates. I am not sure how large the impact on the results in your paper is, since I did not manage to reproduce your original results in the first place. If you can provide me with the exact numbers behind the figures in your paper, I can compare our evaluations more precisely.
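To make the row-multiplication concrete, here is a minimal sketch with hypothetical toy dataframes (the column names and values are illustrative, not taken from the actual script):

```python
import pandas as pd

# Toy frames mimicking the shapes in the evaluation script:
# id 1 appears twice in both generations_df and likelihoods_df.
generations_df = pd.DataFrame({'id': [1, 1, 2], 'generation': ['a', 'b', 'c']})
similarities_df = pd.DataFrame({'id': [1, 2], 'similarity': [0.9, 0.5]})
likelihoods_df = pd.DataFrame({'id': [1, 1, 2], 'likelihood': [0.1, 0.2, 0.3]})

# First merge: similarities_df has unique ids, so no rows are added.
step1 = generations_df.merge(similarities_df, on='id')
print(len(step1))  # 3 rows, same as generations_df

# Second merge: id 1 occurs twice on both sides, so merge emits
# 2 * 2 = 4 rows for it instead of 2 -- the duplicates are doubled.
result_df = step1.merge(likelihoods_df, on='id')
print(len(result_df))  # 5 rows: 4 for id 1, 1 for id 2
```

The 5 rows instead of the expected 3 show exactly the inflation described above.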

A potential fix would be to replace the line above with `result_df = pd.concat([generations_df, likelihoods_df.drop('id', axis=1)], axis=1).merge(similarities_df, on='id')`

This works because the rows in generations_df and likelihoods_df are in the same order, so no matching logic is needed (confirm via `(generations_df['id'] == likelihoods_df['id']).all()`).

Greetings, Sebastian