result inconsistency - Githubissues

princeton-nlp / c-sts

[EMNLP 2023] C-STS: Conditional Semantic Textual Similarity


result inconsistency #2

Closed baochi0212 closed 10 months ago

baochi0212 commented 11 months ago

Dear team,

Thanks for the impactful work. I am having trouble reproducing the results in the paper: when I evaluate Flan-T5 with k2_short/k2_long, the Spearman correlation on the test set differs from what the paper reports. Here is an example for T5 large in the k2 setting. Could you please help me clarify this?

SHORT: results via email: 0.13; results in paper: 10.9
LONG: results via email: 0.13; results in paper: 4.4

Thanks, team.

baochi0212 commented 11 months ago

Hope you can help me soon, thanks!

carlosejimenez commented 11 months ago

Hello @baochi0212,

The T5 scores previously reported in the arXiv version of this paper are incorrect because they were mistakenly evaluated in half precision. An updated version of the paper will be released shortly with more accurate numbers. For instance, when evaluated in tf32 precision, Flan-T5-XL on K=2 gets 0.253 (25.3) Spearman r with short instructions and 0.263 (26.3) Spearman r with long instructions.

We evaluated Flan-T5-Large on the validation split (not test) and found that it gets 0.127 (12.7) Spearman r for K=2/SHORT and 0.111 (11.1) Spearman r for K=2/LONG, so your test results are a little higher but seem to be in the same ballpark. Just remember that the numbers returned by the evaluation are the raw correlation coefficients; we multiply them by 100 in the paper to improve readability.
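To make the scale difference concrete, here is a minimal sketch of how a raw Spearman coefficient relates to the paper-style number. It is not the repository's evaluation code: the helper names (`average_ranks`, `spearman`, `as_paper_score`) and the toy scores are made up for illustration, and Spearman is computed by hand as the Pearson correlation of average ranks.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors.

    Assumes neither input is constant (otherwise the denominator is zero).
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def as_paper_score(raw):
    """Scale a raw coefficient in [-1, 1] by 100, as done for readability."""
    return round(100 * raw, 1)


if __name__ == "__main__":
    # Hypothetical gold labels and model predictions, for illustration only.
    toy_labels = [1, 2, 3, 4, 5]
    toy_preds = [1.2, 2.9, 2.1, 4.4, 4.0]
    raw = spearman(toy_labels, toy_preds)
    print("raw:", raw, "paper-style:", as_paper_score(raw))
```

Under this convention, a raw value of 0.13 corresponds to 13.0 on the paper's scale, so it should be compared against the other 0-100 figures, not read as 0.13 out of 100.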

baochi0212 commented 11 months ago

Thanks for your reply, @carlosejimenez. While doing an ablation study on the validation set, I found some examples labeled 5 that I don't understand. For example:

Sentence 1: Young woman in orange dress about to serve in tennis game, on blue court with green sides.
Sentence 2: A girl playing tennis wears a gray uniform and holds her black racket behind her.
Condition: The color of the dress.
Label: 5

Can you explain? Thanks.

baochi0212 commented 11 months ago

I also see many other examples with conditions like this one. Is it due to annotation quality?

carlosejimenez commented 10 months ago

As with any crowd-worker-derived dataset, there will be some noise and subjectivity. I suspect these cases are simply a result of the annotation process.

baochi0212 commented 10 months ago

Is the test set 100% correct?

baochi0212 commented 10 months ago

OK, seems like your dataset is not legit 😂