jerbarnes / semeval22_structured_sentiment

SemEval-2022 Shared Task 10: Structured Sentiment Analysis
75 stars 42 forks

Gaps between local evaluation and online evaluation #16

Closed Stareru closed 2 years ago

Stareru commented 2 years ago

Hello, I have evaluated my predictions locally with "./evaluation/evaluates.py" and "dev.json" (the latest version) as the gold file before submitting them to Codalab. However, the results from online evaluation are much lower than my local results (especially for norec, a 9.4 drop in SF1). Am I using the wrong gold file? Thank you!

Stareru commented 2 years ago

I submitted "dev.json" itself to Codalab and this resulted in SF1=1.00. Might there be some differences between the local script "evaluate.py" and the evaluation script online on Codalab?

jerbarnes commented 2 years ago

Hi,

You're correct that there was a discrepancy between the evaluation script in the repo and the one on codalab. I've corrected this now. Let me know if you still see the difference.

congchan commented 2 years ago

Hi, we just used the same submitted file for testing; the online score is about 9% lower than the local one.

Stareru commented 2 years ago

Sorry,

I have just submitted my results again, but it seems that the gap still exists.

janpf commented 2 years ago

I submitted on Friday and I noticed a 4% decline in performance online. I'm voting to reopen this issue ;)

Thanks! 🙋🏻‍♂️

jerbarnes commented 2 years ago

Thanks to @akutuzov, I finally figured out what was going on. Codalab runs the evaluation scripts with Python 2.7.13, which uses floor division by default :/ So several functions were returning incorrect values without any error messages. I've added `from __future__ import division`, and it should be solved now. Thanks again for pointing this problem out, and sorry for the inconvenience.
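For anyone curious, here is a minimal sketch of the bug class (not the actual evaluation code): in Python 2 without the future import, `/` on two ints is floor division, so a precision computed from integer counts silently becomes 0. The function names and counts below are made up for illustration.

```python
def precision_py2_style(tp, fp):
    # Python 2 behavior without the future import: "/" on two ints
    # floors the result. Emulated here with "//" so it runs on Python 3.
    return tp // (tp + fp)

def precision_fixed(tp, fp):
    # With `from __future__ import division` (or on Python 3),
    # "/" is true division and returns a float.
    return tp / (tp + fp)

print(precision_py2_style(7, 3))  # 0  -- score silently zeroed out
print(precision_fixed(7, 3))      # 0.7
```

Since no exception is raised, the only symptom is scores that are lower than expected, which matches the gaps reported above.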

jerbarnes commented 2 years ago

I'll close the issue for now, but if you see any further problems, feel free to reopen.