mlfoundations / datacomp

DataComp: In search of the next generation of multimodal datasets
http://datacomp.ai/

FMoW dataset and results variance #61

Closed teasgen closed 1 year ago

teasgen commented 1 year ago

Hi, I'm using the DataComp evaluation suite, and the FMoW dataset seems to dramatically increase the variance of the results. Its main metric is worst-region accuracy. There are 5 regions: 4 of them have more than 700 samples each, but one has only 4 images. That means the prediction on a single image can move the FMoW metric from 0 to 0.25, which in turn shifts the average over the 38 evaluation tasks by 0.25/38 ≈ 0.0066. For instance, an average accuracy of 70.0 versus 69.4 may come down to the answer on one picture!

Since the dataset itself can't be improved, I suggest simply excluding this region when computing the metric.
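To make the arithmetic concrete, here is a minimal sketch of the effect. It is not the DataComp evaluation code: the function `worst_region_accuracy`, the `min_samples` threshold, the region names, and all per-region counts are hypothetical, chosen only to mirror the sizes described above (four large regions, one with 4 images).

```python
def worst_region_accuracy(counts, min_samples=0):
    """Return the lowest per-region accuracy.

    counts maps a region name to (num_correct, num_total).
    Regions with fewer than min_samples examples are skipped
    (hypothetical option illustrating the suggestion above).
    """
    accuracies = [
        correct / total
        for correct, total in counts.values()
        if total > 0 and total >= min_samples
    ]
    return min(accuracies)


# Hypothetical counts: four large regions and one region with only 4 images.
large_regions = {
    "region_a": (350, 1000),
    "region_b": (320, 800),
    "region_c": (300, 750),
    "region_d": (280, 700),
}
all_wrong = {**large_regions, "region_e": (0, 4)}
one_right = {**large_regions, "region_e": (1, 4)}

# One image flips the metric between 0 and 0.25.
metric_miss = worst_region_accuracy(all_wrong)
metric_hit = worst_region_accuracy(one_right)

# Resulting shift in an unweighted average over 38 tasks.
average_shift = (metric_hit - metric_miss) / 38

# Skipping tiny regions makes the metric insensitive to that single image.
filtered_miss = worst_region_accuracy(all_wrong, min_samples=100)
filtered_hit = worst_region_accuracy(one_right, min_samples=100)

print(metric_miss, metric_hit, round(average_shift, 4))
print(filtered_miss, filtered_hit)
```

With a threshold such as `min_samples=100`, both scenarios give the same worst-region value, since only the four large regions contribute.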

gabrielilharco commented 1 year ago

Hi @teasgen, thanks for the comment. I agree that some datasets in the evaluation suite are a bit noisy. We have some analysis on this in Appendix N, section "Clean subset". We didn't find substantial differences in trends when using a cleaner subset of the datasets, but we did not exclude FMoW in that analysis. This may be something we want to revisit; I'm tagging others here to see if they have thoughts. @sagadre @afang-story @yaircarmon @Vaishaal @ludwigschmidt