huggingface / evaluate

🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
https://huggingface.co/docs/evaluate
Apache License 2.0

🌟 [Metric Request] WOOD score #106

Open astariul opened 4 years ago

astariul commented 4 years ago

WOOD score paper: https://arxiv.org/pdf/2007.06898.pdf

Abstract:

Models that surpass human performance on several popular benchmarks display significant degradation in performance on exposure to Out of Distribution (OOD) data. Recent research has shown that models overfit to spurious biases and ‘hack’ datasets, in lieu of learning generalizable features like humans. In order to stop the inflation in model performance – and thus overestimation in AI systems’ capabilities – we propose a simple and novel evaluation metric, WOOD Score, that encourages generalization during evaluation.
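The abstract does not state the formula, so the following is only a toy sketch of the general idea of an accuracy that rewards correct answers on out-of-distribution examples more heavily. The function name `weighted_accuracy`, the `ood_weights` argument, and the weighting scheme itself are assumptions made for illustration, not the paper's definition of WOOD Score.

```python
from typing import Sequence


def weighted_accuracy(
    predictions: Sequence[int],
    references: Sequence[int],
    ood_weights: Sequence[float],
) -> float:
    """Accuracy in which each example counts in proportion to its weight.

    `ood_weights` is a hypothetical per-example weight (larger means the
    example is considered further out of distribution). How such weights
    would be derived is not specified here.
    """
    if not (len(predictions) == len(references) == len(ood_weights)):
        raise ValueError("all inputs must have the same length")
    if any(w < 0 for w in ood_weights):
        raise ValueError("weights must be non-negative")

    total = sum(ood_weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")

    # Sum the weights of the correctly predicted examples only.
    correct = sum(
        w for p, r, w in zip(predictions, references, ood_weights) if p == r
    )
    return correct / total


# With uniform weights this reduces to plain accuracy.
plain = weighted_accuracy([1, 0, 1], [1, 1, 1], [1.0, 1.0, 1.0])

# Up-weighting the third (correct) example raises the score.
weighted = weighted_accuracy([1, 0, 1], [1, 1, 1], [1.0, 1.0, 2.0])
```

In this sketch a model that is only right on in-distribution examples scores lower than its plain accuracy once those examples are given small weights.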

kasmith11 commented 2 years ago

Is this being worked on? If not, I'd like to try! I can do this by following the directions outlined here, correct?

lvwerra commented 2 years ago

Hi @kasmith11, I don't think anybody is working on it right now. Following the guide will create a community metric (i.e. one you can load with load("kasmith/wood")). To make it an official metric maintained in evaluate, we can simply move the code into metrics/ afterwards, so it's a good start and you can test it without needing to merge a PR :)

sezan92 commented 2 years ago

I would also like to work on this one. [new guy here]

kasmith11 commented 2 years ago

I'm very open to collaboration! If you're interested, we can work together on this @sezan92. Would that change anything you outlined above @lvwerra?

lvwerra commented 2 years ago

Sure, if you'd like to collaborate, this would be a good issue for it :) For communication you could join our Discord: https://huggingface.co/join/discord

sezan92 commented 2 years ago

@kasmith11 Sorry for the late reply. Sure. How would you like to begin?

kasmith11 commented 2 years ago

Hi @sezan92, I took an initial pass at implementing WOOD score here after reading the paper. I haven't gotten a chance to test the implementation or fill out any of the documentation.

I think testing/debugging and documentation are the next steps.

Are you in the huggingface discord linked above? I think that would be a great place for us to communicate via chat going forward.

sezan92 commented 2 years ago

@kasmith11 Yes, I just joined. My username is sezan92.

kasmith11 commented 2 years ago

Fantastic @sezan92. I'll reach out to you via discord soon.

kasmith11 commented 1 year ago

I have a repository with an implementation of WoodScore here. I've had more time to dedicate to this, if you are still interested @sezan92.