huggingface / lighteval

LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.
MIT License

Adding support for Arabic benchmarks : AlGhafa benchmarking suite #95

Closed: alielfilali01 closed this issue 3 months ago

alielfilali01 commented 4 months ago

The AlGhafa benchmarking suite consists of 11 datasets, presented in this paper and hosted in this repo on the Hub.

clefourrier commented 4 months ago

Do you want us to wait for AlGhafa 2 to merge this?

alielfilali01 commented 4 months ago

Do you want us to wait for AlGhafa 2 to merge this?

Yes please @clefourrier, I will take some time before Saturday to add the new version of the benchmark.

clefourrier commented 4 months ago

No hurry, take your time!

alielfilali01 commented 3 months ago

Hello @clefourrier, I believe this PR is ready to be merged.

alielfilali01 commented 3 months ago

LGTM, but you need to homogenize your naming:

  • Prompt names such as boolq_function will be unclear in the long term. For such functions, you could use either boolq_prompt_arabic or just boolq_arabic. (You need to specify the language, since there is already a boolq prompt function by default.)
  • You also need to homogenize Alghafa, which currently appears with several different casings, and fit it to Python-style casing. For the prompt function, I'd keep alghafa_prompt or alghafa; for the class, CustomAlGhafaTask; and for the task name here I'd keep it lowercase: [CustomAlGhafaTask(name=f"alghafa:{subset}", hf_subset=subset) for subset in ALGHAFA_SUBSETS] (a sketch of the resulting layout follows below).
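
For readers following along, here is a minimal sketch of what that homogenized layout could look like. Only the final list comprehension is quoted from the comment above; the subset names, the prompt-function body, and the simplified task class are illustrative assumptions, not lighteval's exact API.

```python
# Minimal sketch of the naming scheme suggested in the review. Only the final
# list comprehension is quoted from the comment; the subset names, prompt
# fields, and the simplified task class are illustrative assumptions.

# Placeholder subset names; the real suite ships 11 subsets (see the paper).
ALGHAFA_SUBSETS = ["subset_a", "subset_b"]


def alghafa_prompt(line: dict) -> dict:
    """Hypothetical prompt function: maps one dataset row to a doc dict with
    a query, candidate choices, and the index of the gold answer."""
    return {
        "query": line["query"],
        "choices": [line["sol1"], line["sol2"], line["sol3"], line["sol4"]],
        "gold_index": int(line["label"]),
    }


class CustomAlGhafaTask:
    """Simplified stand-in for the custom task class discussed in the thread;
    the real class carries more configuration (metrics, splits, and so on)."""

    def __init__(self, name: str, hf_subset: str):
        self.name = name
        self.hf_subset = hf_subset


# Quoted from the review comment: one task per subset, lowercase task names.
TASKS = [
    CustomAlGhafaTask(name=f"alghafa:{subset}", hf_subset=subset)
    for subset in ALGHAFA_SUBSETS
]
```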

Done ✅

alielfilali01 commented 3 months ago

@clefourrier I hope this answers your comments. Please feel free to ping me if I missed anything (I have a tendency to forget 😅). Again, thanks a lot for the efforts 🤗

clefourrier commented 3 months ago

Looks better, thank you! Do you have some reference models and scores against which I could check the implementation? Or did you check it, and against which models? :)

alielfilali01 commented 3 months ago

Looks better, thank you! Do you have some reference models and scores against which I could check the implementation? Or did you check it, and against which models? :)

Yes @clefourrier, I tested gpt2 using --max_samples=1 and everything was fine. I believe Hamza is on it to test on bigger models and push the results to the Hub for further inspection; I'll update you as soon as I hear back from Hamza.
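
For context, a smoke test like the one described would look roughly as follows. Only gpt2 and --max_samples=1 come from the comment above; the entry point, task string, custom-tasks path, and output directory are assumptions based on lighteval's accelerate workflow around that time, not exact values.

```bash
# Hedged sketch of the smoke test described above: run gpt2 on a single
# sample per task. Only gpt2 and --max_samples=1 come from the thread;
# the entry point, task string, and paths are assumptions.
accelerate launch run_evals_accelerate.py \
    --model_args "pretrained=gpt2" \
    --tasks "community|alghafa:subset_a|0|0" \
    --custom_tasks "community_tasks/arabic_evals.py" \
    --max_samples 1 \
    --output_dir "./evals/"
```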

clefourrier commented 3 months ago

Sounds good, feel free to ping me whenever :)

thevexx commented 2 months ago

The AlGhafa eval dataset is no longer available on Hugging Face. Any alternatives?

alielfilali01 commented 2 months ago

The AlGhafa eval dataset is no longer available on Hugging Face. Any alternatives?

Hi there, can you please provide more context? I have checked the eval code and it seems to work fine.
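
For anyone who hits a similar availability error, one quick check is to try loading a split of the dataset directly from the Hub; if the repo is private or gone, this raises an error immediately. The repo id and subset below are placeholders, since the thread does not spell out the exact OALL dataset path.

```python
# Quick reachability check for a Hub-hosted eval dataset. The repo id and
# subset name are placeholders; substitute the actual OALL dataset path.
from datasets import load_dataset

ds = load_dataset("OALL/some-alghafa-dataset", "subset_a", split="test")
print(f"loaded {len(ds)} examples")
```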

thevexx commented 2 months ago

Hi there, can you please provide more context? I have checked the eval code and it seems to work fine.

Hi, yesterday the datasets disappeared from the OALL Hugging Face account; now I can see them, thanks.

alielfilali01 commented 2 months ago

Hi there, can you please provide more context? I have checked the eval code and it seems to work fine.

Hi, yesterday the datasets disappeared from the OALL Hugging Face account; now I can see them, thanks.

Ooh, I see! I had to make the datasets private for about 20 minutes yesterday because I was testing something. What a coincidence that you checked at exactly that time 😅 Sorry for the inconvenience 🤗