davanstrien opened 7 months ago
Maybe adding independent metrics could help rank the judges? For instance, backtranslation + similarity scores, or standard descriptive metrics if some of them correlate?
@tiptales Yeah, that could be useful for sure (and even some simpler functions to check the language of generations, etc.)
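For example, a simple language check on generations could be something like the sketch below (langdetect is just one option here; a fastText language-ID model would work the same way, and the function name is just illustrative):

```python
# Illustrative only: flag generations that are not in the expected target language.
# langdetect is one option; a fastText LID model would work the same way.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic


def off_language_generations(generations: list[str], expected_lang: str) -> list[int]:
    """Return indices of generations not detected as `expected_lang` (e.g. "nl", "ar")."""
    flagged = []
    for i, text in enumerate(generations):
        try:
            if detect(text) != expected_lang:
                flagged.append(i)
        except Exception:  # very short or empty texts can fail detection
            flagged.append(i)
    return flagged
```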
Hi @davanstrien, this looks like a really cool continuation of MPEP, especially for languages that already have a fair number of models available.
I suppose there are a handful of deliverables to generalise the process:
Agree with your open questions but also agree that a language-specific / team-specific approach might be best.
Bit of a brain dump, so I apologise if my thoughts came out a bit unorganised. Would be interested to hear if you think I've understood your brief meaningfully.
I'd be happy to start looking at putting together a simple engine that can pull n prompts from a DIBT-language dataset and bounce them off a user-configurable endpoint.
@kghamilton89, apologies for the delay in getting back.
> I'd be happy to start looking at putting together a simple engine that can pull n prompts from a DIBT-language dataset and bounce them off a user-configurable endpoint.
This would be very cool! I think even being able to compare a couple of model candidates in this way would be quite valuable.
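Something roughly like this is what I imagine (the dataset id, prompt column, and model are placeholders, so treat it as a sketch rather than a spec):

```python
# Sketch of the "simple engine": pull n prompts from a DIBT-language dataset and
# send them to a user-configurable endpoint. The dataset id, prompt column, and
# model here are placeholders and would differ per language/team.
from datasets import load_dataset
from huggingface_hub import InferenceClient


def generate_responses(
    dataset_id: str,          # e.g. a DIBT MPEP dataset for the target language
    prompt_column: str,       # column holding the translated prompt text
    endpoint: str,            # model id or dedicated Inference Endpoint URL
    n: int = 10,
    token: str | None = None,
) -> list[dict]:
    prompts = load_dataset(dataset_id, split="train").select(range(n))[prompt_column]
    client = InferenceClient(model=endpoint, token=token)
    return [
        {"prompt": p, "response": client.text_generation(p, max_new_tokens=512)}
        for p in prompts
    ]
```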
Hey @davanstrien,
The folks working on FineWeb, or more specifically on FineWeb-Edu, shared a very interesting insight from trying the "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" method for determining the educational value of a text.
They found that Llama-3-70b-Instruct alone, with an additive-scale evaluation prompt (instead of rating the text "in one go" on, e.g., a 1-5 scale, the text is awarded 1 point for each observed criterion), is more reliable at their task than juries made up of several LLMs. Sure, MPEP seems like a much harder task since we're evaluating open-ended answers, but I found this approach interesting.
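For reference, an additive-scale prompt roughly follows this shape (the wording below is illustrative and not the exact FineWeb-Edu prompt):

```python
# Illustrative additive-scale judging prompt (not the exact FineWeb-Edu wording):
# rather than asking for a single 1-5 rating in one go, the judge awards one
# point per satisfied criterion and the points are summed into the final score.
ADDITIVE_JUDGE_PROMPT = """\
Evaluate the response below using an additive 5-point scoring system.
Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the response is written in the same language as the prompt.
- Add 1 point if the response addresses what the prompt actually asks for.
- Add 1 point if the response is factually accurate as far as you can tell.
- Add 1 point if the response is well structured and easy to follow.
- Add 1 point if the response is complete and needs no follow-up.

Prompt: {prompt}
Response: {response}

Briefly justify your total, then conclude with the line "Score: <total points>".
"""
```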
Evaluating (open) LLMs for more languages using translated prompts
As part of MPEP, we are now at the point where some translation efforts have successfully translated 500 highly ranked prompts into a new target language. We can do other things with these translated prompts, but our immediate next step is to use them to evaluate the performance of LLMs for a particular language.
Does LLM as a judge work outside of English?
Many people use LLMs as judges to evaluate the performance of open LLMs without requiring human input. https://github.com/tatsu-lab/alpaca_eval is a well-known example of this approach, and they have demonstrated high agreement with human rankings. However, most of these approaches are targeted towards evaluating models in English. What happens when we want to evaluate non-English models?
Ideally, we would still be able to leverage LLMs to judge models for non-English languages, since this significantly lowers the barrier to evaluating models (although it doesn't remove this barrier altogether).
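To make that concrete, the judging step could look roughly like the sketch below (the judge model, prompt wording, and answer parsing are all placeholders rather than a settled design):

```python
# Rough sketch of a pairwise LLM-as-a-judge call: the judge sees one translated
# prompt plus two model responses and says which one is better. The judge model,
# prompt wording, and answer parsing are placeholders.
from huggingface_hub import InferenceClient

JUDGE_TEMPLATE = """\
You are comparing two assistant responses to the same prompt.
Answer with a single letter: "A" if response A is better, "B" if response B is better.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Better response:"""


def judge_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    judge_model: str = "meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder judge
) -> str:
    client = InferenceClient(model=judge_model)
    verdict = client.text_generation(
        JUDGE_TEMPLATE.format(prompt=prompt, response_a=response_a, response_b=response_b),
        max_new_tokens=4,
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"
```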
What we want to know is:
A possible approach
For each language with 500 translated prompts, we roughly want to do the following:
We can then do the following:
Open questions?
Other ideas
Could an approach like "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" work with the same state-of-the-art models for a particular language? i.e., choose 4 of the best open LLMs for Arabic and use those as the pool of raters rather than relying on one powerful judge for Arabic?
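A rough sketch of how that pooled-rater setup might look (the model ids and the 1-5 rating prompt are placeholders):

```python
# Sketch of a "panel of juries" rating: several strong open LLMs for the target
# language each score a response independently, and their scores are averaged.
# The model ids and the 1-5 rating prompt are placeholders.
import re
from statistics import mean

from huggingface_hub import InferenceClient

RATING_PROMPT = """\
Rate the following response to the prompt on a scale of 1 (poor) to 5 (excellent).
Reply with the number only.

Prompt: {prompt}
Response: {response}

Rating:"""

# e.g. four of the strongest open models for the language being evaluated
JURY_MODELS = [
    "CohereForAI/c4ai-command-r-plus",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "Qwen/Qwen2-72B-Instruct",
    "meta-llama/Meta-Llama-3-70B-Instruct",
]


def jury_score(prompt: str, response: str) -> float:
    """Average the 1-5 ratings from each jury model; unparseable replies are skipped."""
    scores = []
    for model in JURY_MODELS:
        reply = InferenceClient(model=model).text_generation(
            RATING_PROMPT.format(prompt=prompt, response=response), max_new_tokens=4
        )
        match = re.search(r"[1-5]", reply)
        if match:
            scores.append(int(match.group()))
    return mean(scores) if scores else float("nan")
```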