davanstrien opened 7 months ago
Maybe adding independent metrics could help rank the judges? For instance, backtranslation + similarity scores, or standard descriptive metrics if some of them correlate?
@tiptales Yeah, that could be useful for sure (and even some simpler functions to check the language of generations, etc.)
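For example, a simple language check on generations could be something like the sketch below (langdetect is just one option here; a fastText language-ID model would work the same way, and the function name is just illustrative):

```python
# Illustrative only: flag generations that are not in the expected target language.
# langdetect is one option; a fastText LID model would work the same way.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic


def off_language_generations(generations: list[str], expected_lang: str) -> list[int]:
    """Return indices of generations not detected as `expected_lang` (e.g. "nl", "ar")."""
    flagged = []
    for i, text in enumerate(generations):
        try:
            if detect(text) != expected_lang:
                flagged.append(i)
        except Exception:  # very short or empty texts can fail detection
            flagged.append(i)
    return flagged
```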
Hi @davanstrien, this looks like a really cool continuation of MPEP, especially for languages that already have a fair number of models available.
I suppose there are a handful of deliverables to generalise the process:
Agree with your open questions but also agree that a language-specific / team-specific approach might be best.
Bit of a brain dump, so I apologise if my thoughts came out a bit unorganised. Would be interested to hear if you think I've understood your brief meaningfully.
I'd be happy to start looking at putting together a simple engine that can pull n prompts from a DIBT-language dataset and bounce them off a user-configurable endpoint.
@kghamilton89, apologies for the delay in getting back.
> I'd be happy to start looking at putting together a simple engine that can pull n prompts from a DIBT-language dataset and bounce them off a user-configurable endpoint.
This would be very cool! I think even being able to compare a couple of model candidates in this way would be quite valuable.
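Something roughly like this is what I imagine (the dataset id, prompt column, and model are placeholders, so treat it as a sketch rather than a spec):

```python
# Sketch of the "simple engine": pull n prompts from a DIBT-language dataset and
# send them to a user-configurable endpoint. The dataset id, prompt column, and
# model here are placeholders and would differ per language/team.
from datasets import load_dataset
from huggingface_hub import InferenceClient


def generate_responses(
    dataset_id: str,          # e.g. a DIBT MPEP dataset for the target language
    prompt_column: str,       # column holding the translated prompt text
    endpoint: str,            # model id or dedicated Inference Endpoint URL
    n: int = 10,
    token: str | None = None,
) -> list[dict]:
    prompts = load_dataset(dataset_id, split="train").select(range(n))[prompt_column]
    client = InferenceClient(model=endpoint, token=token)
    return [
        {"prompt": p, "response": client.text_generation(p, max_new_tokens=512)}
        for p in prompts
    ]
```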
Hey @davanstrien,
The folks working on FineWeb, or more specifically on FineWeb-Edu, shared a very interesting insight from trying the "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" method for determining the educational value of a text.
They found that Llama-3-70b-Instruct alone, with an additive-scale evaluation prompt (instead of rating the text "in one go" on, e.g., a 1-5 scale, the text is awarded 1 point for each observed criterion), is more reliable at their task than juries made up of several LLMs. Sure, MPEP seems like a much harder task since we're evaluating open-ended answers, but I found this approach interesting.
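For reference, an additive-scale prompt roughly follows this shape (the wording below is illustrative and not the exact FineWeb-Edu prompt):

```python
# Illustrative additive-scale judging prompt (not the exact FineWeb-Edu wording):
# rather than asking for a single 1-5 rating in one go, the judge awards one
# point per satisfied criterion and the points are summed into the final score.
ADDITIVE_JUDGE_PROMPT = """\
Evaluate the response below using an additive 5-point scoring system.
Points are accumulated based on the satisfaction of each criterion:

- Add 1 point if the response is written in the same language as the prompt.
- Add 1 point if the response addresses what the prompt actually asks for.
- Add 1 point if the response is factually accurate as far as you can tell.
- Add 1 point if the response is well structured and easy to follow.
- Add 1 point if the response is complete and needs no follow-up.

Prompt: {prompt}
Response: {response}

Briefly justify your total, then conclude with the line "Score: <total points>".
"""
```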
Evaluating (open) LLMs for more languages using translated prompts
As part of MPEP, we are now at the point where some translation efforts have successfully translated 500 highly ranked prompts into a new target language. We can do other things with these translated prompts, but our immediate next step is to use them to evaluate the performance of LLMs for a particular language.
Does LLM as a judge work outside of English?
Many people use LLMs as judges to evaluate the performance of open LLMs without requiring human input. https://github.com/tatsu-lab/alpaca_eval is a well-known example of this approach, and they have demonstrated high agreement with human rankings. However, most of these approaches are targeted towards evaluating models in English. What happens when we want to evaluate non-English models?
Ideally, we would still be able to leverage LLMs to judge models for non-English languages, since this significantly lowers the barrier to evaluating models (although it doesn't remove this barrier altogether).
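To make that concrete, the judging step could look roughly like the sketch below (the judge model, prompt wording, and answer parsing are all placeholders rather than a settled design):

```python
# Rough sketch of a pairwise LLM-as-a-judge call: the judge sees one translated
# prompt plus two model responses and says which one is better. The judge model,
# prompt wording, and answer parsing are placeholders.
from huggingface_hub import InferenceClient

JUDGE_TEMPLATE = """\
You are comparing two assistant responses to the same prompt.
Answer with a single letter: "A" if response A is better, "B" if response B is better.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Better response:"""


def judge_pair(
    prompt: str,
    response_a: str,
    response_b: str,
    judge_model: str = "meta-llama/Meta-Llama-3-70B-Instruct",  # placeholder judge
) -> str:
    client = InferenceClient(model=judge_model)
    verdict = client.text_generation(
        JUDGE_TEMPLATE.format(prompt=prompt, response_a=response_a, response_b=response_b),
        max_new_tokens=4,
    )
    return "A" if verdict.strip().upper().startswith("A") else "B"
```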
What we want to know is:
A possible approach
For each language with 500 translated prompts, we roughly want to do the following:
We can then do the following:
Open questions?
Other ideas
Could an approach like "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models" work with the same state-of-the-art models for a particular language? i.e., choose 4 of the best open LLMs for Arabic and use those as the pool of raters rather than relying on one powerful judge for Arabic?
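A rough sketch of how that pooled-rater setup might look (the model ids and the 1-5 rating prompt are placeholders):

```python
# Sketch of a "panel of juries" rating: several strong open LLMs for the target
# language each score a response independently, and their scores are averaged.
# The model ids and the 1-5 rating prompt are placeholders.
import re
from statistics import mean

from huggingface_hub import InferenceClient

RATING_PROMPT = """\
Rate the following response to the prompt on a scale of 1 (poor) to 5 (excellent).
Reply with the number only.

Prompt: {prompt}
Response: {response}

Rating:"""

# e.g. four of the strongest open models for the language being evaluated
JURY_MODELS = [
    "CohereForAI/c4ai-command-r-plus",
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "Qwen/Qwen2-72B-Instruct",
    "meta-llama/Meta-Llama-3-70B-Instruct",
]


def jury_score(prompt: str, response: str) -> float:
    """Average the 1-5 ratings from each jury model; unparseable replies are skipped."""
    scores = []
    for model in JURY_MODELS:
        reply = InferenceClient(model=model).text_generation(
            RATING_PROMPT.format(prompt=prompt, response=response), max_new_tokens=4
        )
        match = re.search(r"[1-5]", reply)
        if match:
            scores.append(int(match.group()))
    return mean(scores) if scores else float("nan")
```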