Please benchmark the model with MT-Bench-DE

h3ndrik commented 11 months ago

So we can compare it to models like https://huggingface.co/LeoLM/leo-hessianai-13b-chat

I think they mentioned their approach here (in "Evaluation and Results"): https://laion.ai/blog/leo-lm/

jphme commented 11 months ago

> So we can compare it to models like https://huggingface.co/LeoLM/leo-hessianai-13b-chat
>
> I think they mentioned their approach here (in "Evaluation and Results"): https://laion.ai/blog/leo-lm/

Yes, I am in contact with the LeoLM team, and we are talking about creating some kind of unified German benchmark. As I wrote in the README, I also created some German multiple-choice benchmarks (you can find the code here and run them).

Unfortunately, all benchmarks I know of still have major weaknesses: it's difficult to test tonality and language specifics with them, and the results are often misleading and/or need major customization and testing (and for MT-Bench I didn't see code that I could run easily). That's why I didn't include benchmark scores in the release.

Long story short: if you give me some ready-to-run code for a benchmark you find meaningful, I'm happy to run it. Apart from that, I am working on a better German benchmark suite (including an improved MT-Bench-DE version and maybe even a customized eval model), but this will probably take some time, as I am currently focused on a few other things. Happy about any collaboration in that area!

jphme commented 11 months ago

Short update on this: there is some work happening on the benchmark side, and we will probably have significantly better benchmarks (and maybe even some kind of leaderboard) for non-English models soon. Stay tuned. If you have any specific questions (or suggestions for evals) in the meantime, feel free to contact me.

h3ndrik commented 11 months ago

Thanks. Awesome! Yeah, I'm not an expert on this, so it's unlikely I can contribute anything of substance. I'm going to stick around.

jphme / EM_German