hy5468 / TransLLM

Why Not Transform Chat Large Language Models to Non-English?
Apache License 2.0

How do you evaluate the model you train? So far I only see translated instructions, not expected outputs #1

Open pacozaa opened 3 weeks ago

pacozaa commented 3 weeks ago

Hi TransLLM owner, do you have benchmark data where the expected output is provided?

Cheers!

hy5468 commented 2 weeks ago

Hi, following the LLM-as-a-judge approach, we conduct human evaluation and LLM evaluation without references in most cases; the details can be found in Section 4.1 and Appendix A.4 of our paper. We did not manually construct reference answers for our Thai benchmarks (MT-Bench and Alpaca-Eval). For MT-Bench and Alpaca-Eval in English, the authors regard the GPT-4 outputs as references. Since our Thai questions are parallel to the English ones, we could simply translate those GPT-4 English answers into Thai to serve as references.
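For anyone wondering what reference-free LLM-as-a-judge scoring looks like in practice, here is a minimal sketch. The prompt wording and the `[[score]]` rating format are assumptions loosely modeled on MT-Bench's single-answer grading convention, not the exact prompt from the TransLLM paper:

```python
import re

# Hypothetical judge prompt template (not the paper's exact wording):
# the judge model rates an answer directly, with no reference answer.
JUDGE_TEMPLATE = (
    "Please act as an impartial judge and rate the quality of the "
    "assistant's answer to the question below on a scale of 1 to 10.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Reply with your rating in the form: Rating: [[score]]"
)

def build_judge_prompt(question: str, answer: str) -> str:
    """Fill the template for one question/answer pair."""
    return JUDGE_TEMPLATE.format(question=question, answer=answer)

def parse_rating(judge_reply: str):
    """Extract an MT-Bench-style double-bracketed rating, e.g. [[7]]."""
    match = re.search(r"\[\[(\d+)\]\]", judge_reply)
    return int(match.group(1)) if match else None
```

The prompt would be sent to a strong judge model (e.g. GPT-4), and `parse_rating` would pull the score out of its reply; averaging scores over a benchmark gives a reference-free quality estimate.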