pacozaa opened 2 months ago

Hi TransLLM owner, do you have benchmark data where the expected output (reference answers) is provided?

Cheers!

Hi, following the LLM-as-a-judge approach, we conduct human evaluation and LLM evaluation without references in most cases; the details can be found in Section 4.1 and Appendix A.4 of our paper. We did not manually construct reference answers for our Thai benchmarks (MT-Bench and Alpaca-Eval). For MT-Bench and Alpaca-Eval in English, the original authors regard the GPT-4 outputs as references. Since our Thai questions are parallel to the English ones, one could simply translate these GPT-4 English answers into Thai to use as references.
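For anyone who wants Thai references anyway, here is a minimal sketch of the translation route the reply describes: translating GPT-4's English MT-Bench/Alpaca-Eval reference answers into Thai. This is not part of the TransLLM release — the file names (`gpt4_english_references.jsonl`), the JSON fields (`reference_en`/`reference_th`), the `translate_to_thai` helper, and the choice of GPT-4 (via the OpenAI API) as the translator are all assumptions for illustration:

```python
# Hypothetical sketch: build Thai reference answers by translating
# GPT-4's English references. Assumes openai>=1.0 and OPENAI_API_KEY set.
import json
from openai import OpenAI

client = OpenAI()

def translate_to_thai(english_answer: str) -> str:
    """Translate one English reference answer into Thai with GPT-4."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Translate the user's text into Thai. "
                        "Preserve meaning, formatting, and any code blocks."},
            {"role": "user", "content": english_answer},
        ],
        temperature=0,  # deterministic output for reproducible references
    )
    return resp.choices[0].message.content

# Hypothetical input: one JSON object per line, each holding the GPT-4
# English answer under "reference_en"; the Thai translation is written
# back under "reference_th".
with open("gpt4_english_references.jsonl") as fin, \
     open("gpt4_thai_references.jsonl", "w") as fout:
    for line in fin:
        item = json.loads(line)
        item["reference_th"] = translate_to_thai(item["reference_en"])
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Note that the quality of such references is bounded by the translation step, which is presumably why the paper relies on reference-free LLM-as-a-judge evaluation in most cases.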