pacozaa opened 2 months ago

Hi TransLLM owner, do you have benchmark data where the expected output (reference answers) is provided?

Cheers!

Hi, following the LLM-as-a-judge approach, we conduct human evaluation and LLM evaluation without references in most cases; the details can be found in Section 4.1 and Appendix A.4 of our paper. We did not manually construct reference answers for our Thai benchmarks (MT-Bench and Alpaca-Eval). For MT-Bench and Alpaca-Eval in English, the original authors regard the GPT-4 outputs as references. Since our Thai questions are parallel to the English ones, one could simply translate these GPT-4 English answers into Thai to use as references.
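For anyone who wants Thai references anyway, here is a minimal sketch of the translation route the reply describes: translating GPT-4's English MT-Bench/Alpaca-Eval reference answers into Thai. This is not part of the TransLLM release — the file names (`gpt4_english_references.jsonl`), the JSON fields (`reference_en`/`reference_th`), the `translate_to_thai` helper, and the choice of GPT-4 (via the OpenAI API) as the translator are all assumptions for illustration:

```python
# Hypothetical sketch: build Thai reference answers by translating
# GPT-4's English references. Assumes openai>=1.0 and OPENAI_API_KEY set.
import json
from openai import OpenAI

client = OpenAI()

def translate_to_thai(english_answer: str) -> str:
    """Translate one English reference answer into Thai with GPT-4."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "Translate the user's text into Thai. "
                        "Preserve meaning, formatting, and any code blocks."},
            {"role": "user", "content": english_answer},
        ],
        temperature=0,  # deterministic output for reproducible references
    )
    return resp.choices[0].message.content

# Hypothetical input: one JSON object per line, each holding the GPT-4
# English answer under "reference_en"; the Thai translation is written
# back under "reference_th".
with open("gpt4_english_references.jsonl") as fin, \
     open("gpt4_thai_references.jsonl", "w") as fout:
    for line in fin:
        item = json.loads(line)
        item["reference_th"] = translate_to_thai(item["reference_en"])
        fout.write(json.dumps(item, ensure_ascii=False) + "\n")
```

Note that the quality of such references is bounded by the translation step, which is presumably why the paper relies on reference-free LLM-as-a-judge evaluation in most cases.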