Open Luoyang144 opened 8 months ago
Hi, thanks for asking! Details about evaluation can be found in https://aclanthology.org/2023.findings-emnlp.242.pdf
Thanks for the reply. In Section 3.1, the title is "Automatic Evaluation with GPT-4", but I didn't see the evaluation details. Did you evaluate the results on all of the test data? That would require a significant amount of time (and money).
I see, GPT-4 is only used for the evaluation on Question-80.
More details can be found here, but I believe they have upgraded many things since I used it.
Thanks for the reply. I wonder whether EM (exact match) may cause misjudgments during evaluation? This seems unavoidable.
Hello, I am curious about how SAIL was evaluated. Was it evaluated using GPT-4? Was all of the benchmark data used for evaluation?
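To make the EM concern concrete, here is a minimal sketch of how a strict exact-match check can reject a correct answer. It assumes SQuAD-style normalization (lowercasing, stripping punctuation and English articles); the helper names `normalize` and `exact_match` are hypothetical and this is not necessarily what the SAIL evaluation script does.

```python
import re
import string
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Sequence[str]) -> bool:
    """True only if the normalized prediction equals some normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


# Surface differences such as case, articles, and punctuation are tolerated.
print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))

# A correct but verbose answer is still scored as wrong (a false negative).
print(exact_match("It is in Paris, France.", ["Paris"]))
```

Normalization removes some surface mismatches, but any answer phrased as a full sentence still fails, which is the kind of misjudgment asked about above.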