Orion-Zheng closed this issue 9 months ago.
Thanks for your suggestions. On the first problem: we found that most of the evaluated models have very few answers that can't be parsed (typically less than 1%), and we also slightly adapt the parsing code for different models to keep the parse success rate acceptable, so the impact is limited. Comparing the probabilities of A, B, C, and D at the first predicted token is also a good approach, but I'm not sure which way is more meaningful. On the second problem: since the model only needs to predict an option (rather than generate a long response), the prompt template should have only a small impact. Maybe more detailed comparisons are needed. What do you think?
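For context, a minimal sketch of what per-model parsing rules could look like; the patterns, model name, and helper below are illustrative, not the actual SafetyBench code:

```python
import re
from typing import Optional

# Illustrative sketch only (not the actual parsing code): a default set of
# extraction patterns plus optional per-model overrides, so the rule can be
# adjusted for each model's answer style.
DEFAULT_PATTERNS = [
    r"answer is\s*\(?([ABCD])\)?",   # e.g. "The answer is (B)"
    r"^\s*\(?([ABCD])\)?[.):\s]",    # e.g. "B. ..." or "(C) ..."
]
MODEL_PATTERNS = {
    # hypothetical override for a model that tends to answer "Option C"
    "some-chat-model": [r"option\s*\(?([ABCD])\)?"],
}

def extract_choice(response: str, model_name: str) -> Optional[str]:
    """Return 'A'-'D' if a pattern matches, otherwise None (never a random guess)."""
    for pattern in MODEL_PATTERNS.get(model_name, []) + DEFAULT_PATTERNS:
        match = re.search(pattern, response, flags=re.IGNORECASE | re.MULTILINE)
        if match:
            return match.group(1).upper()
    return None  # count as a parse failure instead of guessing
```

Returning `None` on a failed parse (instead of a random option) also makes it easy to report the per-model parse failure rate alongside the accuracy.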
The code for extracting models' choices is unreliable. In the function `process_medium_results`, the rules for extracting a model's response are too weak. Besides, when it fails to extract an answer, the code randomly picks an option as the model's choice! This doesn't make sense: it makes the evaluation result non-deterministic, and even a model that knows nothing can still guess some options correctly. Possible solution: don't extract the answer at the character level; instead, compare the probabilities of A, B, C, and D at the first predicted token.
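A rough sketch of the proposed first-token approach, assuming a Hugging Face `transformers` causal LM; the model name is a placeholder and this is not part of the current evaluation code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch of the proposed fix, not the repository's current code.
model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

OPTIONS = ["A", "B", "C", "D"]
# Token ids of the bare option letters; some tokenizers split them,
# so we keep the last sub-token as an approximation.
option_ids = [tokenizer.encode(o, add_special_tokens=False)[-1] for o in OPTIONS]

@torch.no_grad()
def predict_option(prompt: str) -> str:
    """Return the option whose letter gets the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    best = torch.argmax(next_token_logits[option_ids]).item()
    return OPTIONS[best]
```

This makes the result deterministic and removes the need for answer-extraction rules, at the cost of assuming the option letters map cleanly to single tokens.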
The prompt template is not flexible. Different models are supervised fine-tuned on different prompt templates. If the evaluation prompt template is not consistent with what a model was trained on, its performance will be underestimated. The evaluation code doesn't provide a way to accommodate different prompt templates, which could be improved.
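One possible way to support this, sketched below with illustrative names that are not part of the current SafetyBench code, is to let the evaluation script select a per-model template, or fall back to the tokenizer's built-in chat template:

```python
# Illustrative sketch only: these template names and the helper are not part
# of the current evaluation code.
PROMPT_TEMPLATES = {
    "default": "Question: {question}\nOptions:\n{options}\nAnswer:",
    "llama-2-chat": "[INST] {question}\n{options}\nAnswer with A, B, C or D. [/INST]",
}

def build_prompt(question: str, options: str, template: str = "default") -> str:
    """Format a question with the template the target model was fine-tuned on."""
    return PROMPT_TEMPLATES[template].format(question=question, options=options)

# For chat models whose tokenizer ships a chat template (recent transformers
# versions), the tokenizer's own template can be applied instead:
# prompt = tokenizer.apply_chat_template(
#     [{"role": "user", "content": user_message}],
#     tokenize=False,
#     add_generation_prompt=True,
# )
```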
I think SafetyBench is meaningful work, but the evaluation process is far from perfect. I am working on fixing these problems. After that, I hope you could re-evaluate the models and update the benchmark. Do you think that's a good idea?