Orion-Zheng closed this issue 9 months ago.
Thanks for your suggestions. On the first problem: we found that most of the evaluated models have very few answers that can't be parsed (typically less than 1%), and we also slightly adapt the parsing code for different models to keep the parse success rate acceptable, so the impact is limited. Comparing the probabilities of A, B, C, and D at the first predicted token is also a good approach, but I'm not sure which way is more meaningful. On the second problem: since the model only needs to predict an option (rather than generate a long response), the prompt template should have only a small impact. Maybe more detailed comparisons are needed. What do you think?
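For context, a minimal sketch of what per-model parsing rules could look like; the patterns, model name, and helper below are illustrative, not the actual SafetyBench code:

```python
import re
from typing import Optional

# Illustrative sketch only (not the actual parsing code): a default set of
# extraction patterns plus optional per-model overrides, so the rule can be
# adjusted for each model's answer style.
DEFAULT_PATTERNS = [
    r"answer is\s*\(?([ABCD])\)?",   # e.g. "The answer is (B)"
    r"^\s*\(?([ABCD])\)?[.):\s]",    # e.g. "B. ..." or "(C) ..."
]
MODEL_PATTERNS = {
    # hypothetical override for a model that tends to answer "Option C"
    "some-chat-model": [r"option\s*\(?([ABCD])\)?"],
}

def extract_choice(response: str, model_name: str) -> Optional[str]:
    """Return 'A'-'D' if a pattern matches, otherwise None (never a random guess)."""
    for pattern in MODEL_PATTERNS.get(model_name, []) + DEFAULT_PATTERNS:
        match = re.search(pattern, response, flags=re.IGNORECASE | re.MULTILINE)
        if match:
            return match.group(1).upper()
    return None  # count as a parse failure instead of guessing
```

Returning `None` on a failed parse (instead of a random option) also makes it easy to report the per-model parse failure rate alongside the accuracy.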
The code for extracting models' choices is unreliable. In the function `process_medium_results`, the rules for extracting a model's response are too weak. Besides, when it fails to extract an answer, the code randomly picks an option as the model's choice! This doesn't make sense: it makes the evaluation result non-deterministic, and even a model that knows nothing can still guess some options correctly. Possible solution: don't extract the answer at the character level; instead, compare the probabilities of A, B, C, and D at the first predicted token.
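A rough sketch of the proposed first-token approach, assuming a Hugging Face `transformers` causal LM; the model name is a placeholder and this is not part of the current evaluation code:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Rough sketch of the proposed fix, not the repository's current code.
model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

OPTIONS = ["A", "B", "C", "D"]
# Token ids of the bare option letters; some tokenizers split them,
# so we keep the last sub-token as an approximation.
option_ids = [tokenizer.encode(o, add_special_tokens=False)[-1] for o in OPTIONS]

@torch.no_grad()
def predict_option(prompt: str) -> str:
    """Return the option whose letter gets the highest next-token logit."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    next_token_logits = model(**inputs).logits[0, -1]
    best = torch.argmax(next_token_logits[option_ids]).item()
    return OPTIONS[best]
```

This makes the result deterministic and removes the need for answer-extraction rules, at the cost of assuming the option letters map cleanly to single tokens.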
The prompt template is not flexible. Different models are supervised fine-tuned on different prompt templates. If the evaluation prompt template is not consistent with what a model was trained on, its performance will be underestimated. The evaluation code doesn't provide a way to accommodate different prompt templates, which could be improved.
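One possible way to support this, sketched below with illustrative names that are not part of the current SafetyBench code, is to let the evaluation script select a per-model template, or fall back to the tokenizer's built-in chat template:

```python
# Illustrative sketch only: these template names and the helper are not part
# of the current evaluation code.
PROMPT_TEMPLATES = {
    "default": "Question: {question}\nOptions:\n{options}\nAnswer:",
    "llama-2-chat": "[INST] {question}\n{options}\nAnswer with A, B, C or D. [/INST]",
}

def build_prompt(question: str, options: str, template: str = "default") -> str:
    """Format a question with the template the target model was fine-tuned on."""
    return PROMPT_TEMPLATES[template].format(question=question, options=options)

# For chat models whose tokenizer ships a chat template (recent transformers
# versions), the tokenizer's own template can be applied instead:
# prompt = tokenizer.apply_chat_template(
#     [{"role": "user", "content": user_message}],
#     tokenize=False,
#     add_generation_prompt=True,
# )
```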
I think SafetyBench is meaningful work, but the evaluation process is far from perfect. I am working on fixing these problems. After that, I hope you could re-evaluate the models and update the benchmark. Do you think that's a good idea?