Open kkwhale7 opened 1 year ago
@tonysy @lvhan028 @so2liu @cdpath I need your help!!
@Leymore
You haven't implemented the evaluation logic for subjective questions, so why are the values displayed on the official website different from ours?
We only include the objective questions of Gaokao in OpenCompass
> We only include the objective questions of Gaokao in OpenCompass

But we can't reproduce the 18.9 GAOKAO score shown on your website!
Calculating only the objective questions, I get a score of 15.13.
The MCQ problems select one answer from 4 choices. The result is meaningless when it is below 25%.
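For reference, a uniform random guesser on 4-option MCQs scores about 25% in expectation, so anything at or below that level carries no signal. A quick simulation (illustrative only):

```python
import random

# Simulate uniform random guessing on 4-option multiple-choice questions.
# Expected accuracy is 1/4 = 25%, so scores below that carry no signal.
def random_guess_accuracy(num_questions: int = 10_000, seed: int = 0) -> float:
    rng = random.Random(seed)
    choices = "ABCD"
    hits = sum(
        rng.choice(choices) == rng.choice(choices)  # guess vs. hidden answer
        for _ in range(num_questions)
    )
    return hits / num_questions

print(f"random-guess accuracy ~= {random_guess_accuracy():.3f}")  # ~0.25
```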
> The MCQ problems select one answer from 4 choices. The result is meaningless when it is below 25%.

I see. So do you directly ignore the scores of the multiple-choice questions, or only count the parts greater than 25%?
I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, somehow related to ZeroRetriever. All results below are with llama2-7b models.
We will review this problem, more information and logs will be provided later.
Detailed scores can be found here: https://opencompass.org.cn/dataset-detail/GAOKAO-Bench
The average score is weighted by the total score of each individual subject. We do NOT ignore scores below 25.0!
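As a sketch of what that weighting means (the subject names and numbers here are made up; the real per-subject totals come from the GAOKAO-Bench configs):

```python
# Hypothetical per-subject results: (raw score obtained, total marks available).
# These numbers are illustrative only, not real leaderboard data.
subject_results = {
    "Math_MCQs": (12.0, 60),
    "History_MCQs": (24.0, 100),
    "English_MCQs": (30.0, 100),
}

raw = sum(score for score, _ in subject_results.values())
total = sum(marks for _, marks in subject_results.values())

# Weighting by each subject's total marks is the same as dividing the summed
# raw scores by the summed totals; no subject is dropped, even below 25%.
weighted_avg = 100 * raw / total
print(f"weighted average: {weighted_avg:.2f}")  # 25.38 for these made-up numbers
```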
As for llama-2-7b's failure to follow the instruction, we think this is entirely understandable. The postprocessing is implemented here: https://github.com/open-compass/opencompass/blob/main/opencompass/datasets/GaokaoBench.py . The final result depends on the output of this postprocessing.
Thank you for your patience; I understand your score calculation method now. But why are the two predictions different when I use the same config? This postprocess method only takes the first A/B/C/D character found when scanning the prediction from back to front, yet the predictions are still inconsistent with the other test: https://github.com/open-compass/opencompass/issues/480#issuecomment-1765588411
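In other words, the behaviour is roughly equivalent to the sketch below (a simplified approximation for illustration, not the actual GaokaoBench.py code):

```python
import re

def extract_last_choice(prediction: str) -> str:
    """Keep the last A/B/C/D character in the model output, or '' if none.

    This only approximates the described back-to-front extraction; the real
    postprocess lives in opencompass/datasets/GaokaoBench.py.
    """
    letters = re.findall(r"[ABCD]", prediction)
    return letters[-1] if letters else ""

print(extract_last_choice("After some reasoning, the answer is C"))  # -> "C"
print(extract_last_choice("I cannot decide on this question."))      # -> ""
```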
> I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, somehow related to ZeroRetriever. All results below are with llama2-7b models.

So why are the two prediction results inconsistent when using the gen approach?
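For debugging, something like the following can locate where two runs with the same config diverge. The paths and JSON layout are assumptions about the outputs/ directory structure; adjust them to your own prediction files:

```python
import json

# Assumed paths to per-dataset prediction dumps from two runs of the same config.
RUN_A = "outputs/run_a/predictions/llama-2-7b-hf/GaokaoBench_2010-2022_History_MCQs.json"
RUN_B = "outputs/run_b/predictions/llama-2-7b-hf/GaokaoBench_2010-2022_History_MCQs.json"

with open(RUN_A) as fa, open(RUN_B) as fb:
    preds_a, preds_b = json.load(fa), json.load(fb)

# Print every shared sample whose prediction text differs between the two runs.
for key in sorted(set(preds_a) & set(preds_b)):
    pa, pb = preds_a[key].get("prediction"), preds_b[key].get("prediction")
    if pa != pb:
        print(f"sample {key} differs:")
        print("  run A:", pa)
        print("  run B:", pb)
```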
@kkwhale7 Hey, does that still exist?
> @kkwhale7 Hey, does that still exist?

Yes. When I calculated the average score over only the subjects you specified, Baichuan2-7B-Base scored only 17.33 on GAOKAO, while 34.8 is reported on your official website. I can't reproduce it with your latest version.
Prerequisite
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
latest environment
Reproduces the problem - code/configuration sample
1
Reproduces the problem - command or script
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python run.py --hf-path /Llama-2-7b-hf --datasets gsm8k_gen_1d7fe4 bbh_gen math_gen_265cce GaokaoBench_gen_5cfe9e agieval_gen_a0c741 --model-kwargs device_map='auto' --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False --max-out-len 100 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1 --max-partition-size 15000
Reproduces the problem - error message
Too many zeros appear; and in my evaluation, the GAOKAO result for llama-2-7b-hf is 7.06, which is not consistent with the 18.9 at https://opencompass.org.cn/leaderboard-llm. Thank you, much appreciated.
Other information
I want to reproduce the 18.9 shown on your website!!