open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

Gaokao and some datasets show many zeros when I evaluate them [Bug] #480

Open kkwhale7 opened 1 year ago

kkwhale7 commented 1 year ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

latest environment

Reproduces the problem - code/configuration sample

1

Reproduces the problem - command or script

```bash
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' python run.py --hf-path /Llama-2-7b-hf \
    --datasets gsm8k_gen_1d7fe4 bbh_gen math_gen_265cce GaokaoBench_gen_5cfe9e agieval_gen_a0c741 \
    --model-kwargs device_map='auto' --tokenizer-kwargs padding_side='left' truncation='left' use_fast=False \
    --max-out-len 100 --max-seq-len 4096 --batch-size 8 --no-batch-padding --num-gpus 1 --max-partition-size 15000
```

Reproduces the problem - error message

Too many zeros appear (see screenshot). Also, in my evaluation the GAOKAO result for llama2-7b-hf is 7.06, which is not consistent with the 18.9 at https://opencompass.org.cn/leaderboard-llm. Thank you, much appreciated.

Other information

I want to reproduce the 18.9 reported on your website!

kkwhale7 commented 1 year ago

@tonysy @lvhan028 @so2liu @cdpath I need your help!

kkwhale7 commented 1 year ago

@Leymore

kkwhale7 commented 1 year ago

You haven't implemented the evaluation logic for subjective questions, so why are the values displayed on the official website different from ours? (screenshot)

tonysy commented 1 year ago

We only include the objective questions of Gaokao in OpenCompass

kkwhale7 commented 1 year ago

We only include the objective questions of Gaokao in OpenCompass

But the score on your website is 18.9 for GAOKAO (screenshot); we can't reproduce it!

kkwhale7 commented 1 year ago

With my approach, calculating only the objective questions gives a score of 15.13. (screenshot)

kirliavc commented 1 year ago

The MCQ problems select one answer from 4 choices, so random guessing already yields 25% on average. A result below 25% is meaningless.
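
For reference, a quick simulation of uniform random guessing on 4-choice questions (illustrative only):

```python
import random

# Uniform random guessing on a 4-choice MCQ is correct with probability 1/4,
# so an accuracy below ~25% carries no signal.
random.seed(0)
n = 100_000
choices = "ABCD"
hits = sum(random.choice(choices) == random.choice(choices) for _ in range(n))
print(f"random-guess accuracy = {hits / n:.3f}")  # ~ 0.250
```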

kkwhale7 commented 1 year ago

The MCQ problems select one answer from 4 choices, so random guessing already yields 25% on average. A result below 25% is meaningless.

I got it. So do you directly ignore the multiple-choice scores, or only count the parts greater than 25%?

kkwhale7 commented 1 year ago

I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, even though ZeroRetriever is used. All results below are with llama2-7b models. (screenshots)

tonysy commented 1 year ago

We will review this problem; more information and logs will be provided later.

Leymore commented 1 year ago

Detailed scores can be found here: https://opencompass.org.cn/dataset-detail/GAOKAO-Bench

The average score is weighted by the total score of each individual subject. We do NOT ignore the scores below 25.0!
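
To illustrate the weighting, a minimal sketch (the subject names and point totals below are invented, not the real GAOKAO-Bench numbers):

```python
# Weighted average: each subject contributes in proportion to its total
# attainable points, so low-scoring subjects are kept, not dropped.
subject_scores = {          # (points obtained, total points) -- made up
    "math":    (12.0, 100.0),
    "physics": (30.0, 100.0),
    "history": (45.0, 150.0),
}
obtained = sum(got for got, _ in subject_scores.values())
total = sum(full for _, full in subject_scores.values())
print(f"weighted average: {100.0 * obtained / total:.2f}")  # 24.86
```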

As for llama-2-7b's failure to follow the instructions, we think this is totally understandable. We implement the postprocessing here: https://github.com/open-compass/opencompass/blob/main/opencompass/datasets/GaokaoBench.py. The final result depends on the output of this postprocessing.
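
For illustration, a minimal sketch of this kind of option extraction, assuming the answer is taken as the last standalone A/B/C/D in the generation; the function name and details are invented, and the real logic in GaokaoBench.py handles more cases:

```python
import re

def extract_choice(prediction: str) -> str:
    """Return the last standalone A/B/C/D mentioned in the model output,
    or "" if none is found."""
    matches = re.findall(r"\b[ABCD]\b", prediction)
    return matches[-1] if matches else ""

print(extract_choice("The answer is B. Final answer: C"))  # -> "C"
print(extract_choice("I am not sure about this one."))     # -> ""
```

When no option letter can be extracted, the question is scored as wrong, which is one way a run can end up showing many zeros.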

kkwhale7 commented 1 year ago

Thank you for your patience; I understand your score calculation method now. But why are the two predictions different when I have the same config? This postprocessing method only takes the first A/B/C/D character found scanning from the back of the prediction, but the prediction is still inconsistent with the other test: https://github.com/open-compass/opencompass/issues/480#issuecomment-1765588411

kkwhale7 commented 1 year ago

I have discovered a new phenomenon: the predictions generated by the gen task in GAOKAO are also inconsistent, even though ZeroRetriever is used. All results below are with llama2-7b models. (screenshots) So why are the two prediction results inconsistent when using the gen approach?

tonysy commented 1 year ago

@kkwhale7 Hey, does this issue still exist?

kkwhale7 commented 1 year ago

@kkwhale7 Hey, does this issue still exist?

Yes. When I calculated the average score over only the subjects you specified, Baichuan2-7b-base scored just 17.33 on GAOKAO, while 34.8 is reported on your official website. (screenshots) I can't reproduce it with your latest version.