open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) over 100+ datasets.
https://opencompass.org.cn/
Apache License 2.0

[Bug] mmlu datasets evaluation failed #1059

Closed · wdndev closed this 4 months ago

wdndev commented 4 months ago

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

The conda environment follows the official recommendation.

Reproduces the problem - code/configuration sample

The run is started from a shell script; the model is Qwen1.5-1.8B.

Reproduces the problem - command or script

python run.py \
    --datasets mmlu_ppl_ac766d \
    --hf-path /home/common/ckpt/opensource/Qwen1.5-0.5B \
    --tokenizer-path /home/common/ckpt/opensource/Qwen1.5-0.5B \
    --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True use_fast=False \
    --model-kwargs device_map='auto' trust_remote_code=True \
    --batch-size 10 \
    --num-gpus 1 \
    --max-partition-size 40000 \
    --max-workers-per-gpu 1 \
    --summarizer leaderboard.py \
    --dump-eval-details \
    -w /home/common/eval/opencompass/

Reproduces the problem - error message

These are the evaluation logs:

  0%|          | 0/4 [00:00<?, ?it/s]
 25%|██▌       | 1/4 [02:59<08:58, 179.64s/it]
 50%|█████     | 2/4 [03:06<02:35, 77.79s/it] 
100%|██████████| 4/4 [11:20<00:00, 184.08s/it]
100%|██████████| 4/4 [11:20<00:00, 170.19s/it]
launch OpenICLInfer[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_professional_law_1] on GPU 1
launch OpenICLInfer[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_professional_law_0] on GPU 0
......
04/18 17:38:09 - OpenCompass - INFO - Partitioned into 57 tasks.
  0%|          | 0/57 [00:06<?, ?it/s]
100%|██████████| 57/57 [03:15<00:00,  3.43s/it]
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_college_biology] on CPU 
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_college_chemistry] on CPU 
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_college_computer_science] on CPU 
......
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_conceptual_physics] on CPU 
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_us_foreign_policy] on CPU 
Traceback (most recent call last):
  File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 209, in _calculate_group_metrics
    numerator = sum(scores[metric][k] * sg['weights'][k] for k in sg['weights'] if sg['weights'][k] != 0)
  File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 209, in <genexpr>
    numerator = sum(scores[metric][k] * sg['weights'][k] for k in sg['weights'] if sg['weights'][k] != 0)
KeyError: 'lukaemon_mmlu_abstract_algebra'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/common/code/opencompass/run.py", line 4, in <module>
    main()
  File "/home/common/code/opencompass/opencompass/cli/main.py", line 360, in main
    summarizer.summarize(time_str=cfg_time_str)
  File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 338, in summarize
    self._calculate_group_metrics(raw_results, parsed_results, dataset_metrics, dataset_eval_mode)
  File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 211, in _calculate_group_metrics
    tmp_scores = {metric: {k.split('@')[0]: v for k, v in scores[metric].items()} for metric in scores}
  File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 211, in <dictcomp>
    tmp_scores = {metric: {k.split('@')[0]: v for k, v in scores[metric].items()} for metric in scores}
AttributeError: 'float' object has no attribute 'items'

Other information

Other datasets evaluate without this problem; the error only appears with the MMLU datasets. I found that the inference and evaluation stages run normally and produce the result folders and PPL-related data, but the failure occurs when the results are summarized.
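
From the traceback, the failure looks two-stage: the leaderboard summarizer's group weights reference lukaemon_mmlu_abstract_algebra, which is missing from the parsed scores (the KeyError), and the fallback path then trips over a metric whose value is a bare float rather than a per-dataset dict (the AttributeError). Below is a minimal sketch of how the two exceptions chain; the scores and sg dictionaries are hypothetical stand-ins, not the real structures built in opencompass/summarizers/default.py.

# Hypothetical data: the MMLU group expects abstract_algebra ...
sg = {'weights': {'lukaemon_mmlu_abstract_algebra': 1}}
# ... but the parsed results lack that key, and one metric is a bare float
scores = {
    'accuracy': {'lukaemon_mmlu_college_biology': 55.0},
    'naive_average': 55.0,  # assumed shape; a float here triggers the second error
}

try:
    for metric in scores:
        # default.py:209 - raises KeyError: 'lukaemon_mmlu_abstract_algebra',
        # because the group's weights reference a dataset absent from scores
        numerator = sum(scores[metric][k] * sg['weights'][k]
                        for k in sg['weights'] if sg['weights'][k] != 0)
except KeyError:
    # default.py:211 - the fallback strips '@k' suffixes from the keys, but
    # hits the bare float and raises AttributeError: 'float' object has no
    # attribute 'items' while handling the KeyError
    tmp_scores = {metric: {k.split('@')[0]: v for k, v in scores[metric].items()}
                  for metric in scores}

Running this sketch reproduces both exceptions in the same order as the traceback above.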

bittersweet1999 commented 4 months ago

I think it may be caused by the wrong summarizer. As a temporary workaround, you can try running without --summarizer leaderboard.py.
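
For reference, the workaround amounts to re-running the original command with only the summarizer override removed (all paths as in the report above), so the default summarizer is used:

python run.py \
    --datasets mmlu_ppl_ac766d \
    --hf-path /home/common/ckpt/opensource/Qwen1.5-0.5B \
    --tokenizer-path /home/common/ckpt/opensource/Qwen1.5-0.5B \
    --tokenizer-kwargs padding_side='left' truncation='left' trust_remote_code=True use_fast=False \
    --model-kwargs device_map='auto' trust_remote_code=True \
    --batch-size 10 \
    --num-gpus 1 \
    --max-partition-size 40000 \
    --max-workers-per-gpu 1 \
    --dump-eval-details \
    -w /home/common/eval/opencompass/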

wdndev commented 4 months ago

Thank you! I'll try it.