OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMA2, Qwen, GLM, Claude, etc.) across 100+ datasets.
0%| | 0/4 [00:00<?, ?it/s]
25%|██▌ | 1/4 [02:59<08:58, 179.64s/it]
50%|█████ | 2/4 [03:06<02:35, 77.79s/it]
100%|██████████| 4/4 [11:20<00:00, 184.08s/it]
100%|██████████| 4/4 [11:20<00:00, 170.19s/it]
launch OpenICLInfer[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_professional_law_1] on GPU 1
launch OpenICLInfer[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_professional_law_0] on GPU 0
......
04/18 17:38:09 - OpenCompass - INFO - Partitioned into 57 tasks.
0%| | 0/57 [00:06<?, ?it/s]
100%|██████████| 57/57 [03:15<00:00, 3.43s/it]
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_college_biology] on CPU
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_college_chemistry] on CPU
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_college_computer_science] on CPU
......
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_conceptual_physics] on CPU
launch OpenICLEval[opencompass.models.huggingface.HuggingFace_opensource_Qwen1.5-0.5B/lukaemon_mmlu_us_foreign_policy] on CPU
Traceback (most recent call last):
File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 209, in _calculate_group_metrics
numerator = sum(scores[metric][k] * sg['weights'][k] for k in sg['weights'] if sg['weights'][k] != 0)
File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 209, in <genexpr>
numerator = sum(scores[metric][k] * sg['weights'][k] for k in sg['weights'] if sg['weights'][k] != 0)
KeyError: 'lukaemon_mmlu_abstract_algebra'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/common/code/opencompass/run.py", line 4, in <module>
main()
File "/home/common/code/opencompass/opencompass/cli/main.py", line 360, in main
summarizer.summarize(time_str=cfg_time_str)
File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 338, in summarize
self._calculate_group_metrics(raw_results, parsed_results, dataset_metrics, dataset_eval_mode)
File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 211, in _calculate_group_metrics
tmp_scores = {metric: {k.split('@')[0]: v for k, v in scores[metric].items()} for metric in scores}
File "/home/common/code/opencompass/opencompass/summarizers/default.py", line 211, in <dictcomp>
tmp_scores = {metric: {k.split('@')[0]: v for k, v in scores[metric].items()} for metric in scores}
AttributeError: 'float' object has no attribute 'items'
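For clarity, the two chained exceptions can be reproduced with a minimal, self-contained sketch (the dict shapes below are assumptions for illustration, not OpenCompass's actual internals): the weighted sum raises KeyError when a subset named in the group's weights is missing from the per-metric scores, and the fallback that strips `@k` suffixes then raises AttributeError if any metric's value is a plain float rather than a dict of subset scores.

```python
# Hypothetical reconstruction of the failure mode; data shapes are
# assumptions, not OpenCompass's real structures.

def summarize_group(scores, weights):
    """Mimics the two-step logic around default.py lines 209-211."""
    for metric in scores:
        try:
            # Step 1: weighted sum over the subsets named in `weights`.
            # Raises KeyError if a weighted subset is missing.
            return sum(scores[metric][k] * weights[k]
                       for k in weights if weights[k] != 0)
        except KeyError:
            # Step 2 (fallback): strip "@k" suffixes from subset names.
            # Raises AttributeError if any metric's value is a float,
            # because floats have no .items().
            tmp = {m: {k.split('@')[0]: v for k, v in scores[m].items()}
                   for m in scores}
            return sum(tmp[metric][k] * weights[k]
                       for k in weights if weights[k] != 0)

scores = {
    'accuracy': {'lukaemon_mmlu_anatomy': 55.0},  # weighted subset missing
    'score': 61.3,                                # flattened to a bare float
}
weights = {'lukaemon_mmlu_abstract_algebra': 1}

try:
    summarize_group(scores, weights)
except AttributeError as e:
    print(e)  # 'float' object has no attribute 'items'
```

In this sketch the KeyError is swallowed and re-surfaces as the AttributeError, matching the "During handling of the above exception, another exception occurred" chain in the traceback.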
Prerequisite
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
The conda environment follows the official recommendation.
Reproduces the problem - code/configuration sample
Started with a shell script; the model is Qwen1.5-1.8B.
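The actual script and config were not included. As an illustration only, a HuggingFace model entry in an OpenCompass Python config typically looks like the sketch below; the `abbr` and `path` values here are assumptions, not the reporter's real settings.

```python
# Illustrative OpenCompass model config fragment (values are assumed;
# adapt to the script that was actually used).
from opencompass.models import HuggingFace

models = [
    dict(
        type=HuggingFace,
        abbr='qwen1.5-1.8b-hf',    # assumed abbreviation
        path='Qwen/Qwen1.5-1.8B',  # assumed HF model path
        max_out_len=100,
        batch_size=8,
        run_cfg=dict(num_gpus=1),
    )
]
```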
Reproduces the problem - command or script
Reproduces the problem - error message
These are the eval logs.
Other information
Other datasets are evaluated without this problem; the error is reported only with the MMLU datasets. Inference and evaluation run normally, and result folders and ppl-related data are generated, but the failure occurs when the results are summarized.
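As a temporary workaround (not an upstream fix), the `@k`-stripping fallback could be made defensive by skipping metric entries that were flattened to a bare float instead of a dict of subset scores. A sketch, using a hypothetical helper:

```python
# Defensive variant (illustrative only): skip metric entries whose
# value is not a dict of subset scores, so .items() is never called
# on a float.
def strip_at_suffixes(scores):
    return {
        metric: {k.split('@')[0]: v for k, v in subsets.items()}
        for metric, subsets in scores.items()
        if isinstance(subsets, dict)
    }

scores = {'accuracy': {'lukaemon_mmlu_anatomy@5': 55.0}, 'score': 61.3}
print(strip_at_suffixes(scores))
# {'accuracy': {'lukaemon_mmlu_anatomy': 55.0}}
```

This avoids the crash during summarization, though it silently drops the flattened entry, so the root cause (why the MMLU scores arrive as a float) would still need investigation.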