modelscope / evalscope

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
https://evalscope.readthedocs.io/en/latest/
Apache License 2.0
263 stars 33 forks source link

输出结果没有分数 #181

Open Leo20100307 opened 2 weeks ago

Leo20100307 commented 2 weeks ago

问题描述 / Issue Description

请简要描述您遇到的问题。 / Please briefly describe the issue you encountered.

本地/root/ChatGLM目录下载的ChatGLM2-6B模型,

使用vllm部署server:

vllm serve /root/ChatGLM --chat-template ./examples/template_chatglm2.jinja --trust_remote_code --use-v2-block-manager

evalscope相关配置:

(evalscope) root@ubuntu:~/evalscope# cat eval_openai_api.yaml eval_backend: OpenCompass eval_config: datasets:

(evalscope) root@ubuntu:~/evalscope# cat example_eval_openai_api.py from evalscope.run import run_task from evalscope.summarizer import Summarizer

def run_eval():

Option 1: Python dictionary

#task_cfg = task_cfg_dict

# Option 2: YAML configuration file
task_cfg = 'eval_openai_api.yaml'

# Option 3: JSON configuration file
# task_cfg = 'eval_openai_api.json'

run_task(task_cfg=task_cfg)
print('>> Start to get the report with summarizer ...')
report_list = Summarizer.get_report_from_cfg(task_cfg)
print(f'\n>> The report list: {report_list}')

run_eval()

使用的工具 / Tools Used

执行的代码或指令 / Code or Commands Executed

请提供您执行的主要代码或指令。 / Please provide the main code or commands you executed. 例如 / For example:

执行测试: python example_eval_openai_api.py

错误日志 / Error Log

请粘贴完整的错误日志或控制台输出。 / Please paste the full error log or console output. 例如 / For example:

dataset version metric mode /root/ChatGLM


--------- 考试 Exam --------- - - - - ceval - - - - cmb - - - - agieval - - - - mmlu - - - - GaokaoBench - - - - ARC-c - - - - ARC-e - - - - --------- 语言 Language --------- - - - - WiC - - - - summedits - - - - chid-dev - - - - afqmc-dev - - - - bustm-dev - - - - cluewsc-dev - - - - WSC - - - - winogrande - - - - flores_100 - - - - --------- 知识 Knowledge --------- - - - - BoolQ - - - - commonsense_qa - - - - nq - - - - triviaqa - - - - --------- 推理 Reasoning --------- - - - - cmnli - - - - ocnli - - - - ocnli_fc-dev - - - - AX_b - - - - AX_g - - - - CB - - - - RTE - - - - story_cloze - - - - COPA - - - - ReCoRD - - - - hellaswag - - - - piqa - - - - siqa - - - - strategyqa - - - - math - - - - gsm8k - - - - TheoremQA - - - - openai_humaneval - - - - mbpp - - - - bbh - - - - --------- 理解 Understanding --------- - - - - C3 - - - - CMRC_dev - - - - DRCD_dev - - - - MultiRC - - - - race-middle - - - - race-high - - - - openbookqa_fact - - - - csl_dev - - - - lcsts - - - - Xsum - - - - eprstmt-dev - - - - lambada - - - - tnews-dev - - - - 11/07 07:06:42 - OpenCompass - INFO - write summary to /root/evalscope/outputs/default/20241107_070629/summary/summary_20241107_070629.txt 11/07 07:06:42 - OpenCompass - INFO - write csv to /root/evalscope/outputs/default/20241107_070629/summary/summary_20241107_070629.csv

Start to get the report with summarizer ... 2024-11-07 07:06:42,022 - evalscope - INFO - **Loading task cfg for summarizer: {'eval_backend': 'OpenCompass', 'eval_config': {'datasets': ['mmlu', 'ceval', 'ARC_c', 'gsm8k'], 'models': [{'openai_api_base': 'http://127.0.0.1:8000/v1/chat/completions', 'path': '/root/ChatGLM', 'temperature': 0.0}]}}

The report list: [{'dataset': '--------- 考试 Exam ---------', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'ceval', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'cmb', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'agieval', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'mmlu', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'GaokaoBench', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'ARC-c', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'ARC-e', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': '--------- 语言 Language ---------', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'WiC', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'summedits', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'chid-dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'afqmc-dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'bustm-dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'cluewsc-dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'WSC', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'winogrande', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'flores_100', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': '--------- 知识 Knowledge ---------', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'BoolQ', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'commonsense_qa', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'nq', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'triviaqa', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': '--------- 推理 Reasoning ---------', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'cmnli', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'ocnli', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'ocnli_fc-dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'AX_b', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'AX_g', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'CB', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'RTE', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'story_cloze', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'COPA', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'ReCoRD', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'hellaswag', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'piqa', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'siqa', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'strategyqa', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'math', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'gsm8k', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'TheoremQA', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'openai_humaneval', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'mbpp', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'bbh', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': '--------- 理解 Understanding ---------', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'C3', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'CMRC_dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'DRCD_dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'MultiRC', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'race-middle', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'race-high', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'openbookqa_fact', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'csl_dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'lcsts', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'Xsum', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'eprstmt-dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'lambada', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}, {'dataset': 'tnews-dev', 'version': '-', 'metric': '-', 'mode': '-', '/root/ChatGLM': '-'}]

运行环境 / Runtime Environment

其他信息 / Additional Information

如果有其他相关信息,请在此处提供。 / If there is any other relevant information, please provide it here.

wangxingjun778 commented 2 weeks ago

请问日志中有error相关字样的log么? 如有则可以进到outputs相对应的logs文件夹中查看对应的error明细 / Please check the error log file in the outputs directory and get details of err msg.

wangxingjun778 commented 2 weeks ago

另外请check一下,评测相关的data是否有预先准备: 参考 https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/opencompass_backend.html

image

Leo20100307 commented 2 weeks ago

请问日志中有error相关字样的log么? 如有则可以进到outputs相对应的logs文件夹中查看对应的error明细 / Please check the error log file in the outputs directory and get details of err msg.

outputs目录下,有个txt文档,里面没有看到报错。日志文件80M,无法上传。

vllm端有打印,模型应该是有接收到请求并做了处理:

image

Leo20100307 commented 2 weeks ago

另外请check一下,评测相关的data是否有预先准备: 参考 https://evalscope.readthedocs.io/zh-cn/latest/user_guides/backend/opencompass_backend.html

image

数据文件已经下载,,并解压到当前目录下,目录名称"data"

Leo20100307 commented 2 weeks ago

image

Leo20100307 commented 2 weeks ago

image

data目录下的数据集文件