[Bug] Significant score drop when evaluating Qwen/Qwen2.5-72B with v0.3.5

guoshengCS commented 6 days ago

Prerequisite

[X] I have searched Issues and Discussions but cannot get the expected help.
[X] The bug has not been fixed in the latest version.

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

torch==2.2.0+vllm==0.4.0+OpenCompass==0.3.5

Reproduces the problem - code/configuration sample

Evaluate Qwen/Qwen2.5-72B with the following config:

from mmengine.config import read_base

with read_base():
    from .datasets.collections.leaderboard.qwen import datasets
    from .summarizers.leaderboard import summarizer

from opencompass.models import VLLM, HuggingFaceBaseModel

models = [
    dict(
        type=VLLM,
        abbr='qwen2.5-72b-vllm',
        path='Qwen/Qwen2.5-72B',
        model_kwargs=dict(
            tensor_parallel_size=4,
            gpu_memory_utilization=0.8,  # set this to avoid OOM temporarily
            enforce_eager=True,
        ),
        stop_words=['<|endoftext|>', '<|im_end|>'],
        max_out_len=128,
        max_seq_len=8192,
        batch_size=16,
        generation_kwargs=dict(  # args for vllm.SamplingParams
            temperature=0,  # greedy decoding
        ),
        run_cfg=dict(num_gpus=4),
    )
]

Reproduces the problem - command or script

Running the config above directly with run.py, some tasks score low on the latest v0.3.5, a large drop compared with the earlier v0.2.5 (commit e0d7808).

Reproduces the problem - error message

Left: scores on v0.3.5; right: scores on the earlier code version.

Other information

No response

guoshengCS commented 6 days ago

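The issue is mainly that scores for some tasks in leaderboard/qwen.py changed after updating to the latest code. From a quick look:

math: the new code brings in math_postprocess_v2, and the score went from 49.88 to 4.24. There is already a comment about this at https://github.com/open-compass/opencompass/pull/1340#issuecomment-2461201010
humaneval: the new code brings in humaneval_postprocess_v2, and the score went from 41.46 to 7.93.
ARC-c, ARC-e, openbookqa_fact, AX_b, AX_g, COPA, hellaswag, piqa: the first_option_postprocess they use was changed in the new code, and the ARC-c score went from 88 to 24. This seems to be caused by a newly added pattern; also, the use of match.group(0) there does not look right.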
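
To illustrate the match.group(0) concern in the third item: in Python's re module, group(0) is the entire matched text, while group(1) is only the first capture group. The sketch below uses a hypothetical pattern and hypothetical helper names (it is not OpenCompass's actual first_option_postprocess code) to show how scanning group(0) for an option letter can return the wrong answer.

```python
import re

# Hypothetical pattern for illustration only; not taken from OpenCompass.
PATTERN = re.compile(r'answer is\s*\(?([A-D])\)?', re.IGNORECASE)


def extract_with_group0(text: str, options: str = 'ABCD') -> str:
    """Problematic variant: scans the whole match for the first option letter."""
    match = PATTERN.search(text)
    if not match:
        return ''
    whole = match.group(0)  # the full matched span, e.g. 'Answer is (C)'
    for ch in whole:
        if ch in options:
            return ch  # may hit a letter of the surrounding words
    return ''


def extract_with_group1(text: str) -> str:
    """Safer variant: returns only the captured option letter."""
    match = PATTERN.search(text)
    return match.group(1).upper() if match else ''


text = 'The Answer is (C)'
wrong = extract_with_group0(text)  # picks the 'A' in 'Answer'
right = extract_with_group1(text)  # picks the captured 'C'
```

Under these assumptions, any option letter that happens to appear in the matched phrase itself can shadow the real answer, which would be consistent with a large drop on multiple-choice tasks.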

tonysy commented 4 days ago

Thanks for the report, we will follow this issue and check the problem.

MaiziXiao commented 3 days ago

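The third point has been fixed in https://github.com/open-compass/opencompass/pull/1688/files. Please pull the latest code and rerun the evaluation. For base models, we will later release evaluation configs dedicated to base models.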

guoshengCS commented 3 days ago

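Thanks for the fix~ One more thing: the config used here is leaderboard/qwen.py, and there is also leaderboard/qwen_chat.py. So is it not the case that qwen.py is meant for base models and qwen_chat.py for chat models? With only the match.group(0) change applied, the score is still quite low (ARC-c 31.86), so something is still off. Does that mean there is currently no evaluation config that works well for base models?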

guoshengCS commented 3 days ago

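I'd also like to ask: in the new version, instruct models use HuggingFacewithChatTemplate/VLLMwithChatTemplate by default, but these classes do not implement the get_ppl method, so evaluating with dataset configs that use PPLInferencer raises an error. Is PPL-based evaluation intentionally unsupported for instruct models?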

File "/checkpoint/binary/train_package/opencompass/models/base.py", line 84, in get_ppl
raise NotImplementedError(f'{self.__class__.__name__} does not support'
NotImplementedError: VLLMwithChatTemplate does not support ppl-based evaluation yet, try gen-based instead.
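
For context on what PPL-based evaluation needs from a model class: each candidate answer is scored by the perplexity the model assigns to it, and the lowest-perplexity candidate is chosen, so the class must be able to return sequence likelihoods. A minimal sketch with made-up log-probabilities and hypothetical helper names (this is not OpenCompass's PPLInferencer implementation):

```python
import math
from typing import Dict, List


def perplexity(token_logprobs: List[float]) -> float:
    """Perplexity = exp of the mean negative log-likelihood per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_by_ppl(candidates: Dict[str, List[float]]) -> str:
    """Return the label whose continuation has the lowest perplexity."""
    return min(candidates, key=lambda label: perplexity(candidates[label]))


# Made-up per-token log-probabilities for two answer options.
candidates = {
    'A': [-2.0, -2.0, -2.0],
    'B': [-0.5, -0.5, -0.5],
}
best = pick_by_ppl(candidates)
```

A chat-template wrapper that only exposes generation has no such likelihoods to offer, which is why the error message suggests gen-based configs instead.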

BIGWangYuDong commented 3 days ago

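Piggybacking on this thread: I also have a question about HuggingFacewithChatTemplate/VLLMwithChatTemplate. Their template_parser was changed from LMTemplateParser to APITemplateParser, but some earlier custom (DIY) configs no longer seem to work, and the information saved in the predictions does not show the full text actually sent to the model.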

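From a quick read of the source, LMTemplateParser and APITemplateParser currently seem to use inconsistent concatenation strategies, and api_role is a mandatory key, which breaks backward compatibility (BC), for example in how begin and end are handled. Could the meta template documentation be expanded and improved?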
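
To make the difference concrete, here is a sketch of the two meta template styles as plain dicts, together with toy renderers. The begin/end strings follow the ChatML convention used by Qwen chat models; render_lm_style and render_api_style are hypothetical helpers that only approximate the idea, and they are not the real LMTemplateParser or APITemplateParser.

```python
# LM style: every role carries explicit begin/end strings, and the
# pieces are concatenated into one flat prompt string.
lm_meta_template = dict(
    round=[
        dict(role='HUMAN', begin='<|im_start|>user\n', end='<|im_end|>\n'),
        dict(role='BOT', begin='<|im_start|>assistant\n',
             end='<|im_end|>\n', generate=True),
    ],
)

# API style: every role maps to an api_role; the output is a list of
# messages, and any chat formatting is assumed to happen downstream.
api_meta_template = dict(
    round=[
        dict(role='HUMAN', api_role='HUMAN'),
        dict(role='BOT', api_role='BOT', generate=True),
    ],
)


def render_lm_style(template, turns):
    """Toy renderer: concatenate begin + content + end for each turn."""
    roles = {item['role']: item for item in template['round']}
    parts = []
    for role, content in turns:
        cfg = roles[role]
        parts.append(cfg.get('begin', ''))
        if cfg.get('generate') and content is None:
            break  # the model is expected to continue from here
        parts.append(content)
        parts.append(cfg.get('end', ''))
    return ''.join(parts)


def render_api_style(template, turns):
    """Toy renderer: emit role-tagged messages, with no begin/end text."""
    roles = {item['role']: item for item in template['round']}
    return [dict(role=roles[role]['api_role'], prompt=content)
            for role, content in turns if content is not None]


turns = [('HUMAN', 'Hi'), ('BOT', None)]
prompt = render_lm_style(lm_meta_template, turns)
messages = render_api_style(api_meta_template, turns)
```

In this sketch the LM-style result is a single string that can be logged in full, while the API-style result carries no begin/end text at all, which may be why the saved predictions do not show the final text the model received.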
