open-compass / T-Eval

[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
https://open-compass.github.io/T-Eval/
Apache License 2.0

BUG #25

Open nyBball opened 9 months ago

nyBball commented 9 months ago

I'm currently evaluating the chatglm3-6b model on 4 A800 (80 GB) GPUs. When running the v0.2 code, some of the datasets fail with errors. Details below:

Evaluation runs normally on the instruct, review, and plan json datasets, but plan str and retrieve str fail partway through with the following error:

Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 109, in <module>
    prediction = infer(dataset, llm, args.out_dir, tmp_folder_name=tmp_folder_name, test_num=test_num)
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 74, in infer
    prediction = split_special_tokens(prediction)
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 50, in split_special_tokens
    text = text.split('<eoa>')[0]
AttributeError: 'dict' object has no attribute 'split'
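
The error above suggests the model sometimes hands back a dict rather than a string. A defensive sketch of `split_special_tokens` (the function name and the `<eoa>` token come from the traceback; the dict handling is an assumption based on chatglm3-6b eval-ing some replies into dicts):

```python
import json

def split_special_tokens(prediction):
    """Strip everything after the first '<eoa>' marker.

    chatglm3-6b can return a dict instead of a string; serialize such
    outputs back to text before splitting so .split() never hits a dict.
    """
    if isinstance(prediction, dict):
        prediction = json.dumps(prediction, ensure_ascii=False)
    return prediction.split('<eoa>')[0]
```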

For reason str, understand str, and RRU, the run fails partway through with the following errors:

Input length of input_ids is 8463, but `max_length` is set to 8192. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
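
Both warnings above point at generation settings. The usual remedies, assuming a Hugging Face-style setup, are creating the tokenizer with `padding_side='left'` for decoder-only models, and truncating the prompt so its length plus the generation budget stays under `max_length`. The truncation step can be sketched as (the helper below is illustrative, not the repo's code):

```python
def fit_prompt(input_ids, max_length=8192, max_new_tokens=512):
    # Keep only the most recent tokens so that prompt length plus the
    # generation budget never exceeds the model's max_length.
    budget = max_length - max_new_tokens
    return input_ids[-budget:] if len(input_ids) > budget else input_ids
```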

Do the authors have any suggestions? Thanks!

zehuichen123 commented 9 months ago

For the first issue, you need to set meta_template to chatglm. chatglm3 has a "feature": when the first message has the system role, it evals the returned result into a dict, so it's no longer a string. For the second issue, just wrap it in a try/except... chatglm3 simply errors out on over-long inputs... have it return an empty result.
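
The try/except suggestion can be sketched like this (`llm.generate` and the loop are assumptions about the inference code, not the repo's actual implementation):

```python
def safe_infer(llm, samples):
    # Wrap each generation call so that one over-long prompt (which
    # triggers a CUDA assert on chatglm3-6b) doesn't kill the whole run;
    # failed samples simply get an empty prediction.
    predictions = []
    for sample in samples:
        try:
            pred = llm.generate(sample)
        except Exception:
            pred = ""
        predictions.append(pred)
    return predictions
```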

nyBball commented 9 months ago

Thanks~

My launch command is `sh test_all_zh.sh hf ../../ckpt/chatglm3-6b/ chatglm3-6b-zh chatglm`, so meta_template should already be set to chatglm, right? Also, the first error only shows up on some of the datasets, so it probably isn't caused by meta_template not being set to chatglm?

zehuichen123 commented 9 months ago

By the way, could you share which specific data item triggers it? I'll give it a try on my end~

nyBball commented 9 months ago

plan str:

(screenshot)

retrieve str:

(screenshot)

zehuichen123 commented 9 months ago

I hacked around it, and it should run through now. The root cause is this code they wrote themselves: (https://huggingface.co/THUDM/chatglm3-6b/blob/f30825950ce00cb0577bf6a15e0d95de58e328dc/modeling_chatglm.py#L1021)
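
The linked behavior can be illustrated with a toy sketch (an approximation of what's described above, not the actual modeling_chatglm.py source): when a reply parses as a Python literal, post-processing returns a dict instead of the raw string:

```python
def process_response_like(output):
    # Mirrors the upstream pattern: try to eval the reply; if it parses
    # into a dict (a tool call), return the dict, otherwise keep the text.
    # (eval on model output is exactly why downstream .split() breaks.)
    try:
        parsed = eval(output)
        if isinstance(parsed, dict):
            return parsed
    except Exception:
        pass
    return output
```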

nyBball commented 9 months ago

This codebase has quite a few bugs; I'd suggest checking everything over first and being more rigorous...

1. `sh test_all_zh.sh hf ../../ckpt/internlm-7b/ internlm-7b-zh internlm`

    evaluating understand str errors out

    (screenshot)

2. `sh test_all_en.sh hf ../../ckpt/internlm-7b/ internlm-7b-en internlm`

    evaluating understand str errors out

    (screenshot)

zehuichen123 commented 9 months ago

Sorry about that. The paper's numbers go through opencompass; this codebase was written separately and is still in the verification stage, so some regression tests haven't been run yet. We'll test everything end to end today~ Please bear with the bugs.

zehuichen123 commented 9 months ago

@nyBball The code has been updated; please try again on your side. Full alignment of the scores may have to wait until the opencompass side is ready as well. Here are the infer results for chatglm3-6b:

Overall: 49.8   Instruct: 80.1  Plan: 40.1      Reason: 28.5    Retrieve: 48.6  Understand: 51.7        Review: 50.1

chatglm3-6b seems to hit some internal errors and fall into the try/except logic, which makes its scores somewhat low; we'll look into this later.

zehuichen123 commented 9 months ago

Here are the Qwen-7B infer results, which should be normal:

Overall: 58.6   Instruct: 66.6  Plan: 55.2      Reason: 47.7    Retrieve: 67.1  Understand: 51.9        Review: 63.2

nyBball commented 9 months ago

Does the English evaluation only support running on a single GPU? Also, did you run the base version or the chat version of the models?

(screenshot)

zehuichen123 commented 9 months ago

You can just comment that line out yourself~ All the models we ran are chat models.

nyBball commented 9 months ago

> Here are the Qwen-7B infer results, which should be normal:
>
> Overall: 58.6   Instruct: 66.6  Plan: 55.2      Reason: 47.7    Retrieve: 67.1  Understand: 51.9        Review: 63.2

I ran qwen-7b-chat (https://huggingface.co/Qwen/Qwen-7B-Chat)

on the 1/5 subset of the dataset (https://drive.google.com/file/d/1DgCMjquEIJ2v14Xu6uB6w3UEzaYXZbUL/view)

and got: Overall: 64.0 Instruct: 93.3 Plan: 56.8 Reason: 59.7 Retrieve: 67.3 Understand: 48.7 Review: 58.1

That differs quite a bit from your result. Any idea what might be causing it? Thanks!

zehuichen123 commented 9 months ago

The earlier numbers were on the full set; yesterday I ran the 1/5 subset:

Overall: 58.7   Instruct: 64.1  Plan: 59.6      Reason: 49.3    Retrieve: 66.2  Understand: 51.4        Review: 61.9

You could update the lagent and t-eval code. It looks like your score is mainly higher on instruct? Did you test the str-format instruct? On my side, json-format instruct reaches 90+, but str-format is much lower.