open-compass / T-Eval

[ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
https://open-compass.github.io/T-Eval/
Apache License 2.0

BUG #25

Open nyBball opened 9 months ago

nyBball commented 9 months ago

I'm currently evaluating the chatglm3-6b model on 4 A800 (80 GB) GPUs. When running the v0.2 code, some of the datasets fail with errors. Details below:

Evaluation runs normally on the instruct, review, and plan json datasets, but plan str and retrieve str fail partway through with the following error:

Traceback (most recent call last):
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 109, in <module>
    prediction = infer(dataset, llm, args.out_dir, tmp_folder_name=tmp_folder_name, test_num=test_num)
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 74, in infer
    prediction = split_special_tokens(prediction)
  File "/home/ma-user/modelarts/user-job-dir/T-Eval/v0.2/test.py", line 50, in split_special_tokens
    text = text.split('<eoa>')[0]
AttributeError: 'dict' object has no attribute 'split'
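
The error above suggests the model sometimes hands back a dict rather than a string. A defensive sketch of `split_special_tokens` (the function name and the `<eoa>` token come from the traceback; the dict handling is an assumption based on chatglm3-6b eval-ing some replies into dicts):

```python
import json

def split_special_tokens(prediction):
    """Strip everything after the first '<eoa>' marker.

    chatglm3-6b can return a dict instead of a string; serialize such
    outputs back to text before splitting so .split() never hits a dict.
    """
    if isinstance(prediction, dict):
        prediction = json.dumps(prediction, ensure_ascii=False)
    return prediction.split('<eoa>')[0]
```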

For reason str, understand str, and RRU, the run fails partway through with the following errors:

Input length of input_ids is 8463, but `max_length` is set to 8192. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
A decoder-only architecture is being used, but right-padding was detected! For correct generation results, please set `padding_side='left'` when initializing the tokenizer.
CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
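
Both warnings above point at generation settings. The usual remedies, assuming a Hugging Face-style setup, are creating the tokenizer with `padding_side='left'` for decoder-only models, and truncating the prompt so its length plus the generation budget stays under `max_length`. The truncation step can be sketched as (the helper below is illustrative, not the repo's code):

```python
def fit_prompt(input_ids, max_length=8192, max_new_tokens=512):
    # Keep only the most recent tokens so that prompt length plus the
    # generation budget never exceeds the model's max_length.
    budget = max_length - max_new_tokens
    return input_ids[-budget:] if len(input_ids) > budget else input_ids
```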

Do the authors have any suggestions? Thanks!

zehuichen123 commented 9 months ago

For the first issue, you need to set meta_template to chatglm. chatglm3 has a "feature": when the first message has the system role, it evals the returned result into a dict, so it's no longer a string. For the second issue, just wrap it in a try/except... chatglm3 simply errors out on over-long inputs... have it return an empty result.
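
The try/except suggestion can be sketched like this (`llm.generate` and the loop are assumptions about the inference code, not the repo's actual implementation):

```python
def safe_infer(llm, samples):
    # Wrap each generation call so that one over-long prompt (which
    # triggers a CUDA assert on chatglm3-6b) doesn't kill the whole run;
    # failed samples simply get an empty prediction.
    predictions = []
    for sample in samples:
        try:
            pred = llm.generate(sample)
        except Exception:
            pred = ""
        predictions.append(pred)
    return predictions
```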

nyBball commented 9 months ago

Thanks~

My launch command is `sh test_all_zh.sh hf ../../ckpt/chatglm3-6b/ chatglm3-6b-zh chatglm`, so meta_template should already be set to chatglm, right? Also, the first error only shows up on some of the datasets, so it probably isn't caused by meta_template not being set to chatglm?

zehuichen123 commented 9 months ago

By the way, could you share which specific data item triggers it? I'll give it a try on my end~

nyBball commented 9 months ago

plan str:

(screenshot)

retrieve str:

(screenshot)

zehuichen123 commented 9 months ago

I hacked around it, and it should run through now. The root cause is this code they wrote themselves: (https://huggingface.co/THUDM/chatglm3-6b/blob/f30825950ce00cb0577bf6a15e0d95de58e328dc/modeling_chatglm.py#L1021)
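
The linked behavior can be illustrated with a toy sketch (an approximation of what's described above, not the actual modeling_chatglm.py source): when a reply parses as a Python literal, post-processing returns a dict instead of the raw string:

```python
def process_response_like(output):
    # Mirrors the upstream pattern: try to eval the reply; if it parses
    # into a dict (a tool call), return the dict, otherwise keep the text.
    # (eval on model output is exactly why downstream .split() breaks.)
    try:
        parsed = eval(output)
        if isinstance(parsed, dict):
            return parsed
    except Exception:
        pass
    return output
```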

nyBball commented 9 months ago

This codebase has quite a few bugs; I'd suggest checking everything over first and being more rigorous...

1. `sh test_all_zh.sh hf ../../ckpt/internlm-7b/ internlm-7b-zh internlm`

    evaluating understand str errors out

    (screenshot)

2. `sh test_all_en.sh hf ../../ckpt/internlm-7b/ internlm-7b-en internlm`

    evaluating understand str errors out

    (screenshot)

zehuichen123 commented 9 months ago

Sorry about that. The paper's numbers go through opencompass; this codebase was written separately and is still in the verification stage, so some regression tests haven't been run yet. We'll test everything end to end today~ Please bear with the bugs.

zehuichen123 commented 9 months ago

@nyBball The code has been updated; please try again on your side. Full alignment of the scores may have to wait until the opencompass side is ready as well. Here are the infer results for chatglm3-6b:

Overall: 49.8   Instruct: 80.1  Plan: 40.1      Reason: 28.5    Retrieve: 48.6  Understand: 51.7        Review: 50.1

chatglm3-6b seems to hit some internal errors and fall into the try/except logic, which makes its scores somewhat low; we'll look into this later.

zehuichen123 commented 9 months ago

Here are the Qwen-7B infer results, which should be normal:

Overall: 58.6   Instruct: 66.6  Plan: 55.2      Reason: 47.7    Retrieve: 67.1  Understand: 51.9        Review: 63.2

nyBball commented 9 months ago

Does the English evaluation only support running on a single GPU? Also, did you run the base version or the chat version of the models?

(screenshot)

zehuichen123 commented 9 months ago

You can just comment that line out yourself~ All the models we ran are chat models.

nyBball commented 9 months ago

> Here are the Qwen-7B infer results, which should be normal:
>
> Overall: 58.6   Instruct: 66.6  Plan: 55.2      Reason: 47.7    Retrieve: 67.1  Understand: 51.9        Review: 63.2

I ran qwen-7b-chat (https://huggingface.co/Qwen/Qwen-7B-Chat)

on the 1/5 subset of the dataset (https://drive.google.com/file/d/1DgCMjquEIJ2v14Xu6uB6w3UEzaYXZbUL/view)

and got: Overall: 64.0 Instruct: 93.3 Plan: 56.8 Reason: 59.7 Retrieve: 67.3 Understand: 48.7 Review: 58.1

That differs quite a bit from your result. Any idea what might be causing it? Thanks!

zehuichen123 commented 9 months ago

The earlier numbers were on the full set; yesterday I ran the 1/5 subset:

Overall: 58.7   Instruct: 64.1  Plan: 59.6      Reason: 49.3    Retrieve: 66.2  Understand: 51.4        Review: 61.9

You could update the lagent and t-eval code. It looks like your score is mainly higher on instruct? Did you test the str-format instruct? On my side, json-format instruct reaches 90+, but str-format is much lower.