[Feature] Improve evaluation scripts for mbpp datasets

open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.

https://opencompass.org.cn/

Apache License 2.0

4.23k stars, 451 forks

[Feature] Improve evaluation scripts for mbpp datasets #933

Open yuhui1038 opened 9 months ago

yuhui1038 commented 9 months ago

Describe the feature

When I evaluated the vicuna-7b-v1.5 model with the mbpp_gen script, the score was 0 and most answers were marked as failed. Perhaps the evaluation script does not format the answers properly. [Screenshots: WeChat screenshot 20240228213314, WeChat image 20240228213255]

Will you implement it?

[ ] I would like to implement this feature and create a PR!
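
To illustrate the formatting hypothesis in the report above: chat-tuned models often wrap code in explanatory text and Markdown fences, and running that raw text against MBPP-style asserts fails on a syntax error regardless of whether the code is right. The sketch below is a hypothetical post-processing helper, not OpenCompass's actual implementation; the function names `extract_code` and `passes_tests` and the sample response are invented for illustration.

```python
import re
from typing import List

# Hypothetical helper, NOT OpenCompass's actual post-processing.
# Matches the first fenced block: opening fence with optional language tag,
# then the body up to the next closing fence.
_FENCE = re.compile(r"```[^\n]*\n(.*?)```", re.DOTALL)


def extract_code(response: str) -> str:
    """Return the first fenced code block if present, else the stripped response."""
    match = _FENCE.search(response)
    if match:
        return match.group(1).strip()
    return response.strip()


def passes_tests(code: str, tests: List[str]) -> bool:
    """Run the candidate code and assert-style tests in a fresh namespace.

    exec() on model output is unsafe; a real harness should sandbox this.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        for test in tests:
            exec(test, namespace)
    except Exception:
        return False
    return True


# Invented example of a chat-style answer around a correct function.
response = (
    "Sure! Here is the function:\n"
    "```python\n"
    "def add(a, b):\n"
    "    return a + b\n"
    "```\n"
    "Hope this helps."
)
tests = ["assert add(1, 2) == 3"]

raw_ok = passes_tests(response, tests)                    # raw text is not valid Python
cleaned_ok = passes_tests(extract_code(response), tests)  # extracted code runs
print(raw_ok, cleaned_ok)
```

If the predictions look like the sample response here, a score of 0 would point at answer extraction rather than at the model's code.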

YFCYFC commented 9 months ago

I ran into the same error. Here's the result I got on the MBPP dataset when evaluating gemma-7b-it. By the way, where can I find the prediction results like the ones you pasted? Thank you for your report.

YFCYFC commented 9 months ago

I found the prediction file, and every prediction is empty. [Screenshot 2024-03-07, 6:46:09 PM] I don't know what happened here, because I used the default config for both the model and the dataset. Looking forward to any helpful findings.
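
A quick way to confirm how many predictions are empty is to count them directly in the prediction file. The sketch below assumes the file is a JSON object that maps an example index to a record with a "prediction" field; the thread does not confirm that layout, so adjust the key if your files differ. The function name `count_empty_predictions` and the toy data are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def count_empty_predictions(path):
    """Return (empty, total) for a predictions JSON file.

    Assumed layout (not confirmed by the thread): a JSON object mapping an
    example index to a record that has a "prediction" field.
    """
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    total = len(records)
    empty = sum(
        1
        for rec in records.values()
        if not str(rec.get("prediction") or "").strip()
    )
    return empty, total


# Toy data standing in for a real prediction file.
sample = {
    "0": {"prediction": ""},
    "1": {"prediction": "def f(): pass"},
    "2": {"prediction": "  \n"},
}
with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "mbpp.json"
    sample_path.write_text(json.dumps(sample), encoding="utf-8")
    empty, total = count_empty_predictions(sample_path)

print(f"{empty}/{total} predictions are empty")
```

If every record is empty, the problem lies in generation (prompt template, stop tokens, or the model itself) rather than in the scoring step.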

iFe1er commented 8 months ago

Same problem when testing the bbh_gen task: the predictions are empty. Have you fixed it? @YFCYFC

YFCYFC commented 8 months ago

Sorry, I did not find the exact reason. I just fine-tuned my model once more, and the predictions were no longer empty, but the results were still not what I wanted, so I didn't post an update here. I suggest fine-tuning your model for more than one epoch, which may help.