Fixed in 81cf69c
Thanks. The evaluation is much faster with dp_size > 1 now. One small problem: for the same model (llama-7b) and the same dataset (alpaca-zh), eval_batch_size=16 gives an eval ROUGE of 19.010, while eval_batch_size=8 gives 18.1586. The val set has 500 examples in total. It would be better if the ROUGE value were more consistent across different eval_batch_size values.
To get a more consistent ROUGE score, you can increase the size of the val set, or run the evaluation under several different random seeds and calculate an average score.
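For example, a minimal sketch of the averaging step, assuming one evaluation run per seed has already produced a ROUGE score (the seed values and scores below are placeholders, not real results):

```python
import statistics

# Placeholder results: one eval run per random seed on the same val set.
rouge_by_seed = {13: 18.9, 21: 18.4, 42: 19.2}

scores = list(rouge_by_seed.values())
print(f"ROUGE over {len(scores)} seeds: "
      f"{statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```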
At the evaluation phase of llama-7b/gpt2-xlarge, whose MP_size=1, I tried to use 8 GPUs to accelerate evaluation. The code is scripts/gpt2/eval/run_eval.sh. I simplified this script to evaluate only one task and set gpu_num=8 (the default is 1).

If I use gpu_num=1, the evaluation is fine: the final ROUGE value is normal, and it is consistent with the training-time evaluation ROUGE. But with gpu_num=8, the ROUGE is much lower than expected.

I checked results/gpt2/eval_main/alpaca_zh-512/xxx/answers.jsonl for more details and found that there are only 63 lines of responses in the 8-GPU setting, whereas in the 1-GPU setting there are 500 lines, exactly the size of the val set (note that 500 / 8 ≈ 63, i.e. one data-parallel shard). I think dp_size > 1 might be the cause of this problem.
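For reference, a quick way to reproduce the line count check (xxx stands for the run directory in the path above):

```python
# Count the generated responses: gpu_num=1 yields 500 (the full val set),
# while gpu_num=8 yielded only 63, roughly 500 / 8, i.e. a single dp shard.
path = "results/gpt2/eval_main/alpaca_zh-512/xxx/answers.jsonl"
with open(path) as f:
    print(sum(1 for _ in f), "responses in", path)
```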
For llama-13b, whose MP_size=4, the validation is normal if I use gpu_num=4, but wrong with gpu_num=8. My evaluation code for alpaca_zh is very similar to that of dolly, so I guess this problem might exist for other datasets such as dolly too.
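If the cause is indeed that each data-parallel rank writes only its own shard of generations, the shards would need to be gathered before answers.jsonl is written. A minimal sketch of such a gather step with torch.distributed (illustrative only, not the repo's actual code; it assumes the process group is already initialized):

```python
import torch.distributed as dist

def gather_responses(local_responses, dp_group=None):
    # local_responses: list of generated strings for this rank's shard.
    # all_gather_object handles arbitrary picklable objects, so every
    # rank receives every shard.
    world_size = dist.get_world_size(group=dp_group)
    gathered = [None] * world_size
    dist.all_gather_object(gathered, local_responses, group=dp_group)
    # Flatten shard by shard; only rank 0 should write answers.jsonl,
    # and the order may need re-indexing to match the original val set.
    return [r for shard in gathered for r in shard]
```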