microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI
MIT License

[miniLLM] The evaluation might be wrong when using dp_size > 1 #101

Closed · cailinhang closed this issue 1 year ago

cailinhang commented 1 year ago

In the evaluation phase of llama-7b/gpt2-xlarge (MP_size=1), I tried to use 8 GPUs to speed up evaluation. The script is scripts/gpt2/eval/run_eval.sh, which I simplified to evaluate only one task, and I set gpu_num=8 instead of the default 1.

```bash
base_path=${1-"/home/MiniLLM"}
port=2040

ckpt_base_path=/xx/LMOps/minillm/results/gpt2/train/

for data in alpaca_zh
do
    # Evaluate SFT
    for seed in 10
    do
        ckpt="sft/gpt2-base"
        ckpt="${ckpt_base_path}/${ckpt}"

        gpu_num=8 # this gives wrong results
        gpu_num=1 # this gives normal results
        bash ${base_path}/scripts/gpt2/eval/eval_main_${data}.sh ${base_path} ${port} ${gpu_num} ${ckpt} --seed $seed --eval-batch-size 8
    done
done
```

If I use gpu_num=1, the evaluation is fine: the final ROUGE value is normal and consistent with the training-time evaluation ROUGE. But with gpu_num=8, the ROUGE is much lower than expected.

I checked results/gpt2/eval_main/alpaca_zh-512/xxx/answers.jsonl for more details and found only 63 lines of responses in the 8-GPU setting, while the 1-GPU setting produces 500 lines, exactly the size of the validation set. Since 63 is roughly 500/8, it looks as if only one data-parallel rank's responses are being written. I think dp_size > 1 might be the cause of this problem.

For llama-13b (MP_size=4), validation is normal with gpu_num=4 but wrong with gpu_num=8. My evaluation code for alpaca_zh is very similar to the dolly one, so I suspect this problem exists for other datasets such as dolly too.
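My guess is that the generations need to be gathered across the data-parallel ranks before answers.jsonl is written. A minimal sketch of that pattern, purely for illustration (it assumes a torch.distributed process group is already initialized; save_all_responses and EVAL_SET_SIZE are made-up names, not the actual MiniLLM code):

```python
import json

import torch.distributed as dist

EVAL_SET_SIZE = 500  # size of the alpaca_zh validation set in this report


def save_all_responses(responses, path):
    """Gather every data-parallel rank's generations and write them once.

    `responses` is the list of strings generated on the local rank. Without
    a gather, each rank only holds its own shard (about 500/8 = 63 examples
    with dp_size=8), which matches the truncated answers.jsonl above.
    """
    gathered = [None] * dist.get_world_size()
    # all_gather_object collects an arbitrary picklable object from each rank.
    dist.all_gather_object(gathered, responses)
    if dist.get_rank() == 0:
        merged = [r for shard in gathered for r in shard]
        # Distributed samplers usually pad the dataset so every rank sees the
        # same number of batches; truncate to drop the duplicated tail. If the
        # sampler interleaves examples, the shards would also need reordering
        # before scoring.
        merged = merged[:EVAL_SET_SIZE]
        with open(path, "w") as f:
            for r in merged:
                f.write(json.dumps({"text": r}, ensure_ascii=False) + "\n")
```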

t1101675 commented 1 year ago

Fixed in https://github.com/microsoft/LMOps/commit/81cf69c60c3c69d7693d42c7e985e80e8d864dc6

cailinhang commented 1 year ago


Thanks. The evaluation is much faster with dp_size > 1 now.

A tiny remaining problem: for the same model (llama-7b) and the same dataset (alpaca_zh), eval_batch_size=16 gives an eval ROUGE of 19.010, while eval_batch_size=8 gives 18.1586. The validation set has 500 examples in total.

It would be better if the ROUGE value were more consistent across different eval_batch_size settings.

t1101675 commented 1 year ago

To get a more consistent ROUGE score, you can increase the size of the validation set, or run the evaluation under several different random seeds and report the average score.
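For example, a small driver along these lines runs the evaluation once per seed and averages the scores. This is only a sketch: the seed list and the final ROUGE values are placeholders, and it simply reuses the eval invocation from the script above.

```python
import statistics
import subprocess

BASE_PATH = "/home/MiniLLM"  # same default as base_path in the script above
CKPT = "/xx/LMOps/minillm/results/gpt2/train/sft/gpt2-base"
SEEDS = [10, 20, 30, 40]     # hypothetical seed choices

for seed in SEEDS:
    # One evaluation run per seed, reusing the thread's eval entry point.
    subprocess.run(
        ["bash", f"{BASE_PATH}/scripts/gpt2/eval/eval_main_alpaca_zh.sh",
         BASE_PATH, "2040", "8", CKPT,
         "--seed", str(seed), "--eval-batch-size", "8"],
        check=True,
    )

# Collect the per-seed ROUGE values from each run's output directory
# (results/gpt2/eval_main/...); the numbers below are placeholders.
rouges = [18.2, 19.0, 18.6, 18.8]
print(f"mean={statistics.mean(rouges):.3f}  stdev={statistics.stdev(rouges):.3f}")
```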