microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
https://aka.ms/GeneralAI

[minillm] how to eval sft/llama-13B with 1 A100 GPU or 4 A10 GPUs? #58

Closed · SleepEarlyLiveLong closed this issue 1 year ago

SleepEarlyLiveLong commented 1 year ago

I want to run scripts/llama/eval/eval_main_dolly.sh to evaluate sft/llama-13B. I have access to either 1 A100 GPU or 4 A10 GPUs; how should I modify scripts/llama/eval/eval_main_dolly.sh to get it to work? I tried the following command on 1 A100 GPU:

python evaluate.py \
    --base-path /data/LMOps/minillm \
    --model-path checkpoints/llama/train/sft/llama-13B/ \
    --ckpt-name sft/llama-13B \
    --n-gpu 1 \
    --model-type llama \
    --data-dir /data/LMOps/minillm/data/dolly \
    --data-names dolly \
    --num-workers 0 \
    --dev-num -1 \
    --data-process-workers -1 \
    --json-data \
    --eval-batch-size 8 \
    --max-length 512 \
    --max-prompt-length 256 \
    --do-eval \
    --save /data/LMOps/minillm/checkpoints/llama/eval_main/ \
    --seed 10 \
    --deepspeed \
    --deepspeed_config /data/LMOps/minillm/configs/deepspeed/ds_config.json \
    --type eval_main \
    --do-sample \
    --top-k 0 \
    --top-p 1.0 \
    --temperature 1.0

The files in checkpoints/llama/train/sft/llama-13B/ are shown in the attached screenshot; pytorch_model.bin was converted from mp4/ using the released tools/convert_mp.py. However, running the command fails with the following error:

Traceback (most recent call last):
  File "/nlp_data/work/chentianyang/minillm/evaluate.py", line 145, in <module>
    main()
  File "/nlp_data/work/chentianyang/minillm/evaluate.py", line 138, in main
    evaluate_main(args, tokenizer, model, dataset["test"], "test", 0, device)       # eval core code
  File "/nlp_data/work/chentianyang/minillm/evaluate_main.py", line 161, in evaluate_main
    lm_loss, query_ids, response_ids, t_used_avg = run_model(args, tokenizer, model, dataset, epoch, device)    # lm_loss: average loss over the 500 sentences in the whole test set
  File "/nlp_data/work/chentianyang/minillm/evaluate_main.py", line 118, in run_model
    gen_out = model.generate(
  File "/root/anaconda3/envs/lmops/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/utils.py", line 1454, in generate
    return self.sample(
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/utils.py", line 2500, in sample
    world_size=mpu.get_model_parallel_world_size(),
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/initialize.py", line 92, in get_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_model_parallel_group())
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/initialize.py", line 78, in get_model_parallel_group
    assert _MODEL_PARALLEL_GROUP is not None, \
AssertionError: model parallel group is not initialized

How can I solve this problem?

t1101675 commented 1 year ago

For the single A100, I think the attribute is_model_parallel in config.json should be set to false, which indicates that no model parallelism is used. We will fix this in convert_mp.py in a later version.
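For example, something like the following flips that attribute in place (the config path below is an assumption based on the checkpoint directory mentioned above; adjust it to your setup):

# Set is_model_parallel to false in the converted checkpoint's config.json.
# The path is an assumption; point it at your own checkpoint directory.
import json

cfg_path = "checkpoints/llama/train/sft/llama-13B/config.json"
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["is_model_parallel"] = False   # single-GPU eval: no model parallelism
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)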

For 4xA10, you can uncomment lines 36-37 in scripts/llama/eval/eval_main_dolly.sh to use model parallelism. We will add an example script in the next version.
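For background, the AssertionError above comes from Megatron-style model-parallel bookkeeping: generation asks for the model-parallel world size, which only works after a model-parallel group has been created. Below is a rough sketch of how such a group is typically set up; it illustrates the pattern, not MiniLLM's exact mpu code:

# Sketch of Megatron-style model-parallel group creation (illustrative only).
import torch.distributed as dist

_MODEL_PARALLEL_GROUP = None

def initialize_model_parallel(model_parallel_size):
    """Split all ranks into consecutive groups of `model_parallel_size`."""
    global _MODEL_PARALLEL_GROUP
    world_size = dist.get_world_size()
    for start in range(0, world_size, model_parallel_size):
        ranks = list(range(start, start + model_parallel_size))
        group = dist.new_group(ranks)
        if dist.get_rank() in ranks:
            # Remember the group this rank belongs to; helpers such as
            # get_model_parallel_world_size() then query this group.
            _MODEL_PARALLEL_GROUP = group

Either route should clear the assertion: on one A100 the model-parallel code path is presumably skipped once is_model_parallel is false, while on 4xA10 the group actually gets initialized before generation runs.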

SleepEarlyLiveLong commented 1 year ago

Thanks a lot! I solved the issue.