microsoft / LMOps

General technology for enabling AI capabilities w/ LLMs and MLLMs
MIT License

[minillm] how to eval sft/llama-13B with 1 A100 GPU or 4 A10 GPUs? #58

Closed: SleepEarlyLiveLong closed this issue 1 year ago

SleepEarlyLiveLong commented 1 year ago

I want to run scripts/llama/eval/ to evaluate sft/llama-13B. I have access to either 1 A100 GPU or 4 A10 GPUs. How should I modify the scripts/llama/eval/ file to get it to work? I tried the following command on 1 A100 GPU:

python --base-path /data/LMOps/minillm --model-path checkpoints/llama/train/sft/llama-13B/ --ckpt-name sft/llama-13B --n-gpu 1 --model-type llama --data-dir /data/LMOps/minillm/data/dolly --data-names dolly --num-workers 0 --dev-num -1 --data-process-workers -1 --json-data --eval-batch-size 8 --max-length 512 --max-prompt-length 256 --do-eval --save /data/LMOps/minillm/checkpoints/llama/eval_main/ --seed 10 --deepspeed --deepspeed_config /data/LMOps/minillm/configs/deepspeed/ds_config.json --type eval_main --do-sample --top-k 0 --top-p 1.0 --temperature 1.0

The files in checkpoints/llama/train/sft/llama-13B/ are: [image], where pytorch_model.bin was converted from mp4/ using the released file tools/. However, the run fails with the following error:

Traceback (most recent call last):
  File "/nlp_data/work/chentianyang/minillm/", line 145, in <module>
  File "/nlp_data/work/chentianyang/minillm/", line 138, in main
    evaluate_main(args, tokenizer, model, dataset["test"], "test", 0, device)       # eval core code
  File "/nlp_data/work/chentianyang/minillm/", line 161, in evaluate_main
    lm_loss, query_ids, response_ids, t_used_avg = run_model(args, tokenizer, model, dataset, epoch, device)    # lm_loss: average loss over all 500 sentences in the test set
  File "/nlp_data/work/chentianyang/minillm/", line 118, in run_model
    gen_out = model.generate(
  File "/root/anaconda3/envs/lmops/lib/python3.9/site-packages/torch/autograd/", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/", line 1454, in generate
    return self.sample(
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/generation/", line 2500, in sample
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/", line 92, in get_model_parallel_world_size
    return torch.distributed.get_world_size(group=get_model_parallel_group())
  File "/nlp_data/work/chentianyang/minillm/transformers/src/transformers/mpu/", line 78, in get_model_parallel_group
    assert _MODEL_PARALLEL_GROUP is not None, \
AssertionError: model parallel group is not initialized

How can I solve this problem?

t1101675 commented 1 year ago

For the A100, I think the attribute is_model_parallel in config.json should be set to False, which indicates that model parallelism is not used. We will fix this in a later version.
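For the single-GPU case, the config change could be scripted roughly as follows. This is only a minimal sketch: the helper name disable_model_parallel is hypothetical, and it assumes that config.json sits in the checkpoint directory passed as --model-path and stores is_model_parallel as a top-level JSON key.

```python
import json
from pathlib import Path


def disable_model_parallel(config_path):
    """Set "is_model_parallel" to false in a checkpoint's config.json.

    Hypothetical helper, not part of the MiniLLM code base. Assumes the
    flag is a top-level key in the JSON file. Returns the previous value
    (None if the key was absent).
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    previous = config.get("is_model_parallel")
    # Single-GPU evaluation: no model parallel group will be initialized,
    # so the config must not request model parallelism.
    config["is_model_parallel"] = False
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return previous
```

It would be called once before evaluation, for example disable_model_parallel("checkpoints/llama/train/sft/llama-13B/config.json"), where the path is assumed from the --model-path in the command above. Editing the file by hand has the same effect.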

For 4xA10, you can uncomment lines 36-37 in scripts/llama/eval/ to use model parallelism. We can add an example script in the next version.

SleepEarlyLiveLong commented 1 year ago

Thanks a lot! I solved the issue.