Got the same issue.
It looks like this is a problem with the newer version of DeepSpeed. Everything works fine after I downgrade to deepspeed==0.7.3.
Many thanks for the help, and sorry for the late response. I just tested your suggestion and yes, Bloom176B works after downgrading DeepSpeed to 0.7.3, so this issue can be closed.
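For anyone who hits the same error, the downgrade itself is just a pinned install; a rough sketch, assuming a standard pip-managed environment inside the container:

```bash
# Pin DeepSpeed to the older version reported to work with Bloom176B here
pip install deepspeed==0.7.3

# Confirm the version that is actually picked up before re-running the benchmark
python -c "import deepspeed; print(deepspeed.__version__)"
```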
Run cmd & Error:
Using the nvidia-py docker 23.04 image on A100s and running the Bloom176B cmd:
deepspeed --num_gpus 8 --module inference_server.benchmark --model_name bigscience/bloom --model_class AutoModelForCausalLM --dtype fp16 --deployment_framework ds_inference --benchmark_cycles 5 --batch_size 1
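The container launch looked roughly like this (a sketch only; the image tag nvcr.io/nvidia/pytorch:23.04-py3 and the cache mount are assumptions, not the exact command used):

```bash
# Start the NGC PyTorch 23.04 container with all GPUs visible and the
# Hugging Face cache mounted so the Bloom checkpoint is reused across runs
docker run --gpus all --ipc=host -it --rm \
    -v $HOME/.cache/huggingface:/root/.cache/huggingface \
    nvcr.io/nvidia/pytorch:23.04-py3
```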
This leads to the following error, while the Bloom-7b1 model has no such issue.
The complete stack trace is attached here: Bloom176B_error_BFloat16.txt. Please kindly take a look; any prompt suggestions are welcome.
Further info:
Detailed logs:
Bloom176B checkpoint location: the checkpoint is downloaded by
Here are the dependent modules.
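If it helps with triage, the versions most relevant to this failure can be listed with something like this (the package selection is an assumption):

```bash
# Show only the packages most likely involved in this failure
pip list | grep -Ei "deepspeed|torch|transformers|accelerate"
```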
nvidia-smi info: 8 cards in total, and I only paste 2 cards' info
nvcc --version info