Open qingchu123 opened 1 year ago
Hi @qingchu123, could you report which version of DeepSpeed you are running?
@jomayeri
I ran pip show deepspeed, and it shows:
Name: deepspeed
Version: 0.9.3+5c6da1f0
Summary: DeepSpeed library
Home-page: http://deepspeed.ai
Author: DeepSpeed Team
Author-email: deepspeed-info@microsoft.com
License: Apache Software License 2.0
Location: /opt/conda/lib/python3.8/site-packages
Requires: hjson, ninja, numpy, packaging, psutil, py-cpuinfo, pydantic, torch, tqdm
I installed the latest DeepSpeed from git (commit from May 13, 2023, sha: 5c6da1f001f936234a31a238e71ca386e34eb51a).
@qingchu123 try adjusting --inference_tp_size to a lower number; you may not have enough GPUs across your nodes.
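For anyone hitting the same thing, here is a rough sketch of how one might sanity-check a tensor-parallel size against the GPUs listed in a hostfile before launching. It assumes the usual DeepSpeed hostfile layout (one "<hostname> slots=<n>" entry per line) and assumes that the tensor-parallel size has to divide the total GPU count evenly; that second point is my assumption, not something confirmed in this thread. The helper names and the hostnames worker-1 and worker-2 are hypothetical.

```python
from typing import Dict


def parse_hostfile(text: str) -> Dict[str, int]:
    """Parse hostfile text with lines like '<hostname> slots=<n>'."""
    hosts: Dict[str, int] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        name, slots = line.split()
        key, value = slots.split("=")
        if key != "slots":
            raise ValueError(f"unexpected hostfile entry: {raw!r}")
        hosts[name] = int(value)
    return hosts


def tp_size_fits(total_gpus: int, tp_size: int) -> bool:
    """Assumed rule: tp_size must not exceed, and must divide, the GPU count."""
    return 0 < tp_size <= total_gpus and total_gpus % tp_size == 0


def largest_valid_tp_size(total_gpus: int, requested: int) -> int:
    """Largest tp size <= requested that satisfies the assumed rule."""
    if total_gpus < 1 or requested < 1:
        raise ValueError("total_gpus and requested must both be at least 1")
    for tp in range(min(requested, total_gpus), 0, -1):
        if total_gpus % tp == 0:
            return tp
    raise ValueError("no valid tensor-parallel size found")  # unreachable: tp=1 always divides


if __name__ == "__main__":
    # Hypothetical two-node hostfile with two GPUs per node.
    example = "worker-1 slots=2\nworker-2 slots=2\n"
    total = sum(parse_hostfile(example).values())
    print(total, tp_size_fits(total, 8), largest_valid_tp_size(total, 8))
```

If the requested size does not fit, passing the value returned by largest_valid_tp_size as --inference_tp_size would be one reasonable starting point.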
Thanks, it works.
My training environment is a Docker image pulled from
deepspeed/deepspeed:v072_torch112_cu117
and I run it with
docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network train-net --name fuyx-work -v /home/fuyx/big_disk_1000/DeepSpeedExamples/applications/DeepSpeed-Chat:/root/DeepSpeed-Chat b1d
in an overlay Docker network. After completing the previous two steps, I run the last step with
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type multi_node --step 3
My hostfile is
and I get this error
The deepspeed command is below. I made no changes except reducing some batch sizes to ease the pressure on the GPUs: