Closed YJHMITWEB closed 1 year ago
The message says it's [--local-rank LOCAL_RANK] now. I believe there was a PR recently that changed it from --local_rank.
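For anyone hitting the same mismatch between the two spellings, here is a minimal sketch of an argument parser that accepts both --local-rank and --local_rank. It is not the actual Megatron-DeepSpeed code: the helper names build_parser and parse_local_rank are hypothetical, and the fallback to the LOCAL_RANK environment variable is an assumption about how the launcher exposes the rank.

```python
import argparse
import os


def build_parser():
    """Build a parser that tolerates both spellings of the local-rank flag."""
    parser = argparse.ArgumentParser(description="local-rank compatibility sketch")
    # Register both option strings on one argument so either launcher
    # convention is accepted; the value always lands in args.local_rank.
    parser.add_argument(
        "--local-rank",
        "--local_rank",
        dest="local_rank",
        type=int,
        # Assumption: if the flag is absent, fall back to the LOCAL_RANK
        # environment variable, and to 0 if that is unset as well.
        default=int(os.environ.get("LOCAL_RANK", 0)),
    )
    return parser


def parse_local_rank(argv):
    """Return the local rank from argv, ignoring unrelated arguments."""
    args, _unknown = build_parser().parse_known_args(argv)
    return args.local_rank


if __name__ == "__main__":
    print(parse_local_rank(["--local-rank", "1"]))
    print(parse_local_rank(["--local_rank", "1"]))
```

Registering both option strings on a single argument keeps the rest of the script unchanged, since it only ever reads args.local_rank.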
Thanks, I modified it and am now able to run it successfully.
Hi, I am using the latest repo. I have installed torch 1.13, CUDA 11.6, and DeepSpeed 0.10.0, and my system has two A100 GPUs. When I run ./ds_pretrain_gpt_125M_MoE64.sh under Megatron-DeepSpeed/examples_deepspeed/MoE, the following errors occur:

Further, I tried modifying the .sh script by adding "CUDA_VISIBLE_DEVICES=0,1" to the run_cmd, but the issue still exists.