hpcaitech / ColossalAI

Making large AI models cheaper, faster and more accessible
https://www.colossalai.org
Apache License 2.0

[BUG]: The auto-parallel training script auto_parallel_with_gpt.py is force-killed, hanging right after set_device #4418

Closed wangbluo closed 1 year ago

wangbluo commented 1 year ago

🐛 Describe the bug

Environment: PyTorch 1.12, 8x A100 per node

Script: auto_parallel_with_gpt.py

Launch commands:

Multi-node (2 nodes x 8 GPUs):
colossalai run --nproc_per_node 8 --host 10.90.8.162,10.90.9.27 --master_addr 10.90.8.162 auto_parallel_with_gpt.py

Single node (8 GPUs):
torchrun --standalone --nproc_per_node=8 auto_parallel_with_gpt.py

Both the multi-node and single-node runs are force-killed. The multi-node error output is:

root@test-luo-0:/workspace/workfile/ColossalAI-main/examples/language/gpt/experiments/auto_parallel# colossalai run --nproc_per_node 8 --host 10.90.8.162,10.90.9.27 --master_addr 10.90.8.162 auto_parallel_with_gpt.py
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 4 is bound to device 4
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 7 is bound to device 7
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 5 is bound to device 5
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 6 is bound to device 6
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 9 is bound to device 1
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 11 is bound to device 3
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 14 is bound to device 6
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 8 is bound to device 0
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 15 is bound to device 7
[08/11/23 17:21:47] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 10 is bound to device 2
INFO colossalai - colossalai - INFO: process rank 13 is bound to device 5
INFO colossalai - colossalai - INFO: process rank 12 is bound to device 4
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1623 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 1624) of binary: /usr/local/bin/python3.9
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1980 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 1 (pid: 1981) of binary: /usr/local/bin/python3.9
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

auto_parallel_with_gpt.py FAILED

Failures:
[1]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-0
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 1982)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1982
[2]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-0
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 1983)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1983
[3]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-0
  rank      : 4 (local_rank: 4)
  exitcode  : -9 (pid: 1984)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1984
[4]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-0
  rank      : 5 (local_rank: 5)
  exitcode  : -9 (pid: 1985)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1985
[5]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-0
  rank      : 6 (local_rank: 6)
  exitcode  : -9 (pid: 1986)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1986
[6]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-0
  rank      : 7 (local_rank: 7)
  exitcode  : -9 (pid: 1987)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1987

Root Cause (first observed failure):
[0]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-0
  rank      : 1 (local_rank: 1)
  exitcode  : -9 (pid: 1981)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1981

Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

auto_parallel_with_gpt.py FAILED

Failures:
[1]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-1
  rank      : 10 (local_rank: 2)
  exitcode  : -9 (pid: 1625)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1625
[2]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-1
  rank      : 11 (local_rank: 3)
  exitcode  : -9 (pid: 1626)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1626
[3]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-1
  rank      : 12 (local_rank: 4)
  exitcode  : -9 (pid: 1627)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1627
[4]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-1
  rank      : 13 (local_rank: 5)
  exitcode  : -9 (pid: 1628)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1628
[5]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-1
  rank      : 14 (local_rank: 6)
  exitcode  : -9 (pid: 1629)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1629
[6]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-1
  rank      : 15 (local_rank: 7)
  exitcode  : -9 (pid: 1630)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1630

Root Cause (first observed failure):
[0]:
  time      : 2023-08-11_17:21:54
  host      : test-luo-1
  rank      : 9 (local_rank: 1)
  exitcode  : -9 (pid: 1624)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1624

Error: failed to run torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=10.90.8.162:29500 --rdzv_id=colossalai-default-job auto_parallel_with_gpt.py on 10.90.8.162, is localhost: True, exception: Encountered a bad command exit code!

Command: 'cd /workspace/workfile/ColossalAI-main/examples/language/gpt/experiments/auto_parallel && export SHELL="/bin/bash" PWD="/workspace/workfile/ColossalAI-main/examples/language/gpt/experiments/auto_parallel" LOGNAME="root" MOTD_SHOWN="pam" HOME="/root" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" SSH_CONNECTION="10.90.9.27 44938 10.90.8.162 22" TERM="xterm" USER="root" SHLVL="1" SSH_CLIENT="10.90.9.27 44938 22" PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" SSH_TTY="/dev/pts/1" _="/usr/local/bin/colossalai" OLDPWD="/root" LC_CTYPE="C.UTF-8" && torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --rdzv_backend=c10d --rdzv_endpoint=10.90.8.162:29500 --rdzv_id=colossalai-default-job auto_parallel_with_gpt.py'

Exit code: 1

Stdout: already printed

Stderr: already printed

Error: failed to run torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=10.90.8.162:29500 --rdzv_id=colossalai-default-job auto_parallel_with_gpt.py on 10.90.9.27, is localhost: False, exception: Encountered a bad command exit code!

Command: 'cd /workspace/workfile/ColossalAI-main/examples/language/gpt/experiments/auto_parallel && export SHELL="/bin/bash" PWD="/workspace/workfile/ColossalAI-main/examples/language/gpt/experiments/auto_parallel" LOGNAME="root" MOTD_SHOWN="pam" HOME="/root" LS_COLORS="rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:" SSH_CONNECTION="10.90.9.27 44938 10.90.8.162 22" TERM="xterm" USER="root" SHLVL="1" SSH_CLIENT="10.90.9.27 44938 22" PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin" SSH_TTY="/dev/pts/1" _="/usr/local/bin/colossalai" OLDPWD="/root" LC_CTYPE="C.UTF-8" && torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --rdzv_backend=c10d --rdzv_endpoint=10.90.8.162:29500 --rdzv_id=colossalai-default-job auto_parallel_with_gpt.py'

Exit code: 1

Stdout: already printed

Stderr: already printed

====== Training on All Nodes =====
10.90.8.162: failure
10.90.9.27: failure

====== Stopping All Nodes =====
10.90.8.162: finish
10.90.9.27: finish

The single-node (8-GPU) error output is:

root@test-luo-0:/workspace/workfile/ColossalAI-main/examples/language/gpt/experiments/auto_parallel# torchrun --standalone --nproc_per_node=8 auto_parallel_with_gpt.py
WARNING:torch.distributed.run:


Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.


[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 7 is bound to device 7
[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 1 is bound to device 1
[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 6 is bound to device 6
[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 0 is bound to device 0
[08/11/23 16:44:07] INFO colossalai - colossalai - INFO: /usr/local/lib/python3.9/site-packages/colossalai/context/parallel_context.py:522 set_device
INFO colossalai - colossalai - INFO: process rank 4 is bound to device 4
INFO colossalai - colossalai - INFO: process rank 3 is bound to device 3
INFO colossalai - colossalai - INFO: process rank 2 is bound to device 2
INFO colossalai - colossalai - INFO: process rank 5 is bound to device 5
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1552 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1553 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1556 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1557 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1558 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 1554) of binary: /usr/local/bin/python3.9
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/usr/local/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

auto_parallel_with_gpt.py FAILED

Failures:
[1]:
  time      : 2023-08-11_16:44:10
  host      : test-luo-0
  rank      : 3 (local_rank: 3)
  exitcode  : -9 (pid: 1555)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1555
[2]:
  time      : 2023-08-11_16:44:10
  host      : test-luo-0
  rank      : 7 (local_rank: 7)
  exitcode  : -9 (pid: 1559)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1559

Root Cause (first observed failure):
[0]:
  time      : 2023-08-11_16:44:10
  host      : test-luo-0
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 1554)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 1554

Environment

No response

wangbluo commented 1 year ago

This turned out to be an environment problem and has been resolved. If you run into something similar, run an NCCL test to check whether the different nodes can actually communicate with each other. Also, other users have reported that multi-node multi-GPU runs require passwordless SSH between all nodes, which may be a useful reference.
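
For reference, a minimal NCCL all_reduce check along the lines below can verify cross-node communication before running the full script. This is only a sketch: the file name nccl_sanity_check.py, the tensor size, and the launch line in the comments are illustrative assumptions, not taken from this issue.

# nccl_sanity_check.py (hypothetical name): verify that NCCL collectives
# work across all ranks. Launch one copy per node with the same rendezvous
# settings as the failing job, e.g.:
#   torchrun --nproc_per_node=8 --nnodes=2 --node_rank=<0|1> \
#     --rdzv_backend=c10d --rdzv_endpoint=10.90.8.162:29500 nccl_sanity_check.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun exports RANK, LOCAL_RANK and WORLD_SIZE for every process.
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Bind this process to its GPU before creating the NCCL communicator.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")

    # Each rank contributes a tensor of ones; after the (sum) all_reduce,
    # every rank should hold a tensor filled with world_size.
    x = torch.ones(1024, device="cuda")
    dist.all_reduce(x)
    ok = bool(torch.all(x == world_size).item())
    print(f"rank {rank}/{world_size}: all_reduce {'OK' if ok else 'FAILED'}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()

If this check also hangs or gets killed, launching it with NCCL_DEBUG=INFO set in the environment usually surfaces the transport problem. For the colossalai run path, also make sure the launch node can SSH without a password to every address listed in --host.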
