multi-gpu training problem

JasonChai1019 commented 2 months ago

hello, I encountered some problems while using this code for multi-gpu training. first I tried to run it with "python3 train_dafnet.py --model_name "llama-2-7b" --device 0 --extra_device 1 2 3" and I set the 'aux_module_parallel' as false. It worked but GPU usage is very low and the training took too long.

then i run it in 8 a100 gpus with "python3 train_dafnet.py --model_name "llama-2-7b" --device 0 --extra_device 1 2 3 4 5 6 7", and I set the 'aux_module_parallel' as true. However, the training got stuck at the beginning (with no successful cases), and the log was filled with many NCCL-related messages.

please help me fix these problem, thank you.

qizhou000 commented 2 months ago

Hi! Please show me the detailed errors so that I can solve them for you.

JasonChai1019 commented 2 months ago

I run it in 8 a100 gpus with "python3 train_dafnet.py --model_name "llama-2-7b" --device 0 --extra_device 1 2 3 4 5 6 7", and I set the 'aux_module_parallel' as true.

The data was loaded normally Here are some of the contents displayed in the running log

(4096, 11008) Auxiliary model devices: [2, 4, 5, 6, 7, 4] (11008, 4096) Auxiliary model devices: [3, 5, 6]

psx5pdexqsya8hrc-worker-0:3046:3046 [3] NCCL INFO cudaDriverVersion 12020 psx5pdexqsya8hrc-worker-0:3046:3046 [3] NCCL INFO Bootstrap : Using eth0:10.166.178.84<0> psx5pdexqsya8hrc-worker-0:3046:3046 [3] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation NCCL version 2.14.3+cuda11.7 psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO NET/IB : Using [0]mlx5_4:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_0:1/RoCE [3]mlx5_7:1/RoCE [4]mlx5_5:1/RoCE [5]mlx5_3:1/RoCE [6]mlx5_1:1/RoCE [7]mlx5_6:1/RoCE ; OOB eth0:10.166.178.84<0> psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Using network IB psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO Using network IB psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO Using network IB psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Setting affinity for GPU 3 to f0,00000000,00000000,00000000,000000f0 psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Channel 00/24 : 0 1 2 psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Channel 01/24 : 0 1 2 psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Channel 02/24 : 0 1 2 psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Channel 03/24 : 0 1 2

psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO Channel 00/0 : 2[e1000] -> 0[6a000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO Channel 00/0 : 1[a8000] -> 2[e1000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Channel 00/0 : 0[6a000] -> 1[a8000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO Channel 01/0 : 2[e1000] -> 0[6a000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Channel 01/0 : 0[6a000] -> 1[a8000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO Channel 01/0 : 1[a8000] -> 2[e1000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO Channel 02/0 : 2[e1000] -> 0[6a000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Channel 02/0 : 0[6a000] -> 1[a8000] via P2P/direct pointer/read psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO Channel 02/0 : 1[a8000] -> 2[e1000] via P2P/direct pointer/read

psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO Connected all trees psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO Connected all trees psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512 psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512 psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO Connected all trees psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO threadThresholds 8/8/64 | 24/8/64 | 512 | 512 psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO 24 coll channels, 32 p2p channels, 32 p2p channels per peer psx5pdexqsya8hrc-worker-0:3046:3144 [6] NCCL INFO comm 0x8f175470 rank 2 nranks 3 cudaDev 6 busId e1000 - Init COMPLETE psx5pdexqsya8hrc-worker-0:3046:3142 [3] NCCL INFO comm 0x8f166710 rank 0 nranks 3 cudaDev 3 busId 6a000 - Init COMPLETE psx5pdexqsya8hrc-worker-0:3046:3143 [5] NCCL INFO comm 0x8f16f880 rank 1 nranks 3 cudaDev 5 busId a8000 - Init COMPLETE

qizhou000 commented 2 months ago

Hello! This is the NCCL running log, and it doesn't seem like any errors have occurred.

qizhou000 / DAFNet

multi-gpu training problem #1