modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
Apache License 2.0

Problem with training part. #59

Closed NathanJHLee closed 6 months ago

NathanJHLee commented 6 months ago

Hi, I am Nathan and I am facing a problem with the training part.

My environment: CentOS 7.5

PIP

pytorch-wpe 0.0.1
rotary-embedding-torch 0.5.3
torch 1.12.1+cu113        // To use CUDA, I reinstalled torch and torchaudio.
torch-complex 0.4.3
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113

rpm

libcudnn8-8.2.0.53-1.cuda11.3.x86_64
libcudnn8-devel-8.2.0.53-1.cuda11.3.x86_64
libnccl-2.9.9-1+cuda11.3.x86_64
libnccl-devel-2.9.9-1+cuda11.3.x86_64
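
For reference, here is a quick way to confirm which CUDA/NCCL versions the installed PyTorch wheel actually uses (hypothetical snippet, not from the original report; the wheel bundles its own NCCL, so the 2.10.3 in the traceback below does not have to match the system libnccl 2.9.9 RPM):

```python
# Hypothetical environment check: print the versions PyTorch itself reports.
import torch

print("torch          :", torch.__version__)           # e.g. 1.12.1+cu113
print("CUDA available :", torch.cuda.is_available())
print("CUDA (build)   :", torch.version.cuda)           # CUDA the wheel was built against
print("cuDNN          :", torch.backends.cudnn.version())
print("NCCL (bundled) :", torch.cuda.nccl.version())    # NCCL shipped inside the wheel
print("GPU count      :", torch.cuda.device_count())
```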

To run a script, I followed 'egs/voxceleb/sv-ecapa/run.sh' and set 4 GPUs (it also fails when I set a single GPU). I got the error below.

Stage3: Training the speaker model...
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

2024-02-15 14:31:58,001 - INFO: Use GPU: 3 for training.
2024-02-15 14:31:58,003 - INFO: Use GPU: 2 for training.
2024-02-15 14:31:58,009 - INFO: Use GPU: 1 for training.
2024-02-15 14:31:58,011 - INFO: Use GPU: 0 for training.
Traceback (most recent call last):
  File "speakerlab/bin/train.py", line 176, in <module>
    main()
  File "speakerlab/bin/train.py", line 60, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
[the same traceback and RuntimeError are printed by the other three ranks]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 121550 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 121547) of binary: /home/asr/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
  File "/home/asr/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:
[1]:
  time       : 2024-02-15_14:32:03
  host       : e7bcf3a85e2c
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 121548)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-02-15_14:32:03
  host       : e7bcf3a85e2c
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 121549)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-02-15_14:32:03
  host       : e7bcf3a85e2c
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 121547)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
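
To check whether the failure is in NCCL initialization itself rather than in the 3D-Speaker code, a minimal standalone smoke test can be launched with the same torchrun setup as run.sh (hypothetical script, not part of the repo); running it with NCCL_DEBUG=INFO usually reveals the underlying cause of ncclSystemError:

```python
# minimal_ddp_check.py -- hypothetical NCCL/DDP smoke test, launched e.g. with:
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=4 minimal_ddp_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # same backend train.py uses

    # All-reduce a tiny tensor: if this raises ncclSystemError, the problem
    # is in the NCCL/GPU/driver setup, not in the training script.
    t = torch.ones(1, device=local_rank)
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```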

yfchenlucky commented 6 months ago

I successfully re-cloned the repository and executed the run.sh script without encountering any errors. The versions of PyTorch and CUDA installed on my system are 1.12.0 and 10.2, respectively.

yfchenlucky commented 6 months ago

You can try the following steps (a rough sketch of both is shown after the list):

  1. Verify the execution permissions of the Python script to ensure it is runnable.
  2. The speakerlab inside the run.sh directory is a symbolic link; consider copying the directory that 3D-Speaker/speakerlab points to and replacing the symbolic link with the actual directory.
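
A rough sketch of both checks (paths are assumptions based on the default egs/voxceleb/sv-ecapa layout, not code from the repo):

```python
# Hypothetical helper illustrating the two checks; run from the recipe directory.
import os
import shutil

# 1. Verify the training entry point is readable/executable by the current user.
train_py = "speakerlab/bin/train.py"
print("readable  :", os.access(train_py, os.R_OK))
print("executable:", os.access(train_py, os.X_OK))

# 2. If 'speakerlab' here is a symlink, replace it with a copy of the real package.
link = "speakerlab"
if os.path.islink(link):
    target = os.path.realpath(link)          # e.g. <repo>/speakerlab
    os.unlink(link)                          # remove the symlink itself
    shutil.copytree(target, link)            # copy the actual directory in place
    print(f"replaced symlink with a copy of {target}")
```
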
NathanJHLee commented 6 months ago

Hi, I have one more question.

I believe I solved the NCCL problem, but I got another error.

I encounter an error at 'model = torch.nn.parallel.DistributedDataParallel(model)' in train.py: a tensor size mismatch. I think '192' is the embedding_size according to ecapa_tdnn.yaml. Please check my error log. Thank you.

Here is the error log when I try to use a single GPU:

Stage3: Training the speaker model...
2024-02-22 18:15:58,831 - INFO: Use GPU: 1 for training.
d5acf849f4d8:167887:167887 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
d5acf849f4d8:167887:167887 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d5acf849f4d8:167887:167887 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
d5acf849f4d8:167887:167887 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
d5acf849f4d8:167887:167887 [1] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.1
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 00/32 : 0
[NCCL INFO Channel 01/32 through 31/32 and tree-topology lines omitted]
d5acf849f4d8:167887:167958 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
d5acf849f4d8:167887:167958 [1] NCCL INFO Connected all rings
d5acf849f4d8:167887:167958 [1] NCCL INFO Connected all trees
d5acf849f4d8:167887:167958 [1] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d5acf849f4d8:167887:167958 [1] NCCL INFO comm 0x7fa6b0002010 rank 0 nranks 1 cudaDev 1 busId 13000 - Init COMPLETE
Traceback (most recent call last):
  File "speakerlab/bin/train.py", line 193, in <module>
    main()
  File "speakerlab/bin/train.py", line 70, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 580, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 597, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: The size of tensor a (192) must match the size of tensor b (0) at non-singleton dimension 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 167887) of binary: /home/asr/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
  File "/home/asr/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-02-22_18:16:07
  host       : d5acf849f4d8
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 167887)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
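
Not from the thread, but one way to narrow down the 192-vs-0 mismatch is to list the shapes of all parameters and buffers right before the DistributedDataParallel call in train.py; a parameter or buffer with a zero-sized dimension (for example, one sized by the number of speakers found during data preparation) could produce this kind of broadcast error. A minimal diagnostic sketch, where `model` stands for the speaker model built in train.py:

```python
# Hypothetical diagnostic: report any zero-sized parameter/buffer before DDP wrapping.
for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
    if 0 in tuple(tensor.shape):
        print(f"zero-sized tensor: {name} shape={tuple(tensor.shape)}")
```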