modelscope / 3D-Speaker

A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
Apache License 2.0

Problem with training part. #59

Closed NathanJHLee closed 6 months ago

NathanJHLee commented 6 months ago

Hi, I am Nathan and I am facing a problem with the training part.

My environment: CentOS 7.5

PIP

pytorch-wpe 0.0.1
rotary-embedding-torch 0.5.3
torch 1.12.1+cu113        // To use CUDA, I reinstalled torch and torchaudio.
torch-complex 0.4.3
torchaudio 0.12.1+cu113
torchvision 0.13.1+cu113

rpm

libcudnn8-8.2.0.53-1.cuda11.3.x86_64
libcudnn8-devel-8.2.0.53-1.cuda11.3.x86_64
libnccl-2.9.9-1+cuda11.3.x86_64
libnccl-devel-2.9.9-1+cuda11.3.x86_64
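
For reference, here is a quick way to confirm which CUDA/NCCL versions the installed PyTorch wheel actually uses (hypothetical snippet, not from the original report; the wheel bundles its own NCCL, so the 2.10.3 in the traceback below does not have to match the system libnccl 2.9.9 RPM):

```python
# Hypothetical environment check: print the versions PyTorch itself reports.
import torch

print("torch          :", torch.__version__)           # e.g. 1.12.1+cu113
print("CUDA available :", torch.cuda.is_available())
print("CUDA (build)   :", torch.version.cuda)           # CUDA the wheel was built against
print("cuDNN          :", torch.backends.cudnn.version())
print("NCCL (bundled) :", torch.cuda.nccl.version())    # NCCL shipped inside the wheel
print("GPU count      :", torch.cuda.device_count())
```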

To run a script, I followed 'egs/voxceleb/sv-ecapa/run.sh' and set 4 GPUs (it also fails when I set a single GPU). I got the error below.

Stage3: Training the speaker model...
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.

2024-02-15 14:31:58,001 - INFO: Use GPU: 3 for training.
2024-02-15 14:31:58,003 - INFO: Use GPU: 2 for training.
2024-02-15 14:31:58,009 - INFO: Use GPU: 1 for training.
2024-02-15 14:31:58,011 - INFO: Use GPU: 0 for training.
Traceback (most recent call last):
  File "speakerlab/bin/train.py", line 176, in <module>
    main()
  File "speakerlab/bin/train.py", line 60, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 646, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/utils.py", line 89, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1191, unhandled system error, NCCL version 2.10.3
ncclSystemError: System call (e.g. socket, malloc) or external library call failed or device error. It can be also caused by unexpected exit of a remote peer, you can check NCCL warnings for failure reason and see if there is connection closure by a peer.
[the same traceback and RuntimeError are printed by the other three ranks]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 121550 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 121547) of binary: /home/asr/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
  File "/home/asr/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 761, in main
    run(args)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 752, in run
    elastic_launch(
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 245, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:
[1]:
  time       : 2024-02-15_14:32:03
  host       : e7bcf3a85e2c
  rank       : 1 (local_rank: 1)
  exitcode   : 1 (pid: 121548)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
  time       : 2024-02-15_14:32:03
  host       : e7bcf3a85e2c
  rank       : 2 (local_rank: 2)
  exitcode   : 1 (pid: 121549)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

Root Cause (first observed failure):
[0]:
  time       : 2024-02-15_14:32:03
  host       : e7bcf3a85e2c
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 121547)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
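
To check whether the failure is in NCCL initialization itself rather than in the 3D-Speaker code, a minimal standalone smoke test can be launched with the same torchrun setup as run.sh (hypothetical script, not part of the repo); running it with NCCL_DEBUG=INFO usually reveals the underlying cause of ncclSystemError:

```python
# minimal_ddp_check.py -- hypothetical NCCL/DDP smoke test, launched e.g. with:
#   NCCL_DEBUG=INFO torchrun --nproc_per_node=4 minimal_ddp_check.py
import os
import torch
import torch.distributed as dist

def main():
    local_rank = int(os.environ["LOCAL_RANK"])   # set by torchrun
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")      # same backend train.py uses

    # All-reduce a tiny tensor: if this raises ncclSystemError, the problem
    # is in the NCCL/GPU/driver setup, not in the training script.
    t = torch.ones(1, device=local_rank)
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}: all_reduce ok, value = {t.item()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```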

yfchenlucky commented 6 months ago

I successfully re-cloned the repository and executed the run.sh script without encountering any errors. The versions of PyTorch and CUDA installed on my system are 1.12.0 and 10.2, respectively.

yfchenlucky commented 6 months ago

You can try the following steps (a rough sketch of both is shown after the list):

  1. Verify the execution permissions of the Python script to ensure it is runnable.
  2. The speakerlab inside the run.sh directory is a symbolic link; consider copying the directory that 3D-Speaker/speakerlab points to and replacing the symbolic link with the actual directory.
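
A rough sketch of both checks (paths are assumptions based on the default egs/voxceleb/sv-ecapa layout, not code from the repo):

```python
# Hypothetical helper illustrating the two checks; run from the recipe directory.
import os
import shutil

# 1. Verify the training entry point is readable/executable by the current user.
train_py = "speakerlab/bin/train.py"
print("readable  :", os.access(train_py, os.R_OK))
print("executable:", os.access(train_py, os.X_OK))

# 2. If 'speakerlab' here is a symlink, replace it with a copy of the real package.
link = "speakerlab"
if os.path.islink(link):
    target = os.path.realpath(link)          # e.g. <repo>/speakerlab
    os.unlink(link)                          # remove the symlink itself
    shutil.copytree(target, link)            # copy the actual directory in place
    print(f"replaced symlink with a copy of {target}")
```
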
NathanJHLee commented 6 months ago

Hi, I have one more question.

I believe I solved the NCCL problem, but I got another error.

I encounter an error at 'model = torch.nn.parallel.DistributedDataParallel(model)' in train.py: a tensor size mismatch. I think '192' is the embedding_size according to ecapa_tdnn.yaml. Please check my error log. Thank you.

Here is the error log when I try to use a single GPU:

Stage3: Training the speaker model...
2024-02-22 18:15:58,831 - INFO: Use GPU: 1 for training.
d5acf849f4d8:167887:167887 [1] NCCL INFO Bootstrap : Using eth0:172.17.0.2<0>
d5acf849f4d8:167887:167887 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d5acf849f4d8:167887:167887 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
d5acf849f4d8:167887:167887 [1] NCCL INFO NET/Socket : Using [0]eth0:172.17.0.2<0>
d5acf849f4d8:167887:167887 [1] NCCL INFO Using network Socket
NCCL version 2.10.3+cuda11.1
d5acf849f4d8:167887:167958 [1] NCCL INFO Channel 00/32 : 0
[NCCL INFO Channel 01/32 through 31/32 and tree-topology lines omitted]
d5acf849f4d8:167887:167958 [1] NCCL INFO Setting affinity for GPU 1 to 5555,55555555,55555555
d5acf849f4d8:167887:167958 [1] NCCL INFO Connected all rings
d5acf849f4d8:167887:167958 [1] NCCL INFO Connected all trees
d5acf849f4d8:167887:167958 [1] NCCL INFO 32 coll channels, 32 p2p channels, 32 p2p channels per peer
d5acf849f4d8:167887:167958 [1] NCCL INFO comm 0x7fa6b0002010 rank 0 nranks 1 cudaDev 1 busId 13000 - Init COMPLETE
Traceback (most recent call last):
  File "speakerlab/bin/train.py", line 193, in <module>
    main()
  File "speakerlab/bin/train.py", line 70, in main
    model = torch.nn.parallel.DistributedDataParallel(model)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 580, in __init__
    self._sync_params_and_buffers(authoritative_rank=0)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 597, in _sync_params_and_buffers
    self._distributed_broadcast_coalesced(
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1334, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: The size of tensor a (192) must match the size of tensor b (0) at non-singleton dimension 1
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 167887) of binary: /home/asr/miniconda3/envs/3D-Speaker/bin/python
Traceback (most recent call last):
  File "/home/asr/miniconda3/envs/3D-Speaker/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/asr/miniconda3/envs/3D-Speaker/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

speakerlab/bin/train.py FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time       : 2024-02-22_18:16:07
  host       : d5acf849f4d8
  rank       : 0 (local_rank: 0)
  exitcode   : 1 (pid: 167887)
  error_file : <N/A>
  traceback  : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
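
Not from the thread, but one way to narrow down the 192-vs-0 mismatch is to list the shapes of all parameters and buffers right before the DistributedDataParallel call in train.py; a parameter or buffer with a zero-sized dimension (for example, one sized by the number of speakers found during data preparation) could produce this kind of broadcast error. A minimal diagnostic sketch, where `model` stands for the speaker model built in train.py:

```python
# Hypothetical diagnostic: report any zero-sized parameter/buffer before DDP wrapping.
for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
    if 0 in tuple(tensor.shape):
        print(f"zero-sized tensor: {name} shape={tuple(tensor.shape)}")
```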