Closed ShiyangLai closed 2 weeks ago
I think these 2 issues are related to each other:
https://github.com/pyannote/pyannote-audio/issues/1538#issuecomment-2033001593
Tested versions
System information
SUSE Linux 15 SP3 - pyannote.audio-3.1.1 - ALCF Polaris
Issue description
I tried to train a VAD model with 4 GPUs on a single node. The error message is:
Traceback (most recent call last):
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/pipelines/vad.py", line 157, in <module>
    model = train_model(args, protocol)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/pipelines/vad.py", line 118, in train_model
    trainer.fit(model)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 949, in _run
    call._call_setup_hook(self)  # allow user to set up LightningModule in accelerator environment
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 94, in _call_setup_hook
    _call_lightning_module_hook(trainer, "setup", stage=fn)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/core/model.py", line 264, in setup
    _ = self.example_output
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/functools.py", line 967, in __get__
    val = self.func(instance)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/core/model.py", line 195, in example_output
    example_output = self(example_input_array)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/models/segmentation/PyanNet.py", line 172, in forward
    outputs = self.sincnet(waveforms)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/pyannote/audio/models/blocks/sincnet.py", line 81, in forward
    outputs = self.wav_norm1d(waveforms)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/instancenorm.py", line 87, in forward
    return self._apply_instance_norm(input)
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/modules/instancenorm.py", line 36, in _apply_instance_norm
    return F.instance_norm(
  File "/lus/grand/projects/BPC/ra/shiyanglai/conv_rec_framework/polaris/build/pyannote2/lib/python3.8/site-packages/torch/nn/functional.py", line 2526, in instance_norm
    return torch.instance_norm(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:3 and cuda:0! (when checking argument for argument weight in method wrapper_CUDA__cudnn_batch_norm)
[rank: 1] Child process with PID 35924 terminated with code 1. Forcefully terminating all other processes to avoid zombies 🧟
/var/spool/pbs/mom_priv/jobs/1788633.polaris-pbs-01.hsn.cm.polaris.alcf.anl.gov.SC: line 37: 35559 Killed python vad.py
It seems to me like the data was not loaded to GPU devices. Below is a simplified version of my training script:
And here is my bash script:
Minimal reproduction example (MRE)
None
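Not a reproduction, but a possible way to narrow this down: the RuntimeError in the traceback says the `weight` argument and another tensor were on different devices (`cuda:3` and `cuda:0`). A minimal sketch of a diagnostic helper follows, assuming the model behaves like a `torch.nn.Module` (it exposes `named_parameters()` and `named_buffers()`, and each tensor has a `.device`). The function names `devices_in_use` and `find_misplaced_tensors` are hypothetical and not part of pyannote.audio or PyTorch Lightning.

```python
from collections import defaultdict


def devices_in_use(module):
    """Group parameter and buffer names by the device they live on.

    Assumption: `module` is duck-typed like torch.nn.Module, i.e. it has
    named_parameters() and named_buffers(), each yielding (name, tensor)
    pairs where `tensor.device` is set.
    """
    by_device = defaultdict(list)
    for name, tensor in module.named_parameters():
        by_device[str(tensor.device)].append(name)
    for name, tensor in module.named_buffers():
        by_device[str(tensor.device)].append(name)
    return dict(by_device)


def find_misplaced_tensors(module, example_input):
    """Return the sorted names of tensors not on example_input's device."""
    expected = str(example_input.device)
    return sorted(
        name
        for device, names in devices_in_use(module).items()
        if device != expected
        for name in names
    )
```

Calling `find_misplaced_tensors(model, example_input)` in each rank just before the failing forward pass, and printing the result together with the rank, should show whether it is the model weights or the example input that ended up on the unexpected GPU. An empty list means everything is on the same device as the input.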