lucidrains / audiolm-pytorch

Implementation of AudioLM, a SOTA Language Modeling Approach to Audio Generation out of Google Research, in Pytorch

multi-gpu training not working with accelerate #234

Closed. FrancescoVV closed this issue 9 months ago.

FrancescoVV commented 9 months ago

I am having issues running the code with accelerate: the semantic model crashes after the first iteration, and the coarse and fine models make no progress after the first two iterations.

The codebase works for single-GPU training on my machine.

Could anyone provide me with a working Dockerfile and a minimal trainer so I can build a container where accelerate works?

lucidrains commented 9 months ago

@FrancescoVV i don't think a Docker container would solve the issue here

could you post the full stack trace?

FrancescoVV commented 9 months ago

This is what I get with the fine model. I get a similar issue with the coarse one, and a different one with the semantic model, which I will post later if needed. My accelerate env is:

- `Accelerate` version: 0.22.0
- Platform: Linux-5.4.0-126-generic-x86_64-with-glibc2.17
- Python version: 3.8.18
- Numpy version: 1.24.4
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- PyTorch XPU available: False
- PyTorch NPU available: False
- System RAM: 2015.69 GB
- GPU type: NVIDIA A100-SXM4-80GB
- `Accelerate` default config:
        - compute_environment: LOCAL_MACHINE
        - distributed_type: MULTI_GPU
        - mixed_precision: no
        - use_cpu: False
        - debug: False
        - num_processes: 8
        - machine_rank: 0
        - num_machines: 1
        - gpu_ids: all
        - rdzv_backend: static
        - same_network: True
        - main_training_function: main
        - downcast_bf16: no
        - tpu_use_cluster: False
        - tpu_use_sudo: False
        - tpu_env: []

My training script is:

import math
import wave
import struct
import os
import urllib.request
import tarfile
from audiolm_pytorch import (
    SoundStream,
    SoundStreamTrainer,
    HubertWithKmeans,
    SemanticTransformer,
    SemanticTransformerTrainer,
    HubertWithKmeans,
    CoarseTransformer,
    CoarseTransformerWrapper,
    CoarseTransformerTrainer,
    FineTransformer,
    FineTransformerWrapper,
    FineTransformerTrainer,
    AudioLM,
)
from torch import nn
import torch
import torchaudio

def main():
    # define all dataset paths, checkpoints, etc
    dataset_folder = "./data/LibriTTS"
    hubert_ckpt = "hubert/hubert_base_ls960.pt"
    hubert_quantizer = (
        f"hubert/hubert_base_ls960_L9_km500.bin"  # listed in row "HuBERT Base (~95M params)", column Quantizer
    )

    wav2vec = HubertWithKmeans(checkpoint_path=f"./{hubert_ckpt}", kmeans_path=f"./{hubert_quantizer}")

    from audiolm_pytorch import EncodecWrapper

    encodec = EncodecWrapper()

    fine_transformer = FineTransformer(
        num_coarse_quantizers=3, num_fine_quantizers=5, codebook_size=1024, dim=512, depth=6
    )

    trainer = FineTransformerTrainer(
        transformer=fine_transformer,
        codec=encodec,
        folder=dataset_folder,
        batch_size=64,
        data_max_length=50000,
        save_results_every=10000,
        save_model_every=10000,
        num_train_steps=50_001,
    )

    trainer.train()

if __name__ == "__main__":
    main()
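
(For context: the script is launched with accelerate launch train_fine.py under the 8-process config above; the launcher path shows up in the traceback below.)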

This is the error I get:

training with dataset of 110675 samples and validating with randomly splitted 5825 samples
0: loss: 74.841064453125
0: valid loss 69.40010833740234
0: saving model to results
1: loss: 66.66574096679688
2: loss: 62.35995101928711
[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1800928 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801008 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801014 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801075 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:828] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801259 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801278 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:828] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=6, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1801489 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 3949 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 1 (pid: 3950) of binary: /miniconda/bin/python
Traceback (most recent call last):
File "/miniconda/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/miniconda/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/miniconda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/miniconda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/miniconda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/miniconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/miniconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=====================================================
train_fine.py FAILED
-----------------------------------------------------
Failures:
[1]:
time : 2023-09-12_21:48:16
host : 5106563
rank : 2 (local_rank: 2)
exitcode : -6 (pid: 3951)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3951
[2]:
time : 2023-09-12_21:48:16
host : 5106563
rank : 3 (local_rank: 3)
exitcode : -6 (pid: 3952)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3952
[3]:
time : 2023-09-12_21:48:16
host : 5106563
rank : 4 (local_rank: 4)
exitcode : -6 (pid: 3953)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3953
[4]:
time : 2023-09-12_21:48:16
host : 5106563
rank : 5 (local_rank: 5)
exitcode : -6 (pid: 3954)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3954
[5]:
time : 2023-09-12_21:48:16
host : 5106563
rank : 6 (local_rank: 6)
exitcode : -6 (pid: 3956)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3956
[6]:
time : 2023-09-12_21:48:16
host : 5106563
rank : 7 (local_rank: 7)
exitcode : -6 (pid: 3958)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3958
-----------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-09-12_21:48:16
host : 5106563
rank : 1 (local_rank: 1)
exitcode : -6 (pid: 3950)
error_file: <N/A>
traceback : Signal 6 (SIGABRT) received by PID 3950
=====================================================
FrancescoVV commented 9 months ago

The error I get for the coarse training is very similar, but I get a different one for the semantic model:

My training script:

# imports
import math
import wave
import struct
import os
import urllib.request
import tarfile
from audiolm_pytorch import (
    SoundStream,
    SoundStreamTrainer,
    HubertWithKmeans,
    SemanticTransformer,
    SemanticTransformerTrainer,
    HubertWithKmeans,
    CoarseTransformer,
    CoarseTransformerWrapper,
    CoarseTransformerTrainer,
    FineTransformer,
    FineTransformerWrapper,
    FineTransformerTrainer,
    AudioLM,
)
from torch import nn
import torch
import torchaudio

def main():
    # define all dataset paths, checkpoints, etc
    dataset_folder = "./data/LibriTTS"
    hubert_ckpt = "hubert/hubert_base_ls960.pt"
    hubert_quantizer = (
        f"hubert/hubert_base_ls960_L9_km500.bin"  # listed in row "HuBERT Base (~95M params)", column Quantizer
    )

    wav2vec = HubertWithKmeans(checkpoint_path=f"./{hubert_ckpt}", kmeans_path=f"./{hubert_quantizer}")

    semantic_transformer = SemanticTransformer(num_semantic_tokens=wav2vec.codebook_size, dim=1024, depth=6).cuda()

    trainer = SemanticTransformerTrainer(
        transformer=semantic_transformer,
        wav2vec=wav2vec,
        folder=dataset_folder,
        batch_size=256,
        data_max_length=50000,
        save_results_every=10000,
        save_model_every=10000,
        num_train_steps=50_001,
    )

    trainer.train()

if __name__ == "__main__":
    main()

The error I get:

training with dataset of 110675 samples and validating with randomly splitted 5825 samples
0: loss: 6.401525020599365
Traceback (most recent call last):
File "train_semantic.py", line 55, in <module>
main()
File "train_semantic.py", line 51, in main
trainer.train()
File "/AudioLM/audiolm_pytorch/trainer.py", line 852, in train
logs = self.train_step()
File "/AudioLM/audiolm_pytorch/trainer.py", line 805, in train_step
loss = self.train_wrapper(**data_kwargs, return_loss=True)
File "/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 3: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
File "train_semantic.py", line 55, in <module>
main()
File "train_semantic.py", line 51, in main
trainer.train()
File "/AudioLM/audiolm_pytorch/trainer.py", line 852, in train
logs = self.train_step()
File "/AudioLM/audiolm_pytorch/trainer.py", line 805, in train_step
loss = self.train_wrapper(**data_kwargs, return_loss=True)
File "/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 2: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
Traceback (most recent call last):
File "train_semantic.py", line 55, in <module>
main()
File "train_semantic.py", line 51, in main
trainer.train()
File "/AudioLM/audiolm_pytorch/trainer.py", line 852, in train
logs = self.train_step()
File "/AudioLM/audiolm_pytorch/trainer.py", line 805, in train_step
loss = self.train_wrapper(**data_kwargs, return_loss=True)
File "/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 1: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
0: valid loss 4.485287666320801
0: saving model to results
Traceback (most recent call last):
File "train_semantic.py", line 55, in <module>
main()
File "train_semantic.py", line 51, in main
trainer.train()
File "/AudioLM/audiolm_pytorch/trainer.py", line 852, in train
logs = self.train_step()
File "/AudioLM/audiolm_pytorch/trainer.py", line 805, in train_step
loss = self.train_wrapper(**data_kwargs, return_loss=True)
File "/miniconda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/miniconda/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1139, in forward
if torch.is_grad_enabled() and self.reducer._rebuild_buckets():
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument `find_unused_parameters=True` to `torch.nn.parallel.DistributedDataParallel`, and by
making sure all `forward` function outputs participate in calculating loss.
If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's `forward` function. Please include the loss function and the structure of the return value of `forward` of your module when reporting this issue (e.g. list, dict, iterable).
Parameter indices which did not receive grad for rank 0: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 ...
In addition, you can set the environment variable TORCH_DISTRIBUTED_DEBUG to either INFO or DETAIL to print out information about which particular parameters did not receive gradient on this rank as part of this error
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 357 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 1 (pid: 358) of binary: /miniconda/bin/python
Traceback (most recent call last):
File "/miniconda/bin/accelerate", line 8, in <module>
sys.exit(main())
File "/miniconda/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 45, in main
args.func(args)
File "/miniconda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/miniconda/lib/python3.8/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/miniconda/lib/python3.8/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/miniconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/miniconda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_semantic.py FAILED
------------------------------------------------------------
Failures:
[1]:
time : 2023-09-13_13:56:19
host : 5109201
rank : 2 (local_rank: 2)
exitcode : 1 (pid: 359)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
[2]:
time : 2023-09-13_13:56:19
host : 5109201
rank : 3 (local_rank: 3)
exitcode : 1 (pid: 360)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2023-09-13_13:56:19
host : 5109201
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 358)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
lucidrains commented 9 months ago

shot in the dark here, but could you try inserting os.environ['NCCL_BLOCKING_WAIT'] = '0' at the top of the script?
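
i.e. at the very top of the script, before torch gets imported, something like this (just a sketch):

import os
os.environ['NCCL_BLOCKING_WAIT'] = '0'  # set it before torch / NCCL initialize the process group

# ... the rest of the imports and the training script stay the same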

otherwise i'll have to debug this once i get my hands on a multi-GPU environment
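
also, the DDP error in your semantic trace suggests passing find_unused_parameters=True. if the trainer version you are on accepts an accelerate_kwargs dict that gets handed to the internal Accelerator (i'm treating that keyword as an assumption here, do double check it against your installed version), an untested sketch that might be worth a try:

from accelerate import DistributedDataParallelKwargs

# assumption: the trainer forwards accelerate_kwargs straight into Accelerator(...)
ddp_kwargs = DistributedDataParallelKwargs(find_unused_parameters=True)

trainer = SemanticTransformerTrainer(
    transformer=semantic_transformer,
    wav2vec=wav2vec,
    folder=dataset_folder,
    batch_size=256,
    data_max_length=50000,
    num_train_steps=50_001,
    accelerate_kwargs=dict(kwargs_handlers=[ddp_kwargs]),  # hypothetical keyword, see note above
)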

FrancescoVV commented 9 months ago

I should have an update on this soon, but I cannot try that right now. Expect an answer within 24 hours.

FrancescoVV commented 9 months ago

The behaviour doesn't change with that inserted at the beginning of the script.

lucidrains commented 9 months ago

@FrancescoVV want to try 1.5.2?

FrancescoVV commented 9 months ago

I am still not able to train on multiple GPUs, but now the error looks the same for all three models, and it is the one I previously had for the coarse and fine models: the model does two iterations, then hangs doing nothing until the NCCL timeout kicks in and it crashes.

lucidrains commented 9 months ago

@FrancescoVV ok, that's some progress

i got access to my deep learning rig and fixed another issue. do you want to try 1.5.4?

lucidrains commented 9 months ago

@FrancescoVV just to make sure, you are seeing something training on one machine right?

FrancescoVV commented 9 months ago

@FrancescoVV just to make sure, you are seeing something training on one machine right?

I don't understand exactly what you are asking here. If the question is whether I only tried training on one node, the answer is yes, I only tried on one node (but with 8x V100, 3090, and A100 setups).

It seems to be working now! The only issue I had was when rerunning: the "yes or no" prompt still blocks training if previous results exist, and answering it doesn't unblock the run, so I had to manually delete the results of the previous test.

lucidrains commented 9 months ago

@FrancescoVV oh, what I meant is whether you tried any small scale training and saw it generate audio on one node. sounds like you have, but my question was whether you saw any results at all

that's great news! I'll fix that confirmation issue too; feel free to leave the issue open until I do

lucidrains commented 9 months ago

@FrancescoVV could you see if that confirmation issue is fixed in 1.5.5?