rhasspy / piper

A fast, local neural text to speech system
https://rhasspy.github.io/piper-samples/
MIT License

Multi-gpu training hangs #95

Open Laope94 opened 1 year ago

Laope94 commented 1 year ago

Hi,

when trying to run training on multiple GPUs I am getting this:

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
INFO:torch.distributed.distributed_c10d:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:torch.distributed.distributed_c10d:Waiting in store based barrier to initialize process group for rank: 1, key: store_based_barrier_key:1 (world_size=4, worker_count=1, timeout=0:30:00)

The last message then repeats forever and I have to cancel with Ctrl+Z.

This is my run command:

python3 -m piper_train --dataset-dir /home/train_data/processed/ --quality 'medium' --accelerator 'gpu' --devices 2 --num_nodes 2 --validation-split 0.05 --num-test-examples 5 --max_steps 3000 --max_epochs 1 --enable_checkpointing False --batch-size 16

What can I do to resolve this?

synesthesiam commented 1 year ago

I've seen this before when the "world size" isn't correct. It will wait forever for nodes that aren't there. I believe I set the WORLD_SIZE environment variable to the number of GPUs I wanted to use.
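
A minimal sketch of what that might look like on a single machine with two GPUs, reusing the flags from the command above (an assumption about the setup, not a confirmed fix). Note that the log shows world_size=4, which matches --devices 2 times --num_nodes 2, so the barrier appears to be waiting for a second node that never joins; on one machine, dropping --num_nodes 2 is worth trying:

export WORLD_SIZE=2
python3 -m piper_train --dataset-dir /home/train_data/processed/ --quality 'medium' --accelerator 'gpu' --devices 2 --validation-split 0.05 --num-test-examples 5 --max_steps 3000 --max_epochs 1 --enable_checkpointing False --batch-size 16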

Laope94 commented 1 year ago

Unfortunately, same behaviour after export WORLD_SIZE=2. It is set (I've checked with echo), but I am getting the very same message.

versae commented 1 year ago

I can confirm this also happens when training on TPUs. For TPUv3-8, I can only use one TPU core. When using the 8 cores the training code hangs with:

TPU available: True, using: 8 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
DEBUG:piper_train:Checkpoints will be saved every 1 epoch(s)
DEBUG:vits.dataset:Loading dataset: /data/nst_tts/piper_test_female_50_6/dataset.jsonl
WARNING:root:Unsupported nprocs (8), ignoring...

DogeLord081 commented 1 year ago

> I can confirm this also happens when training on TPUs. For TPUv3-8, I can only use one TPU core. When using the 8 cores the training code hangs with:
>
> TPU available: True, using: 8 TPU cores
> IPU available: False, using: 0 IPUs
> HPU available: False, using: 0 HPUs
> DEBUG:piper_train:Checkpoints will be saved every 1 epoch(s)
> DEBUG:vits.dataset:Loading dataset: /data/nst_tts/piper_test_female_50_6/dataset.jsonl
> WARNING:root:Unsupported nprocs (8), ignoring...

How exactly did you train with a TPU? I'm trying on Colab but I'm running into many issues.

versae commented 1 year ago

Not sure about Colab TPUs; I used a GCP TPUv3-8 and followed the PyTorch Lightning documentation. It was a nightmare to set up.
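
For the record, a rough sketch of what the single-core invocation might look like, assuming piper_train forwards PyTorch Lightning's --accelerator/--devices arguments the same way as the GPU command earlier in this thread (the dataset path is the one from the log above; this is an assumption, not a verified recipe):

# on the GCP TPU VM, with torch_xla installed
python3 -m piper_train --dataset-dir /data/nst_tts/piper_test_female_50_6/ --quality 'medium' --accelerator 'tpu' --devices 1 --validation-split 0.05 --batch-size 16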

aaronnewsome commented 7 months ago

I was able to train on 2 GPUs on a single node by adding --devices 2 --gpus='0,1' and --strategy=ddp. I let it run for a few hours and a few checkpoints were saved, and both GPUs were being utilized, so I assumed it was working. However, I've been futzing with trying to get multi-node training working and not having any luck at all. I've tried fiddling with the env vars WORLD_SIZE, NODE_RANK, MASTER_ADDR, MASTER_PORT, etc. I've tried to see if I could figure it out by following along at

https://lightning.ai/docs/pytorch/stable/accelerators/gpu_intermediate.html

and still haven't been able to get multi-node training working. I realize it's probably because I have no idea what I'm doing, but that isn't stopping me from trying.

I started my own issue here but got zero responses after 3 weeks. If anyone following this thread has any idea of how to get multi-node training working, I'd be grateful if you could point me in the right direction. Can this be done by adding/altering arguments to the training examples? Does some code need to be changed in the piper source? Is it just not possible to train piper voices on multiple nodes?
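
For what it's worth, the Lightning page linked above describes a manual multi-node launch in which the same training command is started on every node and the rendezvous is described through environment variables. A sketch for two nodes with two GPUs each, reusing the flags from this thread (the address, port, and counts are placeholders/assumptions, and this is not a confirmed-working piper recipe):

# on node 0 (hosts the rendezvous)
export MASTER_ADDR=192.168.1.10   # placeholder: address of node 0
export MASTER_PORT=29500
export NODE_RANK=0
export WORLD_SIZE=4               # 2 GPUs x 2 nodes
python3 -m piper_train --dataset-dir /home/train_data/processed/ --quality 'medium' --accelerator 'gpu' --devices 2 --num_nodes 2 --strategy 'ddp' --batch-size 16

# on node 1, identical except for NODE_RANK
export MASTER_ADDR=192.168.1.10
export MASTER_PORT=29500
export NODE_RANK=1
export WORLD_SIZE=4
python3 -m piper_train --dataset-dir /home/train_data/processed/ --quality 'medium' --accelerator 'gpu' --devices 2 --num_nodes 2 --strategy 'ddp' --batch-size 16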