kan-bayashi / PytorchWaveNetVocoder

WaveNet-Vocoder implementation with pytorch.
https://kan-bayashi.github.io/WaveNetVocoderSamples/
Apache License 2.0

parallel training #48

Closed yfliao closed 5 years ago

yfliao commented 5 years ago

Dear Tomoki,

Is it possible to run parallel training/conversion using more than one machine at the same time?

In our working environment there are 4 machines, each with 2 GPUs, and Slurm is properly installed. However, it seems that only one machine can be allocated for stages 4 and 5. For example:

sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
P1           up   infinite      1  alloc gccn01
P1           up   infinite      4   idle gccn[02-04],gchead

squeue

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
   46        P1    tr.sh     liao  R       8:55      1 gccn01
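As an aside, a node-by-node view of the GPUs that Slurm advertises can be obtained with a generic sinfo query (the format string below is just one possible choice, not specific to this repo):

# list each node with its generic resources (GPUs), CPU counts, and state
sinfo -N -o "%N %G %C %T"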

Thanks for your help and have a nice day!

Best Regards, Yuanfu

PS: Here are the environment settings:

# run.sh
n_gpus=2
n_quantize=256
n_aux=80
n_resch=512
n_skipch=256
dilation_depth=10
dilation_repeat=3
kernel_size=2
lr=1e-4
weight_decay=0.0
iters=200000
batch_length=20000
batch_size=8
checkpoints=1000
use_upsampling=true
use_noise_shaping=true
resume=

# cmd.sh
export train_cmd="slurm.pl --config conf/slurm.conf"
export cuda_cmd="slurm.pl --gpu 1 --config conf/slurm.conf"
export max_jobs=-1
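For context, slurm.pl follows the Kaldi queue-script convention: it takes the options, a log file, and the command to run, and each call becomes one Slurm job, so several concurrent calls can be scheduled on different nodes. A rough illustration (the log path and wrapped command are placeholders, not taken from run.sh):

# each ${cuda_cmd} call submits exactly one Slurm job requesting one GPU
${cuda_cmd} --gpu 1 exp/example/log/train.log echo "training command goes here"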

# slurm.conf
# COMPUTE NODES
GresTypes=gpu
NodeName=gchead Gres=gpu:0 CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=128815 State=UNKNOWN
NodeName=gccn0[1-4] Gres=gpu:2 CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=128815 State=UNKNOWN
PartitionName=P1 Nodes=gchead,gccn0[1-4] Default=YES MaxTime=INFINITE State=UP

# gres.conf
Name=gpu Type=tesla File=/dev/nvidia0 Cores=0,1
Name=gpu Type=tesla File=/dev/nvidia1 Cores=0,1

# scontrol show node

NodeName=gccn01 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.08 Features=(null) Gres=gpu:2
   NodeAddr=gccn01 NodeHostName=gccn01 Version=15.08
   OS=Linux RealMemory=128815 AllocMem=0 FreeMem=123504 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2019-03-05T17:01:46 SlurmdStartTime=2019-03-10T18:32:19
   CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gccn02 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.20 Features=(null) Gres=gpu:2
   NodeAddr=gccn02 NodeHostName=gccn02 Version=15.08
   OS=Linux RealMemory=128815 AllocMem=0 FreeMem=1974 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2018-02-04T18:11:19 SlurmdStartTime=2019-03-10T18:32:24
   CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gccn03 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.15 Features=(null) Gres=gpu:2
   NodeAddr=gccn03 NodeHostName=gccn03 Version=15.08
   OS=Linux RealMemory=128815 AllocMem=0 FreeMem=113919 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2019-03-05T17:24:43 SlurmdStartTime=2019-03-10T18:32:28
   CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gccn04 Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.28 Features=(null) Gres=gpu:2
   NodeAddr=gccn04 NodeHostName=gccn04 Version=15.08
   OS=Linux RealMemory=128815 AllocMem=0 FreeMem=126584 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2019-03-05T17:25:50 SlurmdStartTime=2019-03-10T18:32:32
   CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

NodeName=gchead Arch=x86_64 CoresPerSocket=10
   CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.00 Features=(null) Gres=(null)
   NodeAddr=gchead NodeHostName=gchead Version=15.08
   OS=Linux RealMemory=128815 AllocMem=0 FreeMem=114584 Sockets=2 Boards=1
   State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A
   BootTime=2019-02-28T18:49:40 SlurmdStartTime=2019-03-10T18:32:47
   CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

kan-bayashi commented 5 years ago

Sorry for the late reply. Yes, you can train several models at the same time. For example:

# after finishing stages 0123
$ sbatch run.sh --stage 456 --lr 1e-4
$ sbatch run.sh --stage 456 --lr 1e-3

In the above case, two models will be trained with different hyperparameters.
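As a concrete (hypothetical) extension of the commands above, a small sweep can be submitted in one go; since each sbatch call creates an independent job, Slurm places them on whichever nodes in the partition have free resources:

# submit one independent training job per learning rate, using only the
# --stage/--lr options shown in the reply above
for lr in 1e-3 1e-4; do
    sbatch run.sh --stage 456 --lr "${lr}"
done

# check that the jobs spread across the idle nodes (gccn0[1-4])
squeue -u "${USER}"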