Closed (yfliao closed 5 years ago)
Sorry for the late reply. Yes, you can train several models at the same time. For example:
# after stages 0123 have finished
$ sbatch run.sh --stage 456 --lr 1e-4
$ sbatch run.sh --stage 456 --lr 1e-3
In the above case, two models will be trained with different hyperparameters.
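The same idea extends to a small sweep. The following is only a sketch: `submit_sweep` is a hypothetical helper, not part of the recipe, and it assumes `run.sh` accepts `--stage` and `--lr` exactly as in the commands above. Passing `echo sbatch` as the first argument prints the commands instead of submitting them.

```shell
#!/usr/bin/env bash

# submit_sweep (hypothetical helper): run one submit command per learning rate.
#   $1     submit command, e.g. "sbatch" (real) or "echo sbatch" (dry run)
#   $2...  learning rates to try
submit_sweep() {
    local submit_cmd="$1"
    shift
    local lr
    for lr in "$@"; do
        # Left unquoted on purpose so "echo sbatch" splits into two words.
        ${submit_cmd} run.sh --stage 456 --lr "${lr}"
    done
}

# Dry run: only prints the commands that would be submitted.
submit_sweep "echo sbatch" 1e-4 1e-3
```

Replacing `"echo sbatch"` with `"sbatch"` would submit each job separately, so Slurm can schedule them independently of one another.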
Dear Tomoki,
Is it possible to run training/conversion in parallel on more than one machine at the same time?
In our working environment there are 4 machines, each with 2 GPUs, and Slurm is installed and working. However, it seems that only one machine is allocated for stages 4 and 5. For example:
$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
P1        up    infinite  1     alloc gccn01
P1        up    infinite  4     idle  gccn[02-04],gchead
$ squeue
JOBID PARTITION NAME  USER ST TIME NODES NODELIST(REASON)
46    P1        tr.sh liao R  8:55 1     gccn01
Thanks for your help and have a nice day!
Best Regards, Yuanfu
PS: Here are the environment settings:
# run.sh
n_gpus=2
n_quantize=256
n_aux=80
n_resch=512
n_skipch=256
dilation_depth=10
dilation_repeat=3
kernel_size=2
lr=1e-4
weight_decay=0.0
iters=200000
batch_length=20000
batch_size=8
checkpoints=1000
use_upsampling=true
use_noise_shaping=true
resume=
# cmd.sh
export train_cmd="slurm.pl --config conf/slurm.conf"
export cuda_cmd="slurm.pl --gpu 1 --config conf/slurm.conf"
export max_jobs=-1
# slurm.conf
# COMPUTE NODES
GresTypes=gpu
NodeName=gchead Gres=gpu:0 CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=128815 State=UNKNOWN
NodeName=gccn0[1-4] Gres=gpu:2 CPUs=20 Sockets=2 CoresPerSocket=10 ThreadsPerCore=1 RealMemory=128815 State=UNKNOWN
PartitionName=P1 Nodes=gchead,gccn0[1-4] Default=YES MaxTime=INFINITE State=UP
# gres.conf
Name=gpu Type=tesla File=/dev/nvidia0 Cores=0,1
Name=gpu Type=tesla File=/dev/nvidia1 Cores=0,1
# scontrol show node
NodeName=gccn01 Arch=x86_64 CoresPerSocket=10 CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.08 Features=(null) Gres=gpu:2 NodeAddr=gccn01 NodeHostName=gccn01 Version=15.08 OS=Linux RealMemory=128815 AllocMem=0 FreeMem=123504 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A BootTime=2019-03-05T17:01:46 SlurmdStartTime=2019-03-10T18:32:19 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=gccn02 Arch=x86_64 CoresPerSocket=10 CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.20 Features=(null) Gres=gpu:2 NodeAddr=gccn02 NodeHostName=gccn02 Version=15.08 OS=Linux RealMemory=128815 AllocMem=0 FreeMem=1974 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A BootTime=2018-02-04T18:11:19 SlurmdStartTime=2019-03-10T18:32:24 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=gccn03 Arch=x86_64 CoresPerSocket=10 CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.15 Features=(null) Gres=gpu:2 NodeAddr=gccn03 NodeHostName=gccn03 Version=15.08 OS=Linux RealMemory=128815 AllocMem=0 FreeMem=113919 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A BootTime=2019-03-05T17:24:43 SlurmdStartTime=2019-03-10T18:32:28 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=gccn04 Arch=x86_64 CoresPerSocket=10 CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.28 Features=(null) Gres=gpu:2 NodeAddr=gccn04 NodeHostName=gccn04 Version=15.08 OS=Linux RealMemory=128815 AllocMem=0 FreeMem=126584 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A BootTime=2019-03-05T17:25:50 SlurmdStartTime=2019-03-10T18:32:32 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=gchead Arch=x86_64 CoresPerSocket=10 CPUAlloc=0 CPUErr=0 CPUTot=20 CPULoad=0.00 Features=(null) Gres=(null) NodeAddr=gchead NodeHostName=gchead Version=15.08 OS=Linux RealMemory=128815 AllocMem=0 FreeMem=114584 Sockets=2 Boards=1 State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1 Owner=N/A BootTime=2019-02-28T18:49:40 SlurmdStartTime=2019-03-10T18:32:47 CapWatts=n/a CurrentWatts=0 LowestJoules=0 ConsumedJoules=0 ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s