openai / glow

Code for reproducing results in "Glow: Generative Flow with Invertible 1x1 Convolutions"
https://arxiv.org/abs/1807.03039
MIT License

Training error when using multiple GPUs #69

Closed paulchou0309 closed 5 years ago

paulchou0309 commented 5 years ago

Run:

mpiexec -n 4 python3 train.py --problem celeba --image_size 256 --n_level 6 --depth 32 --flow_permutation 2 --flow_coupling 0 --seed 0 --learntop --lr 0.001 --n_bits_x 5

Error:

Rank 2 Batch sizes Train 1 Test 1 Init 4
Rank 1 Batch sizes Train 1 Test 1 Init 4
Traceback (most recent call last):
  File "train.py", line 413, in <module>
    main(hps)
  File "train.py", line 145, in main
    train_iterator, test_iterator, data_init = get_data(hps, sess)
  File "train.py", line 108, in get_data
    hps.local_batch_test, hps.local_batch_init, hps.image_size, hps.rnd_crop)
  File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 63, in get_data
    train_file = get_tfr_file(data_dir, 'train', int(np.log2(resolution)))
  File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 55, in get_tfr_file
    assert len(files) == int(files[0].split(
IndexError: list index out of range
Rank 3 Batch sizes Train 1 Test 1 Init 4
Traceback (most recent call last):
  File "train.py", line 413, in <module>
    main(hps)
  File "train.py", line 145, in main
    train_iterator, test_iterator, data_init = get_data(hps, sess)
  File "train.py", line 108, in get_data
    hps.local_batch_test, hps.local_batch_init, hps.image_size, hps.rnd_crop)
  File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 63, in get_data
    train_file = get_tfr_file(data_dir, 'train', int(np.log2(resolution)))
  File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 55, in get_tfr_file
    assert len(files) == int(files[0].split(
IndexError: list index out of range
Traceback (most recent call last):
  File "train.py", line 413, in <module>
    main(hps)
  File "train.py", line 145, in main
    train_iterator, test_iterator, data_init = get_data(hps, sess)
  File "train.py", line 108, in get_data
    hps.local_batch_test, hps.local_batch_init, hps.image_size, hps.rnd_crop)
  File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 63, in get_data
    train_file = get_tfr_file(data_dir, 'train', int(np.log2(resolution)))
  File "/AI/home/zhoujia/Image/glow/data_loaders/get_data.py", line 55, in get_tfr_file
    assert len(files) == int(files[0].split(
IndexError: list index out of range
Rank 0 Batch sizes Train 1 Test 1 Init 4

Primary job terminated normally, but 1 process returned a non-zero exit code. Per user-direction, the job has been aborted.


mpiexec detected that one or more processes exited with non-zero status, thus causing the job to be terminated. The first process to do so was:

Process name: [[45327,1],2]
Exit code: 1

How can I make it work?
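A note on reading the traceback: `files[0]` raising `IndexError` most likely means the `files` list inside `get_tfr_file` was empty on those ranks, that is, no dataset files were found for the 'train' split at resolution level `int(np.log2(256))`, which is 8. How `files` is built is not shown in the traceback, so the sketch below is only a stand-alone check run before launching `mpiexec`, not the repository's code. The helper name `find_shards`, the use of `glob`, and the filename pattern are all assumptions made for illustration and should be adjusted to match the actual dataset files.

```python
import glob
import math
import os


def find_shards(data_dir, split, res_lvl,
                pattern="*-r%02d-s-*-of-*.tfrecords"):
    """Return sorted file paths under data_dir/split that match the pattern.

    Hypothetical helper: the filename pattern is an assumed layout, not
    taken from the traceback. Raises FileNotFoundError with a readable
    message instead of letting files[0] fail with IndexError.
    """
    split_dir = os.path.join(data_dir, split)
    wanted = pattern % res_lvl
    files = sorted(glob.glob(os.path.join(split_dir, wanted)))
    if not files:
        raise FileNotFoundError(
            "No files matching %r in %r; check the data directory "
            "and that the dataset was downloaded and unpacked there"
            % (wanted, split_dir))
    return files


def resolution_level(image_size):
    """Same arithmetic as int(np.log2(resolution)) in the traceback."""
    return int(math.log2(image_size))
```

For example, calling `find_shards(my_data_dir, "train", resolution_level(256))` with `my_data_dir` set to the directory given to the training script either lists the matching files or explains which directory and pattern came up empty. When running under `mpiexec` across several machines, the same check would have to pass on every node, since each rank reads the data directory independently.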