lsspear opened this issue 5 years ago
As an update, I've left it training on my own data (20x20 images, with various depth/levels/layers settings), and this is what I've gotten so far:
loganspear@mpc-research-vm:~/chainer-glow/run$ python3 train.py -dataset /home/loganspear/INCLUDE_ABS_resids_nseg21_nov16_nfft38_npy -b 16 -depth 16 -levels 2 -nn 256 -bits 5 -ext npy -gpu 0
---- ---------------
# 69500
mean -0.509841
var 0.000216433
---- ---------------
------------------ --------
image_size (20, 20)
nn_hidden_channels 256
lu_decomposition False
num_bits_x 5
depth_per_level 16
levels 2
squeeze_factor 2
------------------ --------
loading snapshot/model.hdf5
Iteration 1: Batch 1232 / 4343 - loss: nan - nll: nan - kld: 0.00000000 - log_det: nan
Iteration 1 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.599 min
Iteration 2 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.353 min
Iteration 3 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.157 min
Iteration 4 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.238 min
Iteration 5 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.225 min
Iteration 6 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.382 min
Iteration 7 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.091 min
Iteration 8 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.967 min
Iteration 9 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.133 min
Iteration 10 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.366 min
Iteration 11 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.815 min
Iteration 12 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.650 min
Iteration 13 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.799 min
Iteration 14 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.880 min
Iteration 15 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.924 min
Iteration 16 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.863 min
Iteration 17 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.795 min
Iteration 18 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.813 min
Iteration 19 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.800 min
Iteration 20 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.856 min
Iteration 21 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.741 min
Iteration 22 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.878 min
Iteration 23 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.893 min
Iteration 24 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.791 min
Iteration 25 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.868 min
Iteration 26 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 35.084 min
Iteration 27 - loss: nan - log_likelihood: nan - kld: 0.00000 - elapsed_time: 34.958 min
Iteration 28: Batch 758 / 4343 - loss: nan - nll: nan - kld: 0.00000000 - log_det: nan
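One thing that stands out in the stats at the top of my log: the reported var of my dataset is ~0.0002, which is tiny. If the preprocessing or actnorm-style initialization divides by a standard deviation that small, it could easily blow up to NaN on the very first batch. Here is the quick numpy sanity check I'd run over the .npy files before training (the path is just a placeholder for my setup, adjust to yours):

import os
import numpy as np

dataset_dir = "/path/to/npy/dataset"  # placeholder; point this at your own data

for name in sorted(os.listdir(dataset_dir))[:10]:
    if not name.endswith(".npy"):
        continue
    arr = np.load(os.path.join(dataset_dir, name))
    # Print per-file stats and whether every value is finite.
    print(name, arr.shape, arr.dtype,
          "min=%.4g max=%.4g mean=%.4g var=%.4g finite=%s"
          % (arr.min(), arr.max(), arr.mean(), arr.var(),
             np.isfinite(arr).all()))

If the per-file variances are all near zero, rescaling the data before training would be the first thing I'd try.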
Hi, I trained on my laptop and didn't get NaN.
python3 train.py -dataset ../../celeba-64x64-images-npy/ -b 4 -depth 32 -levels 4 -nn 512 -bits 5 -ext npy
---- ------------
# 30000
mean -0.082911
var 0.082476
---- ------------
------------------ --------
depth_per_level 32
levels 4
num_bits_x 5
squeeze_factor 2
nn_hidden_channels 512
image_size (64, 64)
lu_decomposition False
------------------ --------
loading snapshot/model.hdf5
Iteration 1: Batch 22 / 7500 - loss: 2.96095991 - nll: 2.20780435 - kld: 0.00000000 - log_det: -0.75315560
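One more thing worth checking: your log (like mine) prints loading snapshot/model.hdf5, so if an earlier run already diverged, you may be resuming from weights that are already NaN. A rough sketch of how to scan the snapshot, assuming it is a plain HDF5 file as written by chainer.serializers.save_hdf5:

import h5py
import numpy as np

with h5py.File("snapshot/model.hdf5", "r") as f:
    def check(name, obj):
        # Only parameter arrays are datasets; groups are skipped.
        if isinstance(obj, h5py.Dataset):
            data = np.asarray(obj[...])
            if data.dtype.kind == "f" and not np.isfinite(data).all():
                print("non-finite values in", name)
    f.visititems(check)

If anything prints, delete the snapshot and start training from scratch.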
I am also getting NaN.
I downloaded the sample 32x32 celeb dataset linked in the readme, then launched training with the same command given in the readme (changing only the dataset path to match my local setup).
When it trains, every value reported during training is NaN (or 0, for kld):
It's training quite slowly, but is this expected, and will the values change after enough training? Or is something wrong? I'm training on Google Cloud with a K80 GPU.
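In case it helps, one general mitigation I'm considering trying is a lower learning rate plus gradient clipping. A minimal sketch of the usual Chainer pattern (not this repo's actual train.py, which I haven't modified yet, and with a stand-in model just for illustration):

import chainer
from chainer import optimizers, optimizer_hooks

model = chainer.links.Linear(None, 10)  # stand-in for the Glow model
optimizer = optimizers.Adam(alpha=1e-4)  # lower than Chainer's default alpha of 1e-3
optimizer.setup(model)
# Clip the global gradient norm so one bad batch can't blow up the weights.
optimizer.add_hook(optimizer_hooks.GradientClipping(threshold=10.0))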