openai / glow

Code for reproducing results in "Glow: Generative Flow with Invertible 1x1 Convolutions"
https://arxiv.org/abs/1807.03039
MIT License

anyone reproduced the celeba-HQ results in the paper #37

Open winwinJJiang opened 6 years ago

winwinJJiang commented 6 years ago

Hi, has anyone reproduced the HQ (256×256) images? My problem is that I cannot train on my GPUs for such a long time.

gwern commented 6 years ago

The Glow paper is very unclear on the computational demands, but if you look at the README's example command for CelebA, or the blog post, you'll see that they train the CelebA model on 40 GPUs for an unspecified amount of time (probably more than a week). That's almost 1 GPU-year, so it's no surprise that people trying it out on 1 or 2 GPUs (like myself) for a few days, or weeks at most, haven't reached similar results.

If you just want to generate 256px images, you might be better off with ProGAN; at least until Glow gets self-attention or progressive growing, it won't be competitive. Consider what it would cost to reproduce on AWS spot right now: ~$0.3/h for a single p2.xlarge instance at spot; 40 of those, for say a week, which is 7*24=168h; 0.3*40*168 = $2,016, assuming nothing goes wrong. (And things might well go wrong: I've run into a bunch of Glow crashes due to the model losing 'invertibility'. There's no mention of this in the Glow repo or issues, and the default checkpointing is very infrequent, so I assume it wasn't a problem for the authors because of the very large minibatches from using 40 GPUs.)

prafullasd commented 6 years ago

Yes, we trained with 40 GPUs for about a week, but samples did start to look good after a couple of days. If you're getting invertibility errors with small batch sizes, try increasing the warmup epochs or decreasing the learning rate. A repository that seems to be able to get similar results to ours is https://github.com/musyoku/chainer-glow

To train faster, you could work at a smaller resolution, use a smaller model, or try to tweak the learning rate / optimizer for faster convergence (especially if you're using big batch sizes). If you want to use a larger minibatch per GPU, you can try implementing the O(1)-memory version, which uses the reversibility of the model to avoid storing activations while backpropagating, so that GPU memory usage is independent of the model's depth. An example implementation of O(1)-memory reversible flow models in TensorFlow (this one does RealNVP) is here: https://github.com/unixpickle/cnn-toys/tree/master/cnn_toys/real_nvp
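
For illustration only (this is not code from either repository), here is a minimal NumPy sketch of the O(1)-memory idea for an additive coupling layer: the backward pass reconstructs the layer's inputs from its outputs via the inverse, so activations never need to be stored.

```python
# Minimal sketch (not the repo's implementation) of O(1)-memory backprop for an
# additive coupling layer y1 = x1, y2 = x2 + m(x1): the inputs are recomputed
# from the outputs during the backward pass instead of being stored.
import numpy as np

rng = np.random.default_rng(0)
W = 0.1 * rng.normal(size=(4, 4))       # toy coupling network m(x1) = x1 @ W

def m(x1):
    return x1 @ W

def forward(x1, x2):
    # Only the outputs need to be kept; the inputs can be discarded.
    return x1, x2 + m(x1)

def inverse(y1, y2):
    return y1, y2 - m(y1)

def backward(y1, y2, dy1, dy2):
    # Recompute the inputs from the outputs (this is what saves the memory),
    # then backpropagate through y1 = x1, y2 = x2 + x1 @ W.
    x1, x2 = inverse(y1, y2)
    dW  = x1.T @ dy2                    # gradient w.r.t. the coupling weights
    dx1 = dy1 + dy2 @ W.T               # direct path plus the path through m
    dx2 = dy2
    return (dx1, dx2), dW

x1, x2 = rng.normal(size=(8, 4)), rng.normal(size=(8, 4))
y1, y2 = forward(x1, x2)
xr1, xr2 = inverse(y1, y2)
assert np.allclose(xr1, x1) and np.allclose(xr2, x2)   # inputs recovered exactly
```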

iRmantou commented 6 years ago

Hi @prafullasd, I have read your paper and code; the results are amazing. But I am new to Horovod, and I notice your commands just use mpiexec ... without "-H" or other parameters. That's very simple compared with the usage example on the Horovod GitHub site, which is as follows:

```
# run on 4 machines with 4 GPUs each
$ mpirun -np 16 \
    -H server1:4,server2:4,server3:4,server4:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py
```

Is there anything I missed? Please give me some advice, thank you very much! @gwern how many GPUs did you use? One machine, or a GPU cluster?

gwern commented 6 years ago

2x1080ti.

iRmantou commented 6 years ago

@gwern thanks for your reply! Could you give me the command you used? Just mpiexec -n 8 python train.py --problem cifar ... without any other parameters, such as -H, -bind-to none, and so on?

gwern commented 6 years ago

Yes, I just copied their command with mpiexec -n 2 (since I only have 2 GPUs, of course) and it worked. I didn't add any of the stuff you mentioned.

iRmantou commented 6 years ago

@gwern Thank you so much!

Avmb commented 6 years ago

@prafullasd about the invertibility issue, would it make sense to force approximate orthogonality of the 1x1 convolutions using a penalty? You'd avoid non-invertibility and numerical-instability errors; moreover, if the approximation is good enough, you could even save computation time by replacing the matrix inversion with a transpose and removing the determinant computation (you would only need to compute it once at init to determine whether it's 1 or -1, and it stays the same during training; actually you don't even need that, since only the absolute value matters).

nshepperd commented 6 years ago

A good alternative to fix the invertibility issue would be to use the LU decomposition (which is included in the code, in model.py, but not used by default), with the diagonal entries of both triangular matrices fixed to 1 (which is not currently the case in the code). This would fix the absolute value of the determinant to 1 and ensure the matrix is always invertible.

Forcing approximate orthogonality with a penalty term is not a bad idea as well.
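
A minimal NumPy sketch of the unit-diagonal LU parameterization described above (an illustration under my own assumptions, not the version in model.py): the permutation is fixed at init and only the strictly triangular parts are trained, so |det W| = 1 and W is always invertible.

```python
# Sketch of a unit-diagonal LU-parameterized 1x1 convolution weight,
# W = P @ L @ U with diag(L) = diag(U) = 1, so det(W) = det(P) = +/-1.
import numpy as np
import scipy.linalg

c = 8                                            # number of channels
rng = np.random.default_rng(0)
P, L0, U0 = scipy.linalg.lu(rng.normal(size=(c, c)))   # P stays fixed

# Trainable parameters: strictly lower part of L, strictly upper part of U.
l_params = np.tril(L0, k=-1)
u_params = np.triu(U0, k=1)

def weight(l_params, u_params):
    L = np.eye(c) + np.tril(l_params, k=-1)      # unit diagonal
    U = np.eye(c) + np.triu(u_params, k=1)       # unit diagonal
    return P @ L @ U

W = weight(l_params, u_params)
assert np.isclose(abs(np.linalg.det(W)), 1.0)    # log|det W| contribution is 0
```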

nshepperd commented 6 years ago

To follow up on this, I implemented the orthogonality penalty as a simple -20*||w'w - I||_F^2 term in the objective function (at invertible_1x1_conv), i.e. the summed elementwise squared difference between w'w and the identity matrix, where w is the weight matrix of the 1x1 convolution. 20 was the lowest penalty multiplier that seemed to reliably keep the total squared difference small (<0.4).
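
For concreteness, a TensorFlow sketch of such a penalty term (the function name and the way it would be wired into the loss are my own; only the formula and the multiplier 20 come from the comment above):

```python
import tensorflow as tf

def orthogonality_penalty(w, weight=20.0):
    """Penalty weight * ||w^T w - I||_F^2 for the [c, c] 1x1 convolution weight w."""
    c = tf.shape(w)[0]
    gram = tf.matmul(w, w, transpose_a=True)           # w^T w
    diff = gram - tf.eye(c, dtype=w.dtype)
    return weight * tf.reduce_sum(tf.square(diff))     # squared Frobenius norm

# Subtracted from the log-likelihood objective (equivalently, added to the loss)
# for every invertible_1x1_conv layer in the model.
```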

After this, I still had an invertibility crash, so I figured it had to be some sudden spiking-gradient issue / numerical instability pushing the weights away from invertibility, since the orthogonality penalty would have pulled them back if the drift were gradual. Looking at revnet2d_step, I saw that the code applies a sigmoid function to the scale factors to produce the value "s" from the paper (interestingly, yes, sigmoid, not exp as in the paper). I was pretty suspicious of this sigmoid, as it can get arbitrarily close to 0, which means that the log(s) calculated for the determinant (as well as the 1/s for the reverse step) could in principle produce an arbitrarily large value, and hence gradient...

My solution to this was to add an epsilon (currently 0.1, but I haven't experimented with this hyperparameter much yet) to the output of tf.nn.sigmoid(h[:, :, :, 1::2] + 2.), to constrain it to be well above 0. I haven't had an invertibility crash with this yet, after running all day, and the epsilon doesn't seem to meaningfully affect the model's capacity. It has also had the positive side effect of removing some artifacts in the samples that were clearly due to that 1/s becoming very large.
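
For reference, a hedged sketch of that change (the wrapper function is mine; the slicing expression, the +2 shift, and the 0.1 epsilon are the ones quoted above):

```python
import tensorflow as tf

def bounded_scale(h, eps=0.1):
    # Affine-coupling scale factors kept >= eps, so that neither log(s) in the
    # log-determinant nor 1/s in the reverse pass can blow up.
    return tf.nn.sigmoid(h[:, :, :, 1::2] + 2.) + eps
```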

Avmb commented 6 years ago

Bounding s away from zero makes sense. I suppose they didn't do it because the optimization objective generally pushes |s| up, which is probably why they used a sigmoid instead of an exp; but in some cases the model may try to go for a lower s for some reason (maybe to reduce the entropy, if the input has too much relative to the target latent?).

nuges01 commented 5 years ago

@nshepperd, did your solution in your last paragraph above end up fixing the issue as you continued training beyond 1 day?

Also, would you say it was the combination of modifications you made that fixed it, or would it suffice to just add the epsilon? Would you mind adding snippets of your changes for the rest of us who are struggling with the issue? Thanks!