matejgrcic / DenseFlow

Official implementation of Densely connected normalizing flows
GNU General Public License v2.0

Questions Regarding Article #4

Closed: alexm-gc closed this issue 2 years ago

alexm-gc commented 2 years ago

Congratulations on very good work! I'm very impressed with what you've been able to accomplish, especially taking the limited compute budget into account!

Question 0. In Table 1 you report the model parameters and training time for three datasets, but not ImageNet64. Do you have the time/parameter numbers for ImageNet64? Are they the same as ImageNet32?

Question 1. FID computation. You write that for image generation you sample from N(0, 0.8). Did you compute FID using this N(0, 0.8) or N(0, I)? I ask because I believe FID is slightly biased in favor of GANs and denoising diffusion models, since they usually trade off individual image quality against variability. You could find the sweet spot in this trade-off for DenseFlow by computing FID for N(0, alpha) with alpha = 0.5, 0.6, ..., 1. I'd be curious to see the best number you could get.
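
For concreteness, here is a minimal sketch of the temperature sweep I have in mind. The `model.sample(n, temperature=...)` interface is my assumption, not the actual DenseFlow API, and FID would then be computed with an external tool such as the pytorch-fid package.

```python
# Hypothetical temperature sweep for FID evaluation.
# The model.sample interface is an assumption, not the actual DenseFlow API.
from pathlib import Path

import torch
from torchvision.utils import save_image


@torch.no_grad()
def sample_at_temperature(model, alpha, num_samples=10000, batch_size=128, out_dir="samples"):
    """Save num_samples images drawn with latents scaled by alpha, i.e. z = alpha * eps, eps ~ N(0, I)."""
    out = Path(out_dir) / f"alpha_{alpha:.1f}"
    out.mkdir(parents=True, exist_ok=True)
    idx = 0
    while idx < num_samples:
        n = min(batch_size, num_samples - idx)
        imgs = model.sample(n, temperature=alpha)  # assumed: temperature scales the prior std
        for img in imgs:
            save_image(img, str(out / f"{idx:05d}.png"))
            idx += 1
    return out


# for alpha in (0.5, 0.6, 0.7, 0.8, 0.9, 1.0):
#     folder = sample_at_temperature(model, alpha)
#     # then e.g.: python -m pytorch_fid <reference_images_dir> <folder>
```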

Question 2. Since NFs sample ~num_pixels times faster than autoregressive models, I'm curious whether we could improve samples at the cost of 10-20x longer sampling time. For example, we could sample a batch of 128 fake images, then run SGD on those 128 fake images themselves, minimizing their negative log-likelihood under the model. Even if we do 100 SGD steps, we'd still be ~num_pixels/100 times faster than autoregressive models.
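
A sketch of what I mean, assuming a `model.sample` / `model.log_prob` interface (my guess, not the actual DenseFlow API):

```python
# Sketch of refining flow samples by gradient descent on the images themselves.
# model.sample and model.log_prob are assumed interfaces, not the actual DenseFlow API.
import torch


def refine_samples(model, num_samples=128, steps=100, lr=1e-3, temperature=0.8):
    with torch.no_grad():
        imgs = model.sample(num_samples, temperature=temperature)  # initial fake images
    imgs = imgs.clone().requires_grad_(True)
    opt = torch.optim.SGD([imgs], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nll = -model.log_prob(imgs).mean()  # minimize NLL, i.e. maximize likelihood of the images
        nll.backward()
        opt.step()
    return imgs.detach()
```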

Question 3. Do you think the increased network width worked better on ImageNet32 or ImageNet64? I ask because I like to think of ImageNet64 as four channel-wise copies of ImageNet32, which effectively increases the network width. I imagine that if we go to 128x128 or 256x256, the depth requirements of normalizing flows may become less of an issue. What do you think?
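
To make the channel-wise intuition concrete: the standard space-to-depth squeeze used in flow models turns a 3x64x64 image into a 12x32x32 tensor, i.e. the ImageNet32 spatial grid with 4x the channels. A tiny illustration (plain PyTorch, not DenseFlow-specific code):

```python
import torch
import torch.nn.functional as F

x64 = torch.randn(1, 3, 64, 64)        # an ImageNet64-sized input
x32_like = F.pixel_unshuffle(x64, 2)   # space-to-depth with factor 2
print(x32_like.shape)                  # torch.Size([1, 12, 32, 32])
```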

Question 4. Did you find batch_size=32 to optimize better for ImageNet64, or was this mainly for memory savings?

Question 5. Did you try using gradient checkpointing?

Question 6. Did you try training in float16?

Question 7. How high was GPU utilization on ImageNet64 with batch_size=32? I.e., if you ran nvidia-smi, would it report 50% or 100%?

matejgrcic commented 2 years ago

Hi, thank you for your interest in our work.

A0. The ImageNet64 model has the same architecture as the model used for ImageNet32. The training duration is also similar.

A1. We reported the FID of images obtained by sampling from N(0, I). Interesting proposal. We will publish the script for generating samples at various temperatures; it might help if you want to try this yourself.

A2. Another interesting idea. Have you already tried it?

A3. High-resolution images usually contain more details. We believe that the proposed incremental augmentation of latent representations would still improve results. However, modelling high-resolution images that are not semantically rich probably wouldn't yield a significant performance improvement. A bigger DenseFlow model would achieve better results on ImageNet64.

A4. We set the batch size to 32 due to limited GPU resources.

A5. The memory-for-compute trade-off offered by gradient checkpointing wasn't attractive to us since the training duration was already high.

A6. Training in float16 was numerically unstable.

A7. Our utilization was poor (no more than 50%). We'd welcome any proposals for improving it.

Hope this helps. Feel free to ask any other questions.

alexm-gc commented 2 years ago

Thanks for the quick reply.

Q0. How does one DenseFlow architecture handle 32x32 and 64x64 sized inputs? Does the 64x64 variant have less augmentation in the first layer compared to the 32x32 variant?

Q1. Interested to see how much improvement it would give.

Q2. No, but I've seen papers doing slightly related things. I'm quite optimistic this would improve sample quality.

Q3. Do you have any intuition about scaling laws for DenseFlow? (a) How large an improvement would you expect if we trained a 10x larger model? (I'm thinking ~1.5B parameters.) (b) Do you think we could compete with ImageGPT if we trained a 7B-parameter model within 2500 V100-days? https://openai.com/blog/image-gpt/

Q6. Do you have any idea what is causing the instability? (a) Did you try increasing the epsilon in AdaMax from 10**(-8) (which underflows to 0 in float16)? (b) Did you try performing the weight update in mixed precision using torch.cuda.amp (automatic mixed precision)? The reason I ask is that float16 would cut memory usage and allow bs=64, which may raise utilization above 50%.
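
Something along these lines, i.e. a generic AMP training step. The `model.log_prob` call is a placeholder for the actual DenseFlow objective, and the larger AdaMax epsilon is mainly relevant for pure fp16 (with AMP the optimizer update stays in fp32 anyway):

```python
# Generic mixed-precision training step with torch.cuda.amp.
# model.log_prob is a placeholder, not the actual DenseFlow training code.
import torch


def make_amp_train_step(model, lr=1e-3, eps=1e-7):
    optimizer = torch.optim.Adamax(model.parameters(), lr=lr, eps=eps)
    scaler = torch.cuda.amp.GradScaler()

    def train_step(x):
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            loss = -model.log_prob(x).mean()  # NLL; bits-per-dim scaling omitted
        scaler.scale(loss).backward()  # scale the loss to avoid fp16 gradient underflow
        scaler.step(optimizer)         # unscales gradients, then applies the fp32 update
        scaler.update()
        return loss.item()

    return train_step
```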

matejgrcic commented 2 years ago

A0. We didn't change augmentations for 64x64 inputs.

A1. I'll notify you once the FID scores are calculated.

A2. Could you link those papers?

A3. I'd scale DenseFlow along three axes: i) enlarge the capacity of the coupling networks, ii) increase the number of invertible units in each DenseFlow block, and iii) increase the number of DenseFlow blocks. I would also employ more complex augmentations, e.g. http://proceedings.mlr.press/v119/jun20a/jun20a.pdf
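
Purely as an illustration of the three axes, a hypothetical scaling configuration might look like this (the names are made up for this sketch and do not match the actual DenseFlow hyperparameters):

```python
# Hypothetical scaling configuration; names are illustrative only.
scaled_config = {
    "coupling_width": 512,    # i) larger coupling networks
    "units_per_block": 24,    # ii) more invertible units per DenseFlow block
    "num_blocks": 4,          # iii) more DenseFlow blocks
    "augmentation_dim": 64,   # richer latent augmentation (cf. the linked paper)
}
```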

A6. I have some ideas, but I didn't manage to conduct any FP16 training experiments after the initial failure. If you are really interested in scaling DenseFlow up that much, I'd happily offer assistance. If needed, we can continue the discussion via email: matej.grcic@fer.hr