GAN scaling: bigger = better faster? (chaos runs & the Bitter Lesson) #28

Open · gwern opened this issue 4 years ago

gwern commented 4 years ago

'Chaos' runs where we combine several of our datasets (typically Danbooru/Danbooru faces/ImageNet/Flickr-(3M subset)) seem to yield relatively realistic figures and images early in training, despite lumping together disparate datasets, for both StyleGAN and BigGAN. While one might expect that going from n~2m (Danbooru SFW) to n=2m+300k+1m+3m=~6.3m might catastrophically overload the model or trigger constant divergence, particularly using unconditional BigGAN/StyleGAN, this does not seem to be the case.
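
For concreteness, here is a minimal sketch of what 'lumping together' the datasets could look like as a single unconditional training stream, sampling from each source in proportion to its size via tf.data. The bucket paths, shard patterns, image size, and exact counts below are placeholders, not our actual pipeline:

```python
import tensorflow as tf

def make_dataset(pattern, image_size=512):
    """Decode one set of TFRecord shards into an endless stream of images."""
    files = tf.data.Dataset.list_files(pattern, shuffle=True)
    ds = files.interleave(tf.data.TFRecordDataset, cycle_length=8,
                          num_parallel_calls=tf.data.AUTOTUNE)

    def parse(example):
        feats = tf.io.parse_single_example(
            example, {'image': tf.io.FixedLenFeature([], tf.string)})
        img = tf.io.decode_jpeg(feats['image'], channels=3)
        img = tf.image.resize(img, [image_size, image_size])
        return img / 127.5 - 1.0  # scale to [-1, 1]

    return ds.map(parse, num_parallel_calls=tf.data.AUTOTUNE).repeat()

# Hypothetical shard locations and approximate sizes for the four sources.
sources = {
    'danbooru':       ('gs://bucket/danbooru-*.tfrecord',       2_000_000),
    'danbooru_faces': ('gs://bucket/danbooru-faces-*.tfrecord',   300_000),
    'imagenet':       ('gs://bucket/imagenet-*.tfrecord',       1_000_000),
    'flickr':         ('gs://bucket/flickr-*.tfrecord',         3_000_000),
}
total = sum(n for _, n in sources.values())
datasets = [make_dataset(p) for p, _ in sources.values()]
weights = [n / total for _, n in sources.values()]

# Pool everything into one 'chaos' stream, weighted by dataset size.
chaos = tf.data.Dataset.sample_from_datasets(datasets, weights=weights)
chaos = chaos.shuffle(4096).batch(32).prefetch(tf.data.AUTOTUNE)
```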

For example, a StyleGAN chaos run (run74-chaos-512/model.ckpt-874496 seed 5678):

[image: StyleGAN chaos-run sample]

This is oddly good for StyleGAN, which struggles badly at anything more complex than single centered objects like faces, especially given that ImageNet/Flickr appear to have little to gain from training simultaneously with Danbooru, and much to lose.

But it also works for BigGAN, where even at ~33k iterations, we see fairly good landscapes, recognizable people, and recognizable anime girls (Twitter samples). Here is the last sample from iteration 43k (#418 step 43950 elapsed 749.44m 2020-05-11 15:24:21 PST .../runs/bigrun61/logs/images/run/images-2020-05-11-09-40-13):

[image: BigGAN chaos-run sample, iteration ~43k]

(Another one from #45,350.) For comparison, one of our Danbooru+Danbooru-faces 256px 128ch BigGAN runs at roughly the same iteration count is not that much better than the anime samples:

[image: Danbooru+Danbooru-faces 256px BigGAN sample at a similar iteration]

For further comparison, a sample from one of our best 256px Danbooru runs at >10x more iterations, 379k (#2229 step 379750 elapsed 4781.88m 2020-05-06 04:57:54 PST .../runs/bigrun40/images/):

[image: 256px Danbooru BigGAN sample, iteration 379k]

(Another one from #461,600.)

The BigGAN paper notes that training on the n=300m JFT-300M internal Google dataset appears to stabilize BigGAN greatly:

> In Figure 19 (Appendix D), we present truncation plots for models trained on this dataset. Unlike for ImageNet, where truncation limits of σ≈0 tend to produce the highest fidelity scores, IS is typically maximized for our JFT-300M models when the truncation value σ ranges from 0.5 to 1. We suspect that this is at least partially due to the intra-class variability of JFT-300M labels, as well as the relative complexity of the image distribution, which includes images with multiple objects at a variety of scales. Interestingly, unlike models trained on ImageNet, where training tends to collapse without heavy regularization (Section 4), the models trained on JFT-300M remain stable over many hundreds of thousands of iterations. This suggests that moving beyond ImageNet to larger datasets may partially alleviate GAN stability issues.

This is particularly striking when you note that ImageNet has been studied to death, and BigGAN and other GAN research have tuned hyperparameters for ImageNet using collectively thousands of runs, while almost no one uses JFT-300M and no one can afford many runs for hyperparameter tuning, and at 300x the size, any instability ought to be massively increased - unless the data scale itself is an enormously stabilizing factor such that even one's initial naive guesses at hyperparameters Just Work...?
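
For reference, the 'truncation' in the quoted passage is BigGAN's truncation trick: latents are drawn from a normal distribution, but any component larger in magnitude than a threshold σ is resampled, trading diversity (high σ) against fidelity (low σ). A rough sketch of the sampler, with the generator call left as a hypothetical placeholder:

```python
import numpy as np

def truncated_z(batch_size, dim_z, sigma, rng=np.random.default_rng(0)):
    """Sample z ~ N(0, I), resampling any component with |z_i| > sigma."""
    z = rng.standard_normal((batch_size, dim_z))
    while True:
        mask = np.abs(z) > sigma
        if not mask.any():
            return z
        z[mask] = rng.standard_normal(mask.sum())

# ImageNet BigGANs reportedly look best as sigma -> 0; the quote says the
# JFT-300M models instead peak around sigma = 0.5-1.
for sigma in (0.1, 0.5, 1.0, 2.0):
    z = truncated_z(batch_size=16, dim_z=128, sigma=sigma)
    # samples = generator(z)   # hypothetical generator
```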

This is just one of many papers showing that neural nets benefit greatly from increased model & data size*. This could be a bitter lesson for GANs: what we really needed all along was just to train on (much) larger datasets. (Training on such diverse datasets may also help fix our transfer learning problems.)

We can investigate this by ablation (training on separate datasets) and also simply by grabbing more images. We could easily pull much more than 3M Flickr images from YFCC100M, and we can crop more than just faces out of Danbooru/e621. shawwn is suspicious that e621 screws things up, but I think we can just add it in to get another million or so. Are there other datasets we could use? There are a number of illustration datasets like WikiArt, Open Images, BAM, or Derpibooru, and I think roadrunner01 has some portrait datasets that'd be useful**.

* https://arxiv.org/abs/2001.08361#openai https://arxiv.org/abs/1909.12673 https://arxiv.org/abs/1712.00409#baidu https://arxiv.org/abs/1811.03600#google https://arxiv.org/abs/2003.02139 https://arxiv.org/abs/2002.08791 https://arxiv.org/abs/2001.09977#google https://arxiv.org/abs/1912.11370 https://arxiv.org/abs/2002.11794 https://arxiv.org/abs/1906.06669v1

** unfortunately, no, roadrunner01's datasets are generally too small and aimed at transfer learning/finetuning
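
On the point above about cropping more than just faces out of Danbooru/e621: one cheap way to multiply the effective n would be to take several random square crops per full illustration. A generic sketch (the crop count, sizes, and mapping are arbitrary illustrations, not what we actually use):

```python
import tensorflow as tf

def random_square_crops(image, num_crops=4, out_size=512, min_frac=0.5):
    """Cut several random square patches from one full illustration."""
    h = tf.shape(image)[0]
    w = tf.shape(image)[1]
    short = tf.minimum(h, w)
    min_side = tf.cast(min_frac * tf.cast(short, tf.float32), tf.int32)
    crops = []
    for _ in range(num_crops):
        # random crop side between min_frac*short and the full short side
        side = tf.random.uniform([], min_side, short + 1, dtype=tf.int32)
        patch = tf.image.random_crop(image, tf.stack([side, side, 3]))
        crops.append(tf.image.resize(patch, [out_size, out_size]))
    return tf.stack(crops)

# e.g. dataset = dataset.map(random_square_crops).unbatch()
```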

shawwn commented 4 years ago

There is another factor at play: As far as I know, we're the only ones using @Skylion007's technique of "this is actually a conditional BigGAN, but we're pretending each datapoint has a random label."

I'm not sure whether that might influence the stability of BigGAN, but I can't help but wonder if a true unconditional BigGAN might have the same dynamics.
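
A minimal sketch of the idea (assumed, not the project's actual implementation): keep all of BigGAN's class-conditional machinery, but give every real image a label drawn uniformly at random, so the labels carry no information about the data:

```python
import tensorflow as tf

NUM_FAKE_CLASSES = 1000   # arbitrary; the labels are meaningless by design

def add_random_label(image):
    label = tf.random.uniform([], maxval=NUM_FAKE_CLASSES, dtype=tf.int32)
    return image, label

# dataset = dataset.map(add_random_label)
#
# G and D then condition on an embedding of this label exactly as a normal
# conditional BigGAN would, e.g. something like:
#   class_embedding = tf.keras.layers.Embedding(NUM_FAKE_CLASSES, 128)(label)
# The open question is whether this per-sample randomness behaves differently
# from a truly unconditional model.
```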

gwern commented 4 years ago

An additional note: our recently released run39 256px BigGAN was trained on Danbooru2019+anime portraits+e621+e621 portraits, for a cumulative n>3m. (The 4 runs there were not 'chaos' runs because they excluded Flickr & ImageNet.)

Despite being radically destabilized by adding the latter 3 datasets halfway through training, run39 was still training well up until we halted it for unrelated reasons around iteration 607k.

Arguably, the initial destabilization was handled primarily by #22 flood loss, but then what about the hundreds of thousands of iterations after that? The BigGAN paper notes collapse at iterations such as 125k or 200k; flood loss can stop learning at a collapse, but then it should get 'stuck' as it re-collapses every iteration without being able to escape its trap (as the BigGAN paper notes, even resetting tens of thousands of iterations earlier still results in collapse around the same time). It's possible we have started to reach the data island of stability with Danbooru2019+others. If so, mixing in the data augmentations (#29) ought to further stabilize BigGAN. We might also consider adding in additional illustration datasets (see above).
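
For context, the flooding regularizer referred to here keeps the training loss from dropping below a floor b by reflecting it around that floor: L_flood = |L - b| + b (Ishida et al. 2020). A generic sketch, with a placeholder flood level rather than whatever value a given run used:

```python
import tensorflow as tf

def flood(loss, flood_level):
    """Flooding: |loss - b| + b, so gradients push the loss back up toward b
    whenever it falls below the floor, instead of driving it to zero."""
    return tf.abs(loss - flood_level) + flood_level

# e.g. applied to the discriminator loss before taking gradients:
# d_loss = flood(d_loss, flood_level=0.3)   # 0.3 is an arbitrary placeholder
```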

> I'm not sure whether that might influence the stability of BigGAN, but I can't help but wonder if a true unconditional BigGAN might have the same dynamics.

See #26 where we speculate that the random label is possibly substituting for insufficient random noise injected into each layer as compared to StyleGAN.
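
To illustrate what 'random noise injected into each layer' means in the StyleGAN comparison: each StyleGAN generator block adds fresh per-pixel Gaussian noise scaled by a learned per-channel weight, roughly like the generic layer below (a sketch, not our model code):

```python
import tensorflow as tf

class NoiseInjection(tf.keras.layers.Layer):
    """StyleGAN-style noise: fresh per-pixel Gaussian noise, one learned
    scale per feature-map channel, added after a convolution."""

    def build(self, input_shape):
        channels = input_shape[-1]
        self.scale = self.add_weight(name='noise_scale', shape=(channels,),
                                     initializer='zeros', trainable=True)

    def call(self, x):                      # x: [batch, height, width, channels]
        noise = tf.random.normal(tf.shape(x)[:-1])      # one value per pixel
        return x + self.scale * noise[..., tf.newaxis]  # broadcast over channels
```

The speculation in #26 is that BigGAN's random label may be (weakly) playing a similar role as a per-sample source of randomness.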