Closed: gwern closed this issue 4 years ago
Thank you for drawing our attention to these concurrent works.
I suppose I mean, did you guys try path length regularization at all? If you did not, then obviously you would not have run into any issues.
If you did and you got it running, we are curious how you did it because our attempt simply failed & we have no idea how to fix it.
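For reference, here's a minimal PyTorch-style sketch of the regularizer in question (the standard StyleGAN2 formulation; the latent shape and decay constant are my assumptions, not anyone's actual code). The `create_graph=True` double-backward is the higher-order gradient step, and if augmentation ops end up inside that differentiated graph (e.g. with lazy regularization disabled), every one of them needs a working second derivative:

```python
import torch

def path_length_penalty(fake_img, latents, pl_mean, decay=0.01):
    """Sketch of StyleGAN2's path length regularizer.

    fake_img: G output, [batch, C, H, W], still attached to the graph
    latents:  the w codes fed to G, assumed shape [batch, num_layers, w_dim]
    pl_mean:  running mean of path lengths (scalar tensor)
    """
    # Random projection so a single Jacobian-vector product suffices.
    noise = torch.randn_like(fake_img) / (fake_img.shape[2] * fake_img.shape[3]) ** 0.5
    # Higher-order gradients: create_graph=True requires every op between
    # `latents` and `fake_img` to be twice-differentiable, which is exactly
    # where a non-differentiable augmentation in the graph blows up.
    (grads,) = torch.autograd.grad(
        outputs=(fake_img * noise).sum(), inputs=latents, create_graph=True
    )
    path_lengths = grads.pow(2).sum(dim=2).mean(dim=1).sqrt()
    pl_mean = pl_mean + decay * (path_lengths.mean().detach() - pl_mean)
    penalty = (path_lengths - pl_mean).pow(2).mean()
    return penalty, pl_mean
```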
It seems like they do not. See their pseudocode in Appendix C/D; in both algorithms, the G training step is very explicitly transformation-free:
It's just the standard D(G(X)) pass; there is no transformation. Aside from the pseudocode, they also strongly imply that they only augment the D step, given all the descriptions of it as "a novel way where both real and generated images are augmented before fed into the discriminator"; this would be odd phrasing if they were also augmenting while training G, because in the G step only generated images reach the discriminator, so there would be no reals to augment there.
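To spell out the asymmetry (my paraphrase of the Appendix C/D algorithms, not their actual code; `T` is the augmentation, hinge loss picked just for concreteness):

```python
import torch.nn.functional as F

def d_step(D, G, x, z, T):
    # D update: both reals and fakes pass through the augmentation T.
    real_scores = D(T(x))
    fake_scores = D(T(G(z).detach()))
    return F.relu(1 - real_scores).mean() + F.relu(1 + fake_scores).mean()

def g_step(D, G, z):
    # G update: the standard transformation-free pass, no T anywhere.
    return -D(G(z)).mean()
```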
Yes, I hope we'll be able to do it soon. Sid has code like yours working on GPU; it's just that, as usual, TPUs have unique bugs. If we could just solve that, we could easily start a BigGAN run with data augs on a TPU-512 and see for ourselves whether this is everything the 4 separate papers make it out to be... I'm intrigued that it helps even at ImageNet scale. We have several times more anime images, but is that enough to stress a BigGAN D? I hope we'll find out.
Sure!
- OK, thanks.
- That is a good point: the pseudocode and that particular description do appear to contradict each other. There's no source code for me to consult, so I will email the authors and ask; it's an important question, so we ought to get clarity on it.
- I am not disagreeing there; I believe a combination is probably going to be better than any single augmentation, and I was a little disappointed that the other papers did not do as much in that direction as I would've liked. My question is more: can you push your current data augs even further? As it stands, it seems like you could use many additional augs, and the augs you do use could potentially be stronger (see the sketch below). (Although of course, this being machine learning, nothing can be taken for granted, and it could be that you have already gone too far.)
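For concreteness, the kind of thing I have in mind is an extensible stack with a per-augmentation strength knob, along these lines (hypothetical names and values, purely to illustrate the question):

```python
import torch

# Hypothetical sketch only: each entry pairs an augmentation with a
# tunable strength in [0, 1]; none of these names/values are from the paper.
def translate(x, strength):
    # Shift by up to strength * width/8 pixels.
    max_shift = max(1, int(strength * x.shape[-1] / 8))
    dy, dx = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
    return torch.roll(x, shifts=(dy, dx), dims=(-2, -1))

def brightness(x, strength):
    # Per-image brightness jitter, scaled by strength.
    return x + strength * 0.5 * torch.randn(x.shape[0], 1, 1, 1, device=x.device)

AUG_STACK = [(translate, 0.5), (brightness, 0.3)]  # stronger = larger numbers

def augment(x, stack=AUG_STACK):
    for fn, strength in stack:
        x = fn(x, strength)
    return x
```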
Did you get a response from the authors about question 2?
The core insight of this paper, augmenting both the reals and the fakes while training D, has recently been published by (at least) 3 other papers: Zhao, Tran, and Karras (in that chronological order). A comparison and contrast with the differing results in those papers would be very useful for the README and future versions of this paper.
In particular, I would like to know:

1. Did you simply disable path length regularization in StyleGAN2, rather than work around the higher-order gradient issues?
2. Why do you think your D-only augmentation diverged, when Zhao (the first of the three) runs all their experiments with only D augmentation and no issue at all?
3. Did you experiment with stronger or weaker settings for each data augmentation, to see whether the stack of multiple augmentations is collectively too weak or too strong?

Also, one part of the paper seems ambiguous: how exactly are the data augmentations applied? Does it pick one augmentation at random per batch, one augmentation at random per image, or apply all 1/2/3 augmentations to each image as a stack? Given its emphasis on strong augmentation, the paper seems to suggest it applies them as a stack, but it never actually says (and looking at the source code didn't help).
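To make the ambiguity concrete, here are the three readings I can imagine, with toy stand-in augmentations (none of this is from the paper or its source code):

```python
import random
import torch

# Toy stand-ins for the paper's augmentations (hypothetical).
AUGS = [
    lambda x: torch.roll(x, shifts=(2, 2), dims=(-2, -1)),  # translation
    lambda x: x * (torch.rand_like(x) > 0.1).float(),       # crude cutout/dropout
    lambda x: x + 0.1 * torch.randn_like(x),                # noise/color jitter
]

def per_batch(x):
    # Reading 1: one augmentation chosen at random for the whole batch.
    return random.choice(AUGS)(x)

def per_image(x):
    # Reading 2: one augmentation chosen at random, independently per image.
    return torch.stack([random.choice(AUGS)(img) for img in x])

def stacked(x):
    # Reading 3: all augmentations applied to every image, as a stack.
    for aug in AUGS:
        x = aug(x)
    return x
```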