mit-han-lab / data-efficient-gans

[NeurIPS 2020] Differentiable Augmentation for Data-Efficient GAN Training
https://arxiv.org/abs/2006.10738
BSD 2-Clause "Simplified" License

Comparison with the 3 other DiffAugment papers published #1

Closed: gwern closed this issue 4 years ago

gwern commented 4 years ago

The core insight of this paper, that data augmentation should be applied to both the reals and the fakes while training D, has recently been published by (at least) 3 other papers: Zhao, Tran, and Karras (in that chronological order). A comparison and contrast with the differing results in those papers would be very useful for the README & future versions of this paper.

In particular, I would like to know: did you simply disable path length regularization in StyleGAN2 rather than work around the higher-order gradient issues? Why do you think your D-only augmentation diverged when Zhao (the first) does all their experiments with only D augmentation without any issue at all? Did you experiment with stronger or weaker settings for each data augmentation to understand whether the stack of multiple data augmentations is collectively too weak or too strong? Also, one part of the paper seems ambiguous: how exactly are the data augmentations applied? Does it pick one augmentation at random per batch, one augmentation per image, or does it apply all 1/2/3 augmentations to each image as a stack? Given the emphasis on strong augmentation, the paper seems to suggest it's applying them as a stack, but it never actually seems to say (and looking at the source code didn't help).

zsyzzsoft commented 4 years ago

We thank you for your attention to these concurrent works.

  1. Did you simply disable path length regularization in StyleGAN2 rather than work around the higher-order gradient issues? I'm not sure what you mean by "higher-order gradient issues", but path length regularization is disabled only because it does not contribute to the FID.
  2. Why do you think your D-only augmentation diverged when Zhao (the first) does all their experiments with only D augmentation without any issue at all? It seems to me that they also augment not only for D but also for G.
  3. Did you experiment with stronger or weaker settings for each data augmentation to understand if the stack of multiple data augmentations is collectively too weak or too strong? We show that simply applying a combination of fixed types of augmentations is good enough to achieve good results. Tuning the strength of each augmentation is not needed to understand whether the collective effect is too weak or too strong; that can be judged by looking at the discriminator's validation accuracy.
  4. Also, one part of the paper seems ambiguous: how exactly are the data augmentations done - does it pick one augmentation at random per batch, one augmentation per image, or does it apply all 1/2/3 augmentations to each image as a stack? It is applied as a stack (see the sketch after this list).
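
For concreteness, here is a minimal PyTorch-style sketch of one training iteration under this scheme (answers 2 and 4 above): the same stack of augmentations is applied to both the reals and the fakes in the D update, and again to the fakes in the G update, with gradients flowing through it. The names `diff_augment`, `G`, `D`, `opt_G`, `opt_D`, and `z_dim` are placeholders rather than the repository's actual API, and the plain non-saturating loss stands in for the exact StyleGAN2/BigGAN objectives; note also that no path length regularization term appears, consistent with answer 1.

```python
import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, reals, z_dim, diff_augment):
    """One GAN iteration with the same differentiable augmentation stack
    applied to reals and fakes for the D update, and to fakes for the
    G update (a sketch, not the repository's exact code)."""
    batch = reals.size(0)

    # --- D update: augment both reals and fakes before D sees them ---
    z = torch.randn(batch, z_dim, device=reals.device)
    fakes = G(z).detach()                      # no G gradients in the D step
    d_real = D(diff_augment(reals))
    d_fake = D(diff_augment(fakes))
    d_loss = F.softplus(-d_real).mean() + F.softplus(d_fake).mean()
    opt_D.zero_grad()
    d_loss.backward()
    opt_D.step()

    # --- G update: the same augmentation, kept differentiable so that
    # --- gradients reach G through the augmented fakes
    z = torch.randn(batch, z_dim, device=reals.device)
    g_loss = F.softplus(-D(diff_augment(G(z)))).mean()
    opt_G.zero_grad()
    g_loss.backward()
    opt_G.step()

    return d_loss.item(), g_loss.item()
```

Applying `diff_augment` only in the D update, and leaving the G update as a plain `D(G(z))` pass, would be the "D-only" variant discussed above.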
gwern commented 4 years ago
  1. I suppose I mean, did you guys try path length regularization at all? If you did not, then obviously you would not have run into any issues.

    If you did and you got it running, we are curious how you did it because our attempt simply failed & we have no idea how to fix it.

  2. It seems like they do not. Look at their pseudocode in Appendix C/D: in both algorithms, the G training step is very explicitly transformation-free:

    [screenshot of the G training step from their Appendix C/D pseudocode]

    It's just the standard D(G(X)) pass; there is no transformation. They also strongly imply that they only augment the discriminator's samples, given all of their descriptions of it as "a novel way where both real and generated images are augmented before fed into the discriminator"; aside from the pseudocode, this would be odd phrasing if they were also augmenting while training G, because only the reals would be augmented there.

  3. Sure, but you only did a limited number of runs. How do you know that the translate+cutout+color run is the best possible, and that it could not be made better by increasing (or decreasing) translate from 1/8, and so on? At least in the first Zhao paper, the strength of each data aug directly affected the final quality, and not in any simple linear way; it was possible to be too strong (particularly for the color augs).
  4. Thanks.
zsyzzsoft commented 4 years ago
  1. We did not try it at all.
  2. On their page 4, Section 3.2: "Different from augmenting real images, we keep the gradients for augmented generated images to train the generator." I suppose their pseudocode has a mistake.
  3. What I want to say is that a simple combination strategy is good enough to achieve good results; we do not claim that a combination is always better than applying a single augmentation (you may try extensive hyperparameter tuning), and that is not really our focus.
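
To illustrate what a "stack with tunable strengths" might look like, here is a rough sketch of the three augmentation types under discussion (translation up to 1/8 of the image size, cutout, and a brightness jitter standing in for the color augs), written as plain differentiable tensor ops so that gradients pass through the augmented fakes to the generator, which is exactly the point of the Section 3.2 quote above. This is an illustration only, not the repository's `DiffAugment` implementation: for brevity it samples one translation and one cutout per batch, whereas a full implementation would draw them independently per image and include more color operations. The resulting `diff_augment` function could be plugged into the training-step sketch given earlier in this thread.

```python
import torch
import torch.nn.functional as F

def rand_brightness(x, strength=0.5):
    # add a random per-image brightness shift in [-strength/2, strength/2]
    shift = (torch.rand(x.size(0), 1, 1, 1, device=x.device) - 0.5) * strength
    return x + shift

def rand_translation(x, ratio=0.125):
    # shift the batch by up to `ratio` of the image size, zero-padded
    _, _, h, w = x.shape
    max_dy, max_dx = int(h * ratio), int(w * ratio)
    dy = torch.randint(-max_dy, max_dy + 1, ()).item()
    dx = torch.randint(-max_dx, max_dx + 1, ()).item()
    x = F.pad(x, (max_dx, max_dx, max_dy, max_dy))
    return x[:, :, max_dy + dy:max_dy + dy + h, max_dx + dx:max_dx + dx + w]

def rand_cutout(x, ratio=0.5):
    # zero out one random square patch covering `ratio` of each side
    _, _, h, w = x.shape
    ch, cw = int(h * ratio), int(w * ratio)
    cy = torch.randint(0, h - ch + 1, ()).item()
    cx = torch.randint(0, w - cw + 1, ()).item()
    mask = torch.ones_like(x)
    mask[:, :, cy:cy + ch, cx:cx + cw] = 0
    return x * mask

def diff_augment(x, translate=0.125, cutout=0.5, brightness=0.5):
    # applied as a stack: every image receives all three augmentations,
    # each with its own strength knob that could be tuned up or down
    return rand_cutout(rand_translation(rand_brightness(x, brightness),
                                        translate), cutout)
```

Because everything above is built from padding, slicing, and elementwise arithmetic, gradients flow through the augmentation unchanged; wrapping it in `torch.no_grad()` during the G update is what would break generator training.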
gwern commented 4 years ago
  1. OK, thanks.
  2. That is a good point: the pseudocode and that particular description appear to contradict each other. There's no source code for me to consult, so I will email them and ask. This is an important point, so we ought to get clarity on it.
  3. I am not disagreeing there; I believe a combination is probably going to be better than a single augmentation, and I was a little disappointed that the other papers did not do as much in that direction as I would've liked. My question is more, can you push your current data augs even further? As it is, it seems like you could use many additional augs and the augs you use could potentially be stronger. (Although of course, this being machine learning, nothing can be taken for granted and it could be that you have already gone too far.)
zsyzzsoft commented 4 years ago
  1. Yes, it is true that more augmentations and more sophisticated hyperparameter tuning could be applied, but that would be much more complicated for us and also for users. You are welcome to explore further using our code; we would be glad to hear from you.
gwern commented 4 years ago

Yes, I hope we'll be able to do it soon. Sid has code like yours working on GPU; it's just that, as usual, TPUs have unique bugs. If we could solve that, we could easily start a BigGAN run with data augs on a TPU-512 and see for ourselves if this is all that the 4 separate papers make it out to be... I'm intrigued that it helps even at ImageNet scale. We have several times more anime images, but is that enough to stress a BigGAN D? I hope we'll find out.

zsyzzsoft commented 4 years ago

Sure!

lioo717 commented 4 years ago
  • That is a good point: the pseudocode and that particular description appear to contradict each other. There's no source code for me to consult, so I will email them and ask. This is an important point, so we ought to get clarity on it.

Did you get a response from the authors about question 2?