mkumar10 opened 3 years ago
I'm working on equivariant attention for alphafold2 atm, will get back to this!
@mkumar10 hey! I hooked up the training code today :) I haven't really sampled images to see if it is working, but at least everything runs
hmm, it's not working for me :( I'll try to debug it next weekend
Hi Phil,
Thank you for working on this!
I attempted to train the model v0.11 and found that it causes memory issues with a 16 GB GPU. I used the default setup (`image_size = 128, dim = 512`) and a very small dataset (~a dozen images) to check if it runs. Did you run into similar issues during testing?
What data do you use for your training?
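For reference, this is roughly what I ran, following the README example (the folder path is a placeholder, and the exact keyword names may differ between versions):

```python
import torch
from pi_gan_pytorch import piGAN, Trainer

# default setup from the README; OOMs on my 16 GB GPU
gan = piGAN(
    image_size = 128,
    dim = 512
).cuda()

trainer = Trainer(
    gan = gan,
    folder = '/path/to/images'  # placeholder; my test set is ~a dozen images
)

trainer()
```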
@janhenrikbern yeah, I've faced the same issue when trying to run it on Colab with a 16 GB GPU. Even reducing to `image_size=64, dim=128` doesn't help; it seems to change nothing. I get the same training-time estimate of ~30 hours, and then the training process crashes on the 80th iteration.
@janhenrikbern try decreasing the batch size. The default `batch_size = 8` will not fit in a 16 GB GPU; I've decreased the batch size to `2` and now occupy 9.5 GB of memory. (Sketch below.)
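Concretely, something like this, assuming the `Trainer` exposes a `batch_size` keyword (the folder path is a placeholder):

```python
trainer = Trainer(
    gan = gan,                   # constructed as in the README example above
    folder = '/path/to/images',
    batch_size = 2               # down from the default of 8; ~9.5 GB on a 16 GB card
)
```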
@Godofnothing Thanks for the update, trying this myself now!
Were you able to learn anything?
@janhenrikbern unfortunately, after 10000 iterations the training terminated again. One needs to keep in mind that the resolution is increased gradually from 32 to 128, so the batches become heavier as training progresses. I'll try to rerun it.
@janhenrikbern any attempt to run pi-GAN with the image resolution going up to 128 fails, even with the smallest batch size of `1`. If one restricts oneself to small final resolutions, the training does not terminate. However, I have not tested on good enough data, so maybe that is why I did not succeed in getting meaningful images. Another thing confusing me is that the generator loss can be negative, whereas I would expect the loss to be a non-negative number. I am trying to make my own implementation with PyTorch Lightning, with AMP enabled in the `Trainer`. Maybe it will work.
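A minimal sketch of the Lightning setup I have in mind (Lightning 1.x flags; `PiGANModule` is a hypothetical `LightningModule` wrapping the pi-GAN generator and discriminator, not part of this repo):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus = 1,            # single 16 GB GPU
    precision = 16,      # AMP: run forward/backward passes in mixed precision
    max_steps = 10_000,
)

# model = PiGANModule(image_size = 64, dim = 128)  # hypothetical wrapper module
# trainer.fit(model)
```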
Nevertheless, such memory consumption seems strange: in the original paper the authors managed to train the network with batches of size 120 at the initial stage, decreasing to 12 at the highest 128 x 128 resolution. They had 2x RTX 6000 GPUs, 48 GB of memory in total, so in principle it seems one should be able to run a batch size of 3 or 4 at the highest resolution on a 16 GB card.
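Back-of-the-envelope, assuming memory scales roughly linearly with batch size at a fixed resolution:

```python
paper_batch  = 12   # paper's batch size at 128 x 128
paper_mem_gb = 48   # 2x RTX 6000
our_mem_gb   = 16

print(paper_batch * our_mem_gb / paper_mem_gb)  # 4.0 -> a batch of 3-4 should fit
```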
Hello people, has anyone successfully gotten any results (even with a small batch size)? Thanks
@krips89 we have created our own version in PyTorch Lightning and it runs successfully; however, with our computing resources the obtained quality was poor.
@janhenrikbern @krips89 actually, I've found a problem in this implementation. When accumulating the generator and discriminator losses, the loss tensors are not detached, so their computation graphs are kept in memory. An illustration of the pattern is below.
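Illustrating the pattern (variable names are mine, not the repo's):

```python
# Buggy pattern: summing live loss tensors keeps every step's computation
# graph alive, so memory grows with each accumulation step.
total_g_loss = total_g_loss + g_loss

# Fix: backprop each step's loss right away, then keep only a detached
# value for accumulation/logging so the graph can be freed.
(g_loss / accum_steps).backward()
total_g_loss = total_g_loss + g_loss.detach()  # or g_loss.item()
```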
Thank you for keeping us in the loop @Godofnothing! Did detaching fix the issue for you?
@janhenrikbern The adapted version in PyTorch Lightning that I implemented works, but the results are not satisfactory. Actually, as mentioned in other issues, there are some discrepancies between this implementation and the architecture described in the original paper, such as the FiLM conditioning. I've contacted the authors of the paper, and they say they will release their code fairly soon, after fixing some issues and tidying it up. I would recommend waiting for the official release.
Has anyone successfully reproduced the result? Thanks~ :)
Seems to be missing the training code