mkumar10 opened 3 years ago
I'm working on equivariant attention for alphafold2 atm, will get back to this!
@mkumar10 hey! I hooked up the training code today :) I haven't really sampled images to see if it is working, but at least everything runs
hmm, it's not working for me :( I'll try to debug it next weekend
Hi Phil,
Thank you for working on this!
I attempted to train the model v0.11 and found that it causes memory issues with a 16 GB GPU. I used the default setup (`image_size = 128, dim = 512`) and a very small dataset (~a dozen images) to check if it runs. Did you run into similar issues during testing?
What data do you use for your training?
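For reference, this is roughly what I ran, following the README example (the folder path is a placeholder, and the exact keyword names may differ between versions):

```python
import torch
from pi_gan_pytorch import piGAN, Trainer

# default setup from the README; OOMs on my 16 GB GPU
gan = piGAN(
    image_size = 128,
    dim = 512
).cuda()

trainer = Trainer(
    gan = gan,
    folder = '/path/to/images'  # placeholder; my test set is ~a dozen images
)

trainer()
```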
@janhenrikbern yeah, I've faced the same issue when trying to run it on Colab with a 16 GB GPU. Even reducing to `image_size=64, dim=128` doesn't help; it seems to change nothing. I get the same training-time estimate of ~30 hours, and then the training process crashes on the 80th iteration.
@janhenrikbern try decreasing the batch size. The default `batch_size = 8` will not fit in a 16 GB GPU; I've decreased the batch size to `2` and now occupy 9.5 GB of memory. (Sketch below.)
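Concretely, something like this, assuming the `Trainer` exposes a `batch_size` keyword (the folder path is a placeholder):

```python
trainer = Trainer(
    gan = gan,                   # constructed as in the README example above
    folder = '/path/to/images',
    batch_size = 2               # down from the default of 8; ~9.5 GB on a 16 GB card
)
```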
@Godofnothing Thanks for the update, trying this myself now!
Were you able to learn anything?
@janhenrikbern unfortunately, after 10000 iterations the training terminated again. One needs to keep in mind that the resolution is increased gradually from 32 to 128, so the batches become heavier as training progresses. I'll try to rerun it.
@janhenrikbern any attempt to run pi-GAN with the image resolution going up to 128 fails, even with the smallest batch size of `1`. If one restricts oneself to small final resolutions, the training does not terminate. However, I have not tested on good enough data, so maybe that is why I did not succeed in getting meaningful images. Another thing confusing me is that the generator loss can be negative, whereas I would expect the loss to be a non-negative number. I am trying to make my own implementation with PyTorch Lightning, with AMP enabled in the `Trainer`. Maybe it will work.
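A minimal sketch of the Lightning setup I have in mind (Lightning 1.x flags; `PiGANModule` is a hypothetical `LightningModule` wrapping the pi-GAN generator and discriminator, not part of this repo):

```python
import pytorch_lightning as pl

trainer = pl.Trainer(
    gpus = 1,            # single 16 GB GPU
    precision = 16,      # AMP: run forward/backward passes in mixed precision
    max_steps = 10_000,
)

# model = PiGANModule(image_size = 64, dim = 128)  # hypothetical wrapper module
# trainer.fit(model)
```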
Nevertheless, such memory consumption seems strange: in the original paper the authors managed to train the network with batches of size 120 at the initial stage, decreasing to 12 at the highest 128 x 128 resolution. They had 2x RTX 6000 GPUs, 48 GB of memory in total, so in principle it seems one should be able to run a batch size of 3 or 4 at the highest resolution on a 16 GB card.
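Back-of-the-envelope, assuming memory scales roughly linearly with batch size at a fixed resolution:

```python
paper_batch  = 12   # paper's batch size at 128 x 128
paper_mem_gb = 48   # 2x RTX 6000
our_mem_gb   = 16

print(paper_batch * our_mem_gb / paper_mem_gb)  # 4.0 -> a batch of 3-4 should fit
```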
Hello people, has anyone successfully gotten any results (even with a small batch size)? Thanks
@krips89 we have created our own version in PyTorch Lightning and it runs successfully; however, with our computing resources the obtained quality was poor.
@janhenrikbern @krips89 actually, I've found a problem in this implementation. When accumulating the generator and discriminator losses, the loss tensors are not detached, so their computation graphs are kept in memory. An illustration of the pattern is below.
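Illustrating the pattern (variable names are mine, not the repo's):

```python
# Buggy pattern: summing live loss tensors keeps every step's computation
# graph alive, so memory grows with each accumulation step.
total_g_loss = total_g_loss + g_loss

# Fix: backprop each step's loss right away, then keep only a detached
# value for accumulation/logging so the graph can be freed.
(g_loss / accum_steps).backward()
total_g_loss = total_g_loss + g_loss.detach()  # or g_loss.item()
```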
Thank you for keeping us in the loop @Godofnothing! Did detaching fix the issue for you?
@janhenrikbern The adapted version in PyTorch Lightning that I implemented works, but the results are not satisfactory. Actually, as mentioned in other issues, there are some discrepancies between this implementation and the architecture described in the original paper, such as the FiLM conditioning. I've contacted the authors of the paper, and they say they will release their code fairly soon, after fixing some issues and tidying it up. I would recommend waiting for the official release.
Has anyone successfully reproduced the result? Thanks~ :)
Seems to be missing the training code