lucidrains / gigagan-pytorch

Implementation of GigaGAN, a new SOTA GAN out of Adobe. The culmination of nearly a decade of research into GANs
MIT License

Training plans? #17

Closed nbardy closed 1 year ago

nbardy commented 1 year ago

I've got a bunch of compute for the next couple of weeks and am thinking of training this on LAION.

Wondering if there is any other training going on right now. Would hate to duplicate efforts too much.

lucidrains commented 1 year ago

Going to try to get the training code running tomorrow and get the unconditioned model converging.

Surprised how different the upscaler code looks from the generator. The generator also has some stuff like skip_layer_excite that looks nice.

Glad to hear your other project is wrapped up. What was it? Open source?

hey, the PR looks great! yes, we can adopt a styling convention since it is clear you are a seasoned engineer from first glance

i heard ruff is all the rage these days?

lucidrains commented 1 year ago

re: other project - there's been a small breakthrough leading to a few SOTAs in the geometric deep learning space (which is still being used for molecules / proteins): https://arxiv.org/abs/2302.03655. the math was quite hairy, so it took me nearly a week or two to nail down

lucidrains commented 1 year ago

@nbardy yes, i can get the unconditional training code underway and done by end of the day, and then move towards text-conditioned

i noticed you are also trying Lion, but i would caution against using it, as the paper never explored it in the GAN setting

lucidrains commented 1 year ago

@nbardy for skip layer excitation, i tried it in a unet setting some time back and didn't see much of an improvement; i think the concatenative skip connections do most of the heavy lifting already. but i'm not much of an experimentalist, and i just didn't see a dramatic improvement in the first 5k steps. i could add it
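
For reference, a minimal sketch of a skip-layer excitation block along the lines of lightweight-gan / FastGAN; the class name and arguments here are illustrative, not this repo's API:

```python
import torch
from torch import nn

class SkipLayerExcitation(nn.Module):
    """Gate a high-res feature map channel-wise using a squeeze of a low-res one (FastGAN-style)."""
    def __init__(self, dim_low, dim_high):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),          # squeeze the low-res map to 4x4
            nn.Conv2d(dim_low, dim_high, 4),  # -> (batch, dim_high, 1, 1)
            nn.SiLU(),
            nn.Conv2d(dim_high, dim_high, 1),
            nn.Sigmoid()
        )

    def forward(self, feat_high, feat_low):
        # broadcasted channel-wise gating of the high-res features
        return feat_high * self.net(feat_low)
```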

nbardy commented 1 year ago

Thanks for the update. Code looks great.

nbardy commented 1 year ago

re: other project - there's been a small breakthrough leading to a few SOTAs in the geometric deep learning space (which is still being used for molecules / proteins): https://arxiv.org/abs/2302.03655. the math was quite hairy, so it took me nearly a week or two to nail down

This looks super interesting. Wish I had more time for the geometric stuff.

lucidrains commented 1 year ago

@nbardy ran out of steam, will work more on the training code tomorrow morning!

nbardy commented 1 year ago

Is it obvious to you what the big contributions in this paper are?

Seems like the adaptive kernel is important to keep the parameter count down. And then just lots of tricks to keep the training stable at scale.

I will go through tomorrow and try to line up the hyper-parameters with the different paper models.

nbardy commented 1 year ago

Do you think LION will fail on a smaller model?

Looking at trying a few optimizers across a sweep to start. Distributed Shampoo and Adam are the top candidates right now. Boris has a lot of positive notes on Shampoo for similarly sized (~460M param) models for dalle-mini

lucidrains commented 1 year ago

@nbardy i would just stick with Adam, as most of the architectural tricks we know probably overfit to Adam, for GAN training

lucidrains commented 1 year ago

Is it obvious to you what the big contributions in this paper are?

Seems like the adaptive kernel is important to keep the parameter count down. And then just lots of tricks to keep the training stable at scale.

I will go through tomorrow and try to line up the hyper-parameters with the different paper models.

i would say adaptive conv kernel, incorporation of the l2 distance self attention, as well as the scale invariant discriminator

a bit of a bag-of-tricks paper, but the results are what count
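
For anyone following along, a rough sketch of the adaptive kernel idea as I understand it from the paper: a per-sample soft selection over a small bank of conv kernels, with the selection weights predicted from the style / conditioning code. The class and argument names below are made up, and the paper combines this with weight modulation, so treat it as a sketch only:

```python
import torch
from torch import nn
import torch.nn.functional as F

class AdaptiveConv2d(nn.Module):
    """Per-sample convolution kernel chosen as a soft mixture over a small kernel bank,
    with the mixture weights predicted from a conditioning vector (e.g. the style code)."""
    def __init__(self, dim_in, dim_out, kernel_size, num_kernels, dim_cond):
        super().__init__()
        self.kernel_bank = nn.Parameter(
            torch.randn(num_kernels, dim_out, dim_in, kernel_size, kernel_size) * 0.02
        )
        self.to_selection = nn.Linear(dim_cond, num_kernels)
        self.padding = kernel_size // 2

    def forward(self, x, cond):
        b = x.shape[0]
        selection = self.to_selection(cond).softmax(dim = -1)               # (b, num_kernels)
        kernels = torch.einsum('bk,koihw->boihw', selection, self.kernel_bank)
        # run as a grouped convolution so every sample gets its own mixed kernel
        x = x.reshape(1, -1, *x.shape[-2:])
        kernels = kernels.reshape(-1, *kernels.shape[-3:])
        out = F.conv2d(x, kernels, padding = self.padding, groups = b)
        return out.reshape(b, -1, *out.shape[-2:])
```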

lucidrains commented 1 year ago

the truth is, any of these concepts would benefit DDPMs as well.. but let's just keep moving forward. people can just pip install this library and experiment with the separate modules

never thought i'd be doing GANs after all this time

lucidrains commented 1 year ago

made a tiny bit of progress; unfortunately unconditional image synthesis didn't work on the first try

I'll try to debug what's going on tomorrow morning - still need to account for the aux losses and gradient penalty. I think the order of attack would be to get a small upsampler training on one machine before going fully distributed

nbardy commented 1 year ago

Exciting, I have cleared my schedule tomorrow and next week to work only on training the upsampler.

Correct me if I'm wrong, but looking at the code it looks like the input width and height are fixed to the model architecture size. It's so fast we can tile it at inference time for different resolutions. Shouldn't be a problem.

lucidrains commented 1 year ago

@nbardy ah yea, i don't think different aspect ratios were used in the paper? might be a nice-to-have; we can start with square images and get that working for starters

lucidrains commented 1 year ago

revisiting all this complicated GAN training code, all I can say is, thank god for denoising diffusion

lucidrains commented 1 year ago

hmm, no, there's still something wrong; training blows up even when i add gradient penalty (usually adding GP is enough for me to stabilize training early on). maybe there's a bug in the discriminator
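
For context, the gradient penalty here is presumably along the lines of the usual R1-style penalty on real images (as in stylegan2-pytorch / lightweight-gan); a minimal sketch, not necessarily this repo's exact formulation:

```python
import torch
from torch.autograd import grad

def gradient_penalty(images, logits, weight = 10.):
    """R1-style penalty: squared norm of d(logits)/d(images) on real samples."""
    gradients, = grad(
        outputs = logits,
        inputs = images,
        grad_outputs = torch.ones_like(logits),
        create_graph = True,   # so the penalty itself can be backpropagated through
        retain_graph = True
    )
    gradients = gradients.reshape(images.shape[0], -1)
    return weight * gradients.square().sum(dim = -1).mean()

# usage sketch: call images.requires_grad_() before the discriminator forward pass
```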

lucidrains commented 1 year ago

@nbardy you want to give it a try? we should make sure it can work for mnist (I'm using the old Oxford flowers dataset)

lucidrains commented 1 year ago

@nbardy ok, i'm not sure what's going on, going to give up for the day

next plan of attack will probably be to copy-paste the generator and discriminator from my working repositories (stylegan2-pytorch or lightweight-gan) and work from the bottom up. an alternative idea would be to pare down the generator and discriminator here and plug them into the working stylegan2-pytorch repo, which would validate which modules are working or not, piece by piece

nbardy commented 1 year ago

Thanks for the updates.

I split off the distributed train script and have been working on getting this train script to run. https://github.com/nbardy/gigagan-pytorch/blob/main/training_scripts/train_simple.py

I can move over to testing your train script.

lucidrains commented 1 year ago

ok cool! yeah I'll resume trying to debug the system tomorrow morning

nbardy commented 1 year ago

I can try testing the discriminators as classifiers today.

lucidrains commented 1 year ago

@nbardy hey, good timing! i'm about to test the generator in lightweight gan, and rule out that the issue rests in the generator for starters

lucidrains commented 1 year ago

@nbardy do you want to chat on Signal btw? (may be more convenient for a lot of back and forth) i can send you my number through email

nbardy commented 1 year ago

Yeah, can you email me your Signal? Just waking up, will start work in a few hours.

Nicholasbardy@gmail.com

lucidrains commented 1 year ago

@nbardy ok cool, i'm actually nearing the end of work and doing park stuff with the doggo for the rest of the day. will know more about whether the generator is borked or not in half an hour

let me send you my number

lucidrains commented 1 year ago

ok, generator is not the issue! will move on to gigagan discriminator tomorrow

lucidrains commented 1 year ago

11k steps for gigagan generator paired with lightweight gan discriminator

looks ok

lucidrains commented 1 year ago

the bug is probably in the discriminator somewhere; let me throw a few hours at it this morning and see if i can find it. pretty sure i can find it by tomorrow's end (as well as solidify all the auxiliary losses)

nbardy commented 1 year ago

Awesome!

Sorry, I have not been working this weekend. Lying down to rest. My neck has been in pain this week.

lucidrains commented 1 year ago

hey no worries, rest up!

will need to move on to some other work mid-next week, but i'm sure it'll be semi-functional by then, save for probably one or two aux losses + distributed. just so you can plan ahead for work

lucidrains commented 1 year ago

good news, have gigagan training using the lightweight gan peripheral training code. losses look ok; the generator loss is still a bit on the high side, but stabilizing

training in the gigagan repo itself is still borked, so either the self-supervised loss is crucial, or it is something else

will resume tomorrow morning; have a great rest of your Sunday and hope your neck feels better!

lucidrains commented 1 year ago

ok further good news, validated that multi-scale inputs and the scale invariant code + skip layer excitation all work over at lightweight gan. converges much more nicely too

so maybe the issue was just with torch.cdist and / or any remaining issues with the gan training loop
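
For reference, the l2 distance attention mentioned above just swaps the dot-product similarity for a negative squared L2 distance; a minimal sketch of the idea (not the repo's exact implementation), which is where torch.cdist comes in:

```python
import torch

def l2_distance_attention(q, k, v, scale = None):
    """Attention with similarity = negative squared L2 distance instead of a dot product."""
    scale = scale if scale is not None else q.shape[-1] ** -0.5
    sim = -torch.cdist(q, k) ** 2 * scale   # (batch, i, j); smaller distance -> higher similarity
    attn = sim.softmax(dim = -1)
    return attn @ v

# q: (batch, n, dim), k / v: (batch, m, dim)
q, k, v = torch.randn(2, 64, 32), torch.randn(2, 77, 32), torch.randn(2, 77, 32)
out = l2_distance_attention(q, k, v)        # (2, 64, 32)
```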

lucidrains commented 1 year ago

training is now stable in the main repo, even without reconstruction loss 👌 turns out the GLU doesn't work that hot in this setting, so i removed it

nbardy commented 1 year ago

:) Exciting. Managed to cancel all my meetings this week.

have you tested the discriminators yet?

lucidrains commented 1 year ago

@nbardy aha yea, that's a good first step towards productivity

i haven't done the hinge loss for the multiscale outputs yet, but i reckon it should be fine. should know before noon
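
Roughly what the hinge loss over the multiscale logits might look like, assuming the discriminator returns a list of logits, one per scale (a sketch, not the repo's final implementation):

```python
import torch
import torch.nn.functional as F

def discriminator_hinge_loss(real_logits_list, fake_logits_list):
    """Hinge loss summed over the discriminator outputs at every scale."""
    loss = 0.
    for real, fake in zip(real_logits_list, fake_logits_list):
        loss = loss + F.relu(1. - real).mean() + F.relu(1. + fake).mean()
    return loss

def generator_hinge_loss(fake_logits_list):
    return sum(-fake.mean() for fake in fake_logits_list)
```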

lucidrains commented 1 year ago

yea, it is working with the multiscale logits being involved, but the loss is very rocky for the first 1k steps

i'll let it run until 5-10k steps and see if it stabilizes. worst comes to worst, i can always taper in the multi-scale contribution

lucidrains commented 1 year ago

ok, once i took out the gradient penalty contributions for multi-scale logits, training is back to being very stable

let us roll with that for now!

lucidrains commented 1 year ago

[screenshot: training samples, 2023-07-17]

looking great - will move on to the rest of the losses, accelerate integration, and text conditioning tomorrow

do you know if the text encoder was shared between discriminator and generator, or separate?

nbardy commented 1 year ago

I'm able to get the upscaler and base gan running locally.

Getting a missing-op error when the gradient penalty is applied: `RuntimeError: derivative for aten::linear_backward is not implemented`

Seems like a MacBook-related thing; I'll just ignore it for now.

I'm going to see if I can get it running on TPUs this afternoon.
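
If it helps, one local workaround (a sketch; `apply_gradient_penalty` is a hypothetical flag, not necessarily this repo's API) is to only enable the penalty when not on MPS, since the missing derivative is the double backward that the penalty requires:

```python
import torch

# the failing op is the double backward (create_graph=True) that the gradient penalty
# needs, which the MPS backend doesn't implement, so skip the penalty on Apple silicon
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
apply_gradient_penalty = device.type != 'mps'
```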

nbardy commented 1 year ago

[figure from the paper]

Looks like the same text encoder from the generator. They use only the global code.

lucidrains commented 1 year ago

I'm able to get the upscaler and base gan running locally.

Getting a missing-op error when the gradient penalty is applied: `RuntimeError: derivative for aten::linear_backward is not implemented`

Seems like a MacBook-related thing; I'll just ignore it for now.

I'm going to see if I can get it running on TPUs this afternoon.

ok cool, i'm moving onto another project for the rest of the day; will throw a bit more hours at this tomorrow morning

lucidrains commented 1 year ago

[figure from the paper]

Looks like the same text encoder from the generator. They use only the global code.

so they have a 'few learnable attention layers' in addition to the CLIP text encoder. i guess i'm wondering if that is shared between the generator and discriminator or not

probably safest just to learn them separately

francqz31 commented 1 year ago

Based on the paper, it seems that GigaGAN uses separate text encoders for the generator and discriminator:

1. For the generator, it extracts text features from CLIP and processes them through additional learned attention layers T to get text embeddings t.
2. For the discriminator, it similarly applies a pretrained text encoder like CLIP, followed by extra learned attention layers, to get the text descriptor t_D.

The paper mentions using t_local and t_global for the generator, and just a global descriptor t_D for the discriminator. So the text encoders have similar architectures (pretrained CLIP + extra attention) but with separate learned parameters. The motivation is that the generator and discriminator have different requirements for the text embedding: the generator needs both local word-level features t_local and global sentence-level features t_global, while the discriminator only needs to extract an overall global descriptor of the text prompt for its real/fake prediction. So using separate encoders allows customizing them for each task.
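
A rough sketch of how those separate encoders could be wired up; the class name, depth, and the choice of taking the last token as the global code are assumptions, not the paper's or this repo's exact architecture:

```python
import torch
from torch import nn

class LearnedTextEncoder(nn.Module):
    """A few learned attention layers on top of frozen CLIP token features.
    The generator and discriminator would each own a separate instance."""
    def __init__(self, dim, depth = 4, heads = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim_feedforward = dim * 4, batch_first = True)
        self.attn_layers = nn.TransformerEncoder(layer, depth)

    def forward(self, clip_token_features):
        # clip_token_features: (batch, seq, dim) per-token features from a frozen CLIP text tower
        t = self.attn_layers(clip_token_features)
        t_local, t_global = t[:, :-1], t[:, -1]   # word-level features + one global code
        return t_local, t_global

# same architecture, separate learned parameters
text_enc_G = LearnedTextEncoder(dim = 512)   # generator uses t_local and t_global
text_enc_D = LearnedTextEncoder(dim = 512)   # discriminator would only use t_global
```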

lucidrains commented 1 year ago

@francqz31 nice find!

nbardy commented 1 year ago

I was not able to find the t_local and t_global sizes in the paper.

nbardy commented 1 year ago

Reading through the training details. Some notes on datasets and model sizes from the paper.

with the exception of the 128-to-1024 upsampler model trained on Adobe's internal Stock images.

That is the 8x upsampler that gives the stunning results in the paper.

Unfortunately its hyper-parameters are not in the paper, but I imagine it would be about the same size, maybe a little deeper to get some higher-resolution features. Should take less compute than the text-conditioned upscalers.

Also interesting:

Additionally, we train a separate 256px class-conditional upsampler model and combine them with end-to-end finetuning stage.

Does this mean training the text->image and upsampler models in series for fine-tuning? I hadn't noticed that before.
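
If it does mean finetuning them in series, a heavily hedged sketch of what a joint step could look like; every function and argument name here is a placeholder, not this repo's API:

```python
import torch

def end_to_end_finetune_step(base_generator, upsampler, discriminator, text_embeds, opt):
    low_res = base_generator(text_embeds)          # e.g. 64px output from the base model
    high_res = upsampler(low_res, text_embeds)     # upsampled output, e.g. 256px or 1024px
    fake_logits = discriminator(high_res, text_embeds)
    loss = -fake_logits.mean()                     # generator-side hinge loss at the final resolution
    opt.zero_grad()
    loss.backward()                                # gradients flow through both generator and upsampler
    opt.step()
    return loss.item()
```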

lucidrains commented 1 year ago

ok, finished the text-conditioning logic for both base and upsampler

going to start wiring up accelerate probably this afternoon (as well as some hparams for more efficient recon and multi-scale losses)

lucidrains commented 1 year ago

will also aim to get the eval for both base and upsampler done, using what @CerebralSeed pull requested as a starting point. then we can see the GAN working for some toy datasets for unconditional training

lucidrains commented 1 year ago

@nbardy or were you planning on doing the distributed stuff with accelerate + ray today? just making sure no overlapping work