lucidrains / gigagan-pytorch

Implementation of GigaGAN, the new SOTA GAN out of Adobe. The culmination of nearly a decade of research into GANs
MIT License
1.71k stars · 96 forks

Training plans? #17

Closed · nbardy closed this 11 months ago

nbardy commented 1 year ago

I've got a bunch of compute for the next couple of weeks and am thinking of training this on LAION.

Wondering if there is any other training going on right now. Would hate to duplicate efforts too much.

lucidrains commented 1 year ago

@nbardy where do you have the compute from? you should join the LAION discord and check to see first

i will be finishing the unconditional training code this week for starters, and the entire training code by end of month

nbardy commented 1 year ago

512 TPUv4 from a google startup grant.

Didn't get any response in LAION when I asked. Looks like nothing going on yet.

lucidrains commented 1 year ago

ohh sweet, though you probably should do it in jax? or has the state of pytorch xla improved?

lucidrains commented 1 year ago

are you doing a startup? or working for a new one?

francqz31 commented 1 year ago

@nbardy I think you should just train it for the super-resolution upsampling task (128px to 4K), which is the highlight of the paper. GigaGAN's text-to-image is kinda meh, neither good nor impressive.

What's impressive and holds the current SOTA in text-to-image is this project: https://raphael-painter.github.io/. It even beats Midjourney v5.1, is competitive with v5.2, and has efficient finetuning. lucid might implement RAPHAEL and you might train it; that would be a far better idea than wasting all that compute on nothing.

lucidrains commented 1 year ago

@francqz31 oh nice, wasn't aware of raphael. there is no implementation yet?

lucidrains commented 1 year ago

@francqz31 i see, they just added a ton of mixture of experts. i have been meaning to open source ST-MoE on the language modeling front, so maybe this is good timing. also have a few ideas for improving PKM

francqz31 commented 1 year ago

@lucidrains Nope, there isn't. I asked one of the authors; he said something about releasing an API, but they will not open source it, that's 100% for sure. The downside of an API is that I don't think it will have fine-tuning. But yeah, overall they trained it on 1000 A100s for 2 months straight. If you implement it and nbardy trains it, it will be a huge leap for the open source community.

lucidrains commented 1 year ago

@francqz31 i haven't dived into the paper yet, but i think there's basically nothing to it besides adding MoE and some hand-wavy stuff about each expert being a 'painter'. i just need to do to mixture-of-experts what i did to attention, and both language and generative image / video models will naturally improve if one replaces the feedforwards with them
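
for a concrete picture, replacing a transformer feedforward with a mixture-of-experts layer looks roughly like this (a minimal top-1 routing sketch in the spirit of Switch / ST-MoE; simplified, all names hypothetical):

```python
# Minimal sketch: a drop-in MoE replacement for a transformer feedforward,
# with top-1 token routing. Load-balancing losses etc. are omitted.
import torch
from torch import nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim, num_experts = 8, mult = 4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias = False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * mult), nn.GELU(), nn.Linear(dim * mult, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq, dim) - route each token to its top-1 expert
        scores = self.gate(x).softmax(dim = -1)   # (b, n, experts)
        weight, index = scores.max(dim = -1)      # gate prob and expert id per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = index == i
            if mask.any():
                out[mask] = expert(x[mask])       # only tokens routed to expert i
        return out * weight.unsqueeze(-1)         # rescale by the gate probability
```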

lucidrains commented 1 year ago

@francqz31 it was on my plate anyways, since we now know GPT-4 uses mixture of experts

lucidrains commented 1 year ago

@francqz31 do correct me if i'm wrong about that paper. i will get around to reading it (too much in the queue)

francqz31 commented 1 year ago

@lucidrains My pleasure, I will indeed. I even took some of RAPHAEL's prompts and compared it with Midjourney v5.2; it is almost the same, if not better. But in the paper they compare with v5.1, on 57 prompts like these, in order:

  1. A cute little matte low poly isometric cherry blossom forest island, waterfalls, lighting, soft shadows, trending on Artstation, 3d render, monument valley, fez video game
  2. A shanty version of Tokyo, new rustic style, bold colors with all colors palette, video game, genshin, tribe, fantasy, overwatch.
  3. Cartoon characters, mini characters, figures, illustrations, flower fairy, green dress, brown hair, curly long hair, elf-like wings, many flowers and leaves, natural scenery, golden eyes, detailed light and shadow , a high degree of detail.
  4. Cartoon characters, mini characters, hand-made, illustrations, robot kids, color expressions, boy, short brown hair, curly hair, blue eyes, technological age, cyberpunk, big eyes, cute, mini, detailed light and shadow, high detail.

lucidrains commented 1 year ago

@francqz31 cool! yea, i guess this is yet another testament to using mixture-of-experts or conditional computation modules

nbardy commented 1 year ago

Definitely most interested in training the upscaler.

@lucidrains do you have an idea how much work is left for the upscaler code? Looking at the paper it seems pretty similar to the base unconditioned model with some tweaks.

although the paper is light on details about the upscaler

I’m still at the same startup, Facet.

I talked to the Google team and they said performance is very similar between PyTorch and JAX now.
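
For reference, running PyTorch on TPU goes through torch_xla, which looks roughly like this (a minimal sketch, untested on v4; the stand-in model and data here are hypothetical):

```python
# Rough sketch of a PyTorch training step on TPU via torch_xla.
import torch
from torch import nn
import torch_xla.core.xla_model as xm

device = xm.xla_device()                       # the TPU core as a torch device
model = nn.Linear(512, 512).to(device)         # stand-in for the real network
optimizer = torch.optim.Adam(model.parameters(), lr = 1e-4)

for _ in range(10):
    x = torch.randn(8, 512, device = device)   # stand-in batch
    loss = model(x).pow(2).mean()
    loss.backward()
    xm.optimizer_step(optimizer)               # all-reduce grads, step, sync the graph
    optimizer.zero_grad()
```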

nbardy commented 1 year ago

@francqz31 thanks for sharing. It's too much work to implement and train a new model architecture on a short timeline. RAPHAEL does look quite interesting, although expensive to run inference with, given the MoE.

I'm particularly interested in the OpenMUSE training going on.

francqz31 commented 1 year ago

@nbardy No problem, don't feel any pressure. Dr. Phil might just implement it and leave it for the open source community, if anyone else is interested. Hopefully someone will be.

francqz31 commented 1 year ago

It is more than enough that you are willing to train the upsampler. It is not easy work, plus it is the most important thing in the paper.

lucidrains commented 1 year ago

@nbardy i'll get to it soon, but like anything in open source, no promises on timeline

@francqz31 oh please, don't address me that way. got enough of that in med school

nbardy commented 1 year ago

Happy to jump in and help.

How up to date is the TODO list? You mentioned there is some work left on the unconditioned model code still.

lucidrains commented 1 year ago

@nbardy yea, the plan of attack was going to be to wire up hf accelerate for unconditional, following their example here, then move on to conditional, before finally tackling the upsampler modifications
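
a minimal sketch of that accelerate wiring, following the general pattern of their examples (the stand-in model and data here are hypothetical, and the real GAN loop would alternate generator and discriminator steps):

```python
# Minimal sketch of wiring up huggingface accelerate around a training loop.
import torch
from torch import nn
from torch.utils.data import DataLoader
from accelerate import Accelerator

accelerator = Accelerator()

model = nn.Linear(512, 512)                    # stand-in for the generator
optimizer = torch.optim.Adam(model.parameters(), lr = 2e-4)
dataloader = DataLoader(torch.randn(64, 512), batch_size = 8)

# accelerate handles device placement, DDP wrapping, and mixed precision
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for batch in dataloader:
    loss = model(batch).pow(2).mean()
    accelerator.backward(loss)                 # instead of loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```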

lucidrains commented 1 year ago

@nbardy are you planning on open sourcing the final model, or is this for commercial purposes for Facet?

francqz31 commented 1 year ago

> @francqz31 do correct me if i'm wrong about that paper. i will get around to reading it (too much in the queue)

Ok, here is a quick thing that I hacked together, since I've read the paper before.

To implement the RAPHAEL model described in this paper, here are the main steps they used:

1. Data collection and preprocessing. They collect a large-scale dataset of text prompt-image pairs. This paper uses LAION-5B of course, and some internal datasets. They preprocess the images and text by removing noise, resizing images, etc.
2. Model architecture. The model is based on a U-Net architecture with 16 transformer blocks. Each block contains: a self-attention layer, a cross-attention layer with the text prompt, a space-Mixture-of-Experts (space-MoE) layer, a time-Mixture-of-Experts (time-MoE) layer, and an edge-supervised learning module.
3. Space-MoE. The space-MoE layer uses experts to model the relationship between text tokens and image regions. A text gate network is used to assign text tokens to experts. A thresholding mechanism is used to determine the correspondence between text tokens and image regions. There are 6 space experts in each of the 16 transformer blocks.
4. Time-MoE. The time-MoE layer uses experts to handle different diffusion timesteps. A time gate network is used to assign timesteps to experts. There are 4 time experts.
5. Edge-supervised learning. Add an edge detection module to extract edges from the input image. Supervise the model using these edges and a focal loss. Pause edge learning after a certain timestep threshold.
6. Training. They use the AdamW optimizer with learning rate 1e-4. They train for 2 months on 1000 GPUs with a batch size of 2000 and 20000 warmup steps. They combine a denoising loss and an edge-supervised loss.

Optional: use LoRA, ControlNet, or SR-GAN for additional controls or higher resolution. *They use a private tailor-made SR-GAN model too I think, not a public one, but that could be replaced by the GigaGAN upsampler ;)
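
For a rough picture of the time-MoE idea from point 4 above, routing by diffusion timestep to one of a few expert feedforwards might look something like this (a minimal top-1 routing sketch, not RAPHAEL's actual code; all names hypothetical):

```python
# Hypothetical sketch of time-MoE: a gate network assigns each
# diffusion timestep to one of `num_experts` expert feedforwards.
import torch
from torch import nn

class TimeMoE(nn.Module):
    def __init__(self, dim, num_experts = 4):
        super().__init__()
        self.time_gate = nn.Linear(1, num_experts)   # gate on the timestep
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(num_experts)
        ])

    def forward(self, x, t):
        # x: (batch, n, dim) tokens, t: (batch,) normalized timesteps
        index = self.time_gate(t[:, None]).argmax(dim = -1)  # hard top-1 route
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = index == i
            if mask.any():
                out[mask] = expert(x[mask])          # only samples routed here
        return out
```

The space-MoE would be analogous, except the gate runs on the text tokens and the routing decides which expert handles which image regions.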

lucidrains commented 1 year ago

@francqz31 thanks for the rundown!

yea, there is nothing surprising then. mostly more attention (transformer blocks), and the experts-per-diffusion-timestep idea goes back to eDiff-I from Balaji et al.

the application of space and time MoE seems to be the main novelty, but that in itself is just porting over lessons from LLMs

nbardy commented 1 year ago

> @nbardy are you planning on open sourcing the final model, or is this for commercial purposes for Facet?

Got the all clear to open source the weights.

Might finetune on some proprietary data. But the base model trained on LAION we'd release.

lucidrains commented 1 year ago

@nbardy awesome! i will prioritize this! expect me to power through it this weekend

nbardy commented 1 year ago

🥳

lucidrains commented 1 year ago

didn't get to it this weekend :cry: caught up with some TTS work and Pride celebrations

going to work on it this morning!

lucidrains commented 1 year ago

@nbardy the upsampler is nothing more than a unet with some high resolution downsampling layers removed, should be straightforward!
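
concretely, the shape of that idea is an asymmetric unet with more upsampling stages than downsampling ones (a hedged toy sketch, not the repo's actual code; skip connections and all the GAN machinery omitted):

```python
# Sketch of an asymmetric unet upsampler: 2 downsamples but 4 upsamples,
# so a 64px input comes out at 256px. Skips / attention / styling omitted.
import torch
from torch import nn

class ToyUnetUpsampler(nn.Module):
    def __init__(self, dim = 64):
        super().__init__()
        self.downs = nn.ModuleList([               # 64 -> 32 -> 16
            nn.Conv2d(3, dim, 3, stride = 2, padding = 1),
            nn.Conv2d(dim, dim, 3, stride = 2, padding = 1),
        ])
        self.ups = nn.ModuleList([                 # 16 -> 32 -> 64 -> 128 -> 256
            nn.ConvTranspose2d(dim, dim, 4, stride = 2, padding = 1)
            for _ in range(4)
        ])
        self.to_rgb = nn.Conv2d(dim, 3, 1)

    def forward(self, lowres):
        x = lowres
        for down in self.downs:
            x = torch.relu(down(x))
        for up in self.ups:
            x = torch.relu(up(x))
        return self.to_rgb(x)                      # 4x the input resolution

out = ToyUnetUpsampler()(torch.randn(1, 3, 64, 64))   # (1, 3, 256, 256)
```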

lucidrains commented 1 year ago

ok, got the unet upsampler to a decent place, will move onwards to unconditional training tomorrow, and by week's end, conditional + unet upsampler training

nbardy commented 1 year ago

Exciting progress.

Trying to start some jobs this week and there are no TPUv4s actually available. We have the quota, but the LLM teams must be taking them all. Yet to see if we actually have compute :( or if it's a mirage.

Probably willing to pay to scale up a smaller version of this. It looks like the compute budget isn't too high for the upscaler.

lucidrains commented 1 year ago

Haha yeah, they are busy training Gemini I heard

No worries, take your time, as the training code isn't ready yet

nbardy commented 1 year ago

Alright, we've got some other preview chips now (I think their existence is under NDA right now). But it should be plenty for the upscaler training.

lucidrains commented 1 year ago

@nbardy nice! i'll get unconditional training wired up tomorrow morning and make sure the discriminator works, before moving on to the rest of the training code next Monday (some of my favorite electronic music artists are in town this weekend)

lucidrains commented 1 year ago

@nbardy always welcoming PRs, if you are in a hurry!

nbardy commented 1 year ago

I’ll be on a long weekend break. I can take a look at an upsampler training script next week

lucidrains commented 1 year ago

ok, let us reconvene on this Monday then

nbardy commented 1 year ago

Won’t be back until Wednesday actually

lucidrains commented 1 year ago

haha or Wednesday, whenever you are free

i'll take my time here then

nbardy commented 1 year ago

(image: compute cost table.) Useful compute cost notes from the paper. The text-conditioned model takes about two orders of magnitude more compute; super-res is much more reasonable.

Got a Ray cluster running tonight. Should have some time to look into a training script Friday.

nbardy commented 1 year ago

It's unclear to me from the paper how the ImageNet super-res model and the text-conditioned upsampler compare in quality. Will have to see if they have ablations there.

nbardy commented 1 year ago

> In addition, for more controlled comparison, we train our model on the ImageNet unconditional superresolution task and compare performance with the diffusion-based models, including SR3 [81] and LDM [79].

Looks like the smaller one was mostly for benchmarking.

I think text-conditioned upscaling would also be much more useful for running on other models' outputs in pipelines.

lucidrains commented 1 year ago

@nbardy hey! I'll circle back to this late next week unless you get to it first!

had to bring the doggo out of the city to a hotel near the airport since she is frightened by fireworks, so i didn't get around to the unconditional training code yet

https://github.com/lucidrains/gigagan-pytorch/assets/108653/cb135a28-25a9-4de7-aea3-05ec47a75b0b

also currently working on another project and cannot context switch without losing progress

francqz31 commented 1 year ago

That might be the cutest dog ever. Look at him lying on the bed, knowing he is a good boi hehe.

nbardy commented 1 year ago

Getting up to speed with the code today. Feel like I understood most of it.

Looks like text conditioning is about the only thing missing from the model architecture code.

Looking at text encoding in the paper, it goes through cross attention and the style network. I see the style network and text encoder are already there.

Looks like I can just add a cross attention layer with t_local here: https://github.com/nbardy/gigagan-pytorch/blob/main/gigagan_pytorch/unet_upsampler.py#L495

And hook up t_global to the style network.
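
Roughly what I have in mind (a sketch only; names hypothetical, and the attention module here is just a stand-in for the repo's own):

```python
# Sketch: cross attend image features to per-token text codes (t_local),
# and feed the pooled text code (t_global) into the style network input.
import torch
from torch import nn

class TextCrossAttention(nn.Module):
    def __init__(self, dim, text_dim, heads = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, kdim = text_dim,
                                          vdim = text_dim, batch_first = True)

    def forward(self, feats, t_local):
        # feats: (b, c, h, w) image features, t_local: (b, n, text_dim)
        b, c, h, w = feats.shape
        x = feats.flatten(2).transpose(1, 2)        # (b, h*w, c) queries
        out, _ = self.attn(x, t_local, t_local)     # attend to text tokens
        return (x + out).transpose(1, 2).reshape(b, c, h, w)

attn = TextCrossAttention(dim = 256, text_dim = 512)
feats = attn(torch.randn(2, 256, 32, 32), torch.randn(2, 77, 512))

# style network conditioning: concatenate t_global onto the latent z
z, t_global = torch.randn(2, 128), torch.randn(2, 512)
style_input = torch.cat((z, t_global), dim = -1)    # goes into the style MLP
```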

I started on a distributed training script today on my fork (pretty messy at the moment, maybe not worth taking a look at).

nbardy commented 1 year ago

Okay great, I'm seeing the Generator has cross attention.

Looks like there is a text-conditioned generator and an unconditioned upscaler, if I'm reading the code right.

lucidrains commented 1 year ago

@nbardy yup correct!

does the upscaler need text conditioning?

i'll get back to this mid-week. finally got over a big hurdle with another project

nbardy commented 1 year ago

The paper doesn't indicate which upscaler was used for which samples.

Unclear how much it matters. Could probably get good results with the unconditioned one as well. But given the results from some diffusion papers on scaling up the text encoder, I could imagine text conditioning stabilizing training a lot at scale.

Also, going from 64 -> 512 leaves a lot of room for artistic interpretation, so it's a nice feature to have. There's a lot of lost information to recover when scaling up from thumbnails.

lucidrains commented 1 year ago

oh yup, text conditioning would still make sense for low res upscaling, let me aim to get that out of the way Wednesday

nbardy commented 1 year ago

Going to try and get training code running tomorrow and try to get the unconditioned one converging.

Surprised how different the upscaler code looks from the generator. The generator also has some stuff like skip_layer_excite that looks nice.
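
As I understand skip-layer excitation from the lightweight-GAN paper, a low-res feature map produces per-channel gates that modulate a high-res feature map; a hedged sketch (names hypothetical):

```python
# Sketch of skip-layer excitation: low-res features gate high-res ones
# channel-wise, giving a cheap long-range skip connection.
import torch
from torch import nn

class SkipLayerExcite(nn.Module):
    def __init__(self, dim_low, dim_high):
        super().__init__()
        self.net = nn.Sequential(
            nn.AdaptiveAvgPool2d(4),               # pool low-res map to 4x4
            nn.Conv2d(dim_low, dim_high, 4),       # 4x4 -> 1x1, match channels
            nn.SiLU(),
            nn.Conv2d(dim_high, dim_high, 1),
            nn.Sigmoid(),                          # per-channel gate in (0, 1)
        )

    def forward(self, high, low):
        return high * self.net(low)                # broadcast (b, c, 1, 1) gate

high, low = torch.randn(1, 256, 64, 64), torch.randn(1, 64, 8, 8)
out = SkipLayerExcite(64, 256)(high, low)          # same shape as `high`
```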

Glad to hear your other project is wrapped up. What was it? Open source?

nbardy commented 1 year ago

Tried to add text conditioning to the upscaler this evening. Seems like it should just be cross attention, plus the global text code going into the style network.

https://github.com/lucidrains/gigagan-pytorch/pull/20/files#diff-43ea16d9f61a65661c24088011c2c775964911cacf11aa87d17ed789730777caR434 (I linked to the relevant lines)

What formatter do you use? Would be nice to set mine the same for this repo. Getting a lot of formatting changes in the diff.