lucidrains / gigagan-pytorch

Implementation of GigaGAN, new SOTA GAN out of Adobe. Culmination of nearly a decade of research into GANs
MIT License

Training plans? #17

Closed · nbardy closed 1 year ago

nbardy commented 1 year ago

I've got a bunch of compute for the next couple of weeks and am thinking of training this on LAION.

Wondering if there is any other training going on right now. Would hate to duplicate efforts too much.

nbardy commented 1 year ago

Thanks for all the great work.

I'm happy to take the distributed stuff from here. I was hoping to have a distributed run going today on the cluster, but only got a single chip running. I have a couple of different training scripts on my fork; one of them uses Ray and Accelerate.

Just got a webdataset script working with the upsampler on the TPU chip. Debugging webdataset pipe errors and setting up credentials was surprisingly painful.
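For reference, a minimal sketch of that kind of WebDataset pipeline, assuming LAION-style shards with a `jpg` image and `json` metadata per sample; the shard URL, transforms, and caption key here are illustrative, not the actual training config:

```python
import webdataset as wds
import torchvision.transforms as T
from torch.utils.data import DataLoader

# hypothetical shard location; "pipe:" shells out to a subprocess,
# which is exactly where webdataset pipe errors tend to surface
shards = "pipe:aws s3 cp s3://my-bucket/laion-{000000..000999}.tar -"

transform = T.Compose([T.RandomResizedCrop(256), T.ToTensor()])

dataset = (
    wds.WebDataset(shards, handler=wds.warn_and_continue)  # skip corrupt samples instead of crashing
    .shuffle(1000)
    .decode("pil")
    .to_tuple("jpg", "json")
    .map_tuple(transform, lambda meta: meta.get("caption", ""))
)

loader = DataLoader(dataset.batched(32), batch_size=None, num_workers=4)
```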

lucidrains commented 1 year ago

@nbardy yea no problem, i know how it is. things are never straightforward in software

@CerebralSeed pull requested the sampler script and validated that the upsampler works! that should unblock you for your work

i'm going to give accelerate integration (sans ray, since i'm not familiar with it) a try today

nbardy commented 1 year ago
[image]

Learning on the accelerated chips finally! Remarkably good results for only 40 steps in. The last time I trained a GAN was a very long time ago.

Losses look stable.

[image: loss curves]

Looking at the XLA docs, trying to figure out the best way to network this across TPUs. Might just drop Ray 🤔 I'm already checkpointing and tracking runs with W&B.

https://wandb.ai/nbardy-facet/gigagan/runs/zv9004dr?workspace=user-nbardy-facet
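For reference, a minimal sketch of that kind of W&B run tracking; the project and entity mirror the linked run, but the training step and metric names are stand-ins:

```python
import wandb

def train_step(step):
    # stand-in for an actual GAN update; returns dummy losses for illustration
    return 1.0 / (step + 1), 1.0 / (step + 2)

wandb.init(project="gigagan", entity="nbardy-facet")
for step in range(10_000):
    g_loss, d_loss = train_step(step)
    wandb.log({"generator_loss": g_loss, "discriminator_loss": d_loss}, step=step)
```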

nbardy commented 1 year ago

Got started on XMP today. It's getting stuck on step 1. Most likely more device errors.

nbardy commented 1 year ago

Accelerate was giving bad crashes. Probably incompatible.

nbardy commented 1 year ago

I will talk more with Google tomorrow. They will most likely be able to help me sort this out by end of day tomorrow.

lucidrains commented 1 year ago

@nbardy good to see some progress on your end!

for me, i was stuck on a bug in the base generator architecture, but finally got it working before bedtime

[image: sample-32]

i'm going to wire up accelerate this morning (this time for real lol) and try out that vision aided discriminator loss
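The vision-aided discriminator loss (Kumari et al., "Ensembling Off-the-shelf Models for GAN Training") attaches a small trainable head to a frozen pretrained vision backbone. A minimal sketch of that shape, with the backbone and feature dimension left as assumptions:

```python
import torch
import torch.nn as nn

class VisionAidedDiscriminator(nn.Module):
    """Frozen pretrained backbone (e.g. a CLIP vision tower) + trainable logit head."""
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad_(False)           # only the head receives gradients
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(images)     # assumed to return (batch, feat_dim)
        return self.head(feats)               # real/fake logits

# hypothetical usage: disc = VisionAidedDiscriminator(clip_vision_tower, feat_dim=512)
```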

nbardy commented 1 year ago

Training across 16 chips with XLA/XMP.

Logs (currently very slow because XLA is compiling the first steps and debug mode is on)
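For reference, a minimal sketch of the XLA/XMP multi-process training-loop shape; the model and hyperparameters are stand-ins, not the actual GigaGAN run:

```python
import torch
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_multiprocessing as xmp

def _mp_fn(index):
    device = xm.xla_device()                       # one TPU core per process
    model = torch.nn.Linear(512, 512).to(device)   # stand-in for the GAN
    opt = torch.optim.Adam(model.parameters(), lr=2e-4)
    for step in range(100):
        x = torch.randn(8, 512, device=device)
        loss = model(x).pow(2).mean()
        loss.backward()
        xm.optimizer_step(opt)                     # all-reduces grads across cores, then steps
        opt.zero_grad()
        xm.mark_step()                             # cut the graph; the first steps trigger XLA compilation

if __name__ == "__main__":
    xmp.spawn(_mp_fn, args=())
```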

nbardy commented 1 year ago

And they all crash at 30 minutes :(

lucidrains commented 1 year ago

> And they all crash at 30 minutes :(

haha yea, expected this to be not that mature

they are basically exchanging free compute for free QA

today was much smoother sailing for me; accelerate and mixed precision are working for multi-gpu on my one machine!
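A minimal sketch of that Accelerate setup, with the model as a stand-in; it would be launched across GPUs with `accelerate launch train.py`:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")   # bf16 is the other common choice

model = torch.nn.Linear(512, 512)                   # stand-in for generator/discriminator
opt = torch.optim.Adam(model.parameters(), lr=2e-4)
model, opt = accelerator.prepare(model, opt)        # wraps for DDP + autocast

for step in range(100):
    x = torch.randn(8, 512, device=accelerator.device)
    loss = model(x).pow(2).mean()
    accelerator.backward(loss)                      # handles gradient scaling under fp16
    opt.step()
    opt.zero_grad()
```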

randintgenr commented 1 year ago

Hi Phil,

I have been using your implementation and noticed that subpixel upsampling gives me lower generative performance.

It introduces checkerboard artifacts that negatively affect the quality of the generated images. To address this, I experimented with replacing the subpixel convolution with bilinear upsampling, and it yielded better results.

Also, the StyleGAN generator relies on maintaining unit variance in its feature activations for effective style mixing. It is unclear whether subpixel upsampling still produces unit-variance activations.
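For concreteness, a minimal sketch of the two upsampling paths in PyTorch; channel counts are illustrative, not the repo's actual blocks:

```python
import torch
import torch.nn as nn

dim = 64

# subpixel (pixel-shuffle) upsampling: conv to 4x channels, then rearrange to 2x spatial
subpixel_up = nn.Sequential(
    nn.Conv2d(dim, dim * 4, 3, padding=1),
    nn.PixelShuffle(2),
)

# bilinear upsample followed by a conv: no learned rearrangement of pixels,
# which avoids the checkerboard pattern from unevenly overlapping kernels
bilinear_up = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(dim, dim, 3, padding=1),
)

x = torch.randn(1, dim, 32, 32)
assert subpixel_up(x).shape == bilinear_up(x).shape == (1, dim, 64, 64)
```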

lucidrains commented 1 year ago

> I have been using your implementation and noticed that subpixel upsampling gives me lower generative performance. It introduces checkerboard artifacts [...] I experimented with replacing the subpixel convolution with bilinear upsampling, and it yielded better results.

hey yup! i was actually going to offer this as an option as i noticed the same

defaulted it to bilinear upsample for now, controllable with this option

lucidrains commented 1 year ago

@randintgenr are you a computer vision researcher?

lucidrains commented 1 year ago

almost done with the entire training code

lucidrains commented 1 year ago

ok, i think it is done, save for a few edge cases and cleanup

going to wind down work on this repo next week and move back to video gen

lucidrains commented 1 year ago

closing, as code is there, and I know of a group moving forward with training already

anandbhattad commented 1 year ago

Hey @lucidrains, have you heard anything about a timeline for the group that's currently training GigaGAN? I'd appreciate any information you have. Thank you!

lucidrains commented 1 year ago

@anandbhattad yea they have proceeded, but this group will not be open sourcing it

anandbhattad commented 1 year ago

@lucidrains, I appreciate your response. I was wondering if you knew the computing power necessary for training on the LAION-5B dataset. The paper lacks clear information on the compute and time requirements for training the model (Table A2 is ambiguous). As I only have academic compute access, I am interested in exploring whether GigaGAN utilizes familiar rendering elements such as normals and depth, as we demonstrated for StyleGAN-2. Here's the link for more information: https://arxiv.org/abs/2306.00987

CerebralSeed commented 1 year ago

@nbardy would greatly appreciate it if you could share what image size and other settings you used, if you get anything working at a size larger than 128px. TIA

davizca commented 11 months ago

@lucidrains I'm pretty sure that group is this one: https://magnific.ai/

Or at least it seems so. If I had the money and more than 24 GB of VRAM I would train this, but it's impossible for me, haha.

topological-modular-forms commented 10 months ago

@nbardy Hi Nicholas! Do you still plan to train this model on LAION, or have any updates regarding it?