lucidrains / lightweight-gan

Implementation of 'lightweight' GAN, proposed in ICLR 2021, in PyTorch. High resolution image generation that can be trained within a day or two
MIT License

Should training be this slow? #89

Open jaymefosa opened 3 years ago

jaymefosa commented 3 years ago

[screenshot attached]

CPU: i7-9750H, GPU: RTX 2070

Default run settings, image size 256. Checked it on another system with a 2080 Ti and had similarly long times. Is there an issue somewhere?
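For reference, I'm launching it with essentially the stock command from the README, something like `lightweight_gan --data ./path/to/images --image-size 256` (the path is a placeholder and the flags are from memory, so treat them as approximate).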

jonas-klesen commented 3 years ago

Looks about right 👀 from my experience

jaymefosa commented 3 years ago

Oh hm... the official implementation (from the paper) does about 10,000 iterations/hr (10x faster) on a single GPU.

furkandurmus commented 3 years ago

Same here. I am training on an RTX 3090 at image size 512, batch size 32, but it has been 5 days and training still hasn't finished 100k iterations.

kurotesuta commented 2 years ago

I'm on a GTX 1660 Ti, getting approximately 4.20 s/it.

Example: 650 training steps after 45:28 minutes.

Gorialis commented 2 years ago

Running 0.20.5 on PyTorch 1.10.1 for CUDA 11.3 on Linux with an RTX 3090 gives me about 1.27 it/s (roughly 0.8 s/it) when training at image size 512 on a source set of 2,500 images, which comes out to an estimated 32 hours or so for the default 150,000 iterations.
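If you want to sanity-check your own speed, the conversion to wall-clock time is just arithmetic; a quick sketch using the figures quoted in this thread:

```python
# ETA arithmetic using figures quoted in this thread.
total_steps = 150_000        # default number of training iterations

its_per_sec = 1.27           # my RTX 3090 run at image size 512
print(total_steps / its_per_sec / 3600)    # ~32.8 hours

secs_per_it = 4.20           # the GTX 1660 Ti figure earlier in the thread
print(total_steps * secs_per_it / 3600)    # ~175 hours
```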

OP's numbers look OK to me for the hardware. I imagine the 'single GPU' figures in the README are probably for professional cards like Quadros or similar.

One thing to note is that this implementation has to load and process the images frequently during training. If your images are heavily compressed, much larger than your training image size, stored in an unusual color space (like a palette or HDR mode), or sitting on a slow disk, that may be hurting performance.

In my case, all of my images are preprocessed to 512x512 RGB PNGs with low compression, and they're stored on an M.2 SSD, so they should be quick to read and decode for training batches. If your numbers are unusually slow, check whether there's an issue with your training data that's leaving your GPU waiting.
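If anyone wants to replicate that preprocessing, here's a rough sketch with Pillow (the folder paths and target size are placeholders, not anything the repo requires):

```python
# Rough preprocessing sketch (not part of the repo): resize everything to the
# training size and re-save as RGB PNG with low compression so the dataloader
# spends less time decoding and converting images during training.
from pathlib import Path
from PIL import Image

SRC = Path("./raw_images")   # wherever your originals live (placeholder)
DST = Path("./train_512")    # folder you point the training data flag at (placeholder)
SIZE = 512                   # match your training image size

DST.mkdir(parents=True, exist_ok=True)

for i, src in enumerate(sorted(SRC.iterdir())):
    try:
        img = Image.open(src).convert("RGB")   # drops palette / alpha / HDR-ish modes
    except Exception:
        continue                               # skip non-image files
    img = img.resize((SIZE, SIZE), Image.LANCZOS)      # or center-crop first to keep aspect ratio
    img.save(DST / f"{i:06d}.png", compress_level=1)   # low compression = faster decode
```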