lucidrains / lightweight-gan

Implementation of 'lightweight' GAN, proposed in ICLR 2021, in PyTorch. High-resolution image generation that can be trained within a day or two
MIT License

Curious RAM usage on first run versus checkpoint loads #67

Closed · RhynoTime closed this issue 3 years ago

RhynoTime commented 3 years ago

A 1024-size image model trains nicely at batch size 16. But using the same code to create a new model and training on new images gives me OOM errors all the way down to batch size 1. However, if I copy the successfully trained checkpoints from the first model over, I can train on the new images just fine at batch size 16 again. Then, using the exact same code but removing the checkpoints, training OOMs again.

It's as though, for some reason, the first run of a model takes 8-16x more memory, while loading from a checkpoint materially reduces the memory requirement thereafter.

lightweight_gan --data "C:\DataDirectory" --name "LightWeightFeb22" --results_dir "C:\Results" --models_dir "C:\Models" --image-size 1024 --num-train-steps 250000 --batch-size 16 --gradient-accumulate-every 1 --network-capacity 16 --amp --attn-res-layers [32,64,128,256,512,1024]
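To check whether the first run really does allocate more memory than a checkpoint-resumed run, one option is to log PyTorch's peak-allocation counters around a few training steps. A minimal sketch using the standard `torch.cuda` memory APIs (this is not part of the lightweight_gan CLI, and it only reports memory allocated by tensors on the current CUDA device):

```python
import torch

# Reset the peak-memory counter before the run you want to measure.
torch.cuda.reset_peak_memory_stats()

# ... run a few training steps here (e.g. via your own training loop) ...

# Peak memory allocated by tensors since the reset, in GiB.
peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
print(f"peak allocated: {peak_gib:.2f} GiB")
```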

Any thoughts on why this might be happening?

RhynoTime commented 3 years ago

TL;DR: Be aware that when experimenting with attention, once a model has been checkpointed, changing the attention settings will neither overwrite the saved model configuration nor throw an error.

I've poked around a bit in the model's .json config file and discovered what was likely happening: the first time you create a model with attention, those settings are stored. On subsequent runs against the same model, specifying a different amount of attention changes nothing, but it also doesn't raise an exception, so you might believe your model is running with your latest parameters when it is actually using the original ones.
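One way to confirm which settings a checkpointed model is actually using is to read them back from the saved config instead of trusting the command line. A minimal sketch, assuming the settings live in a `.config.json` under `models_dir/<name>/` (the filename and keys are an assumption and may differ by version, so point it at whatever JSON file you find in your models directory):

```python
import json
from pathlib import Path

# Path follows the --models_dir / --name layout from the command above;
# the '.config.json' filename is an assumption -- check your models directory.
config_path = Path(r"C:\Models") / "LightWeightFeb22" / ".config.json"

with open(config_path) as f:
    config = json.load(f)

# Print every stored setting so you can compare against the flags you passed,
# e.g. whether the stored attention layers match what you asked for on this run.
for key, value in config.items():
    print(f"{key}: {value}")
```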

In my earlier model I had just a single attention layer to start, and I was experimenting with adding more layers. Unbeknownst to me, the additional layers weren't actually doing anything or overwriting the old model; it was already locked to the initial attention settings. Only when I pointed to a fresh directory did the code with six attention layers take effect, and that's when it OOM'd.
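If you want new attention settings to actually take effect without hunting for a fresh directory each time, a simple guard is to compare the saved config against what you intend to pass and start fresh when they disagree. A rough sketch, with the same `.config.json` filename and `attn_res_layers` key assumptions as above (back up any checkpoints you care about first):

```python
import json
from pathlib import Path

model_dir = Path(r"C:\Models") / "LightWeightFeb22"
config_path = model_dir / ".config.json"        # assumed filename -- verify locally

# What you intend to pass via --attn-res-layers on this run.
requested_attn = [32, 64, 128, 256, 512, 1024]

if config_path.exists():
    saved = json.loads(config_path.read_text())
    if saved.get("attn_res_layers") != requested_attn:
        print("Saved config disagrees with the requested attention layers:")
        print("  saved:    ", saved.get("attn_res_layers"))
        print("  requested:", requested_attn)
        print("Point --models_dir/--name at a fresh location (or remove the old "
              "checkpoints and config) before retraining with the new settings.")
```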