openai / guided-diffusion


What is the minimum amount of VRAM needed to train 512 or 256 model? #32

Closed Penguin-jpg closed 2 years ago

Penguin-jpg commented 2 years ago

I used a Tesla T4 on Google Colab with batch size 1 but still get a CUDA out-of-memory error. Is 16GB of VRAM not enough to train the 512 model? (I also tried the 256 uncond model with batch size 1 and still ran out of memory.)

These are the flags I used:

MODEL_512_FLAGS = "--attention_resolutions 32,16,8 --class_cond False --image_size 512 --learn_sigma True --num_channels 256 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --resume_checkpoint /content/models/512x512_diffusion_uncond_finetune_008100.pt"
MODEL_256_FLAGS = "--attention_resolutions 32,16,8 --class_cond False --image_size 256 --learn_sigma True --num_channels 256 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --resume_checkpoint /content/drive/MyDrive/models/256x256_diffusion_uncond.pt"
DIFFUSION_FLAGS = "--diffusion_steps 4000 --noise_schedule linear"
TRAIN_FLAGS = "--lr 1e-4 --batch_size 1 --save_interval 10000 --log_interval 1000"

script:

!python scripts/image_train.py --data_dir /content/drive/MyDrive/datasets/animals/animals_10/images $MODEL_512_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS
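
For reference, two options in this repo look directly relevant to memory: --use_checkpoint True enables gradient checkpointing inside the UNet (trading extra compute for much smaller activation memory), and image_train.py has a --microbatch flag that splits each batch into smaller forward/backward chunks (only useful when batch_size > 1). An untested lower-memory variant of the 512 flags would look roughly like this:

MODEL_512_LOWMEM_FLAGS = "--attention_resolutions 32,16,8 --class_cond False --image_size 512 --learn_sigma True --num_channels 256 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True --use_checkpoint True --resume_checkpoint /content/models/512x512_diffusion_uncond_finetune_008100.pt"

!python scripts/image_train.py --data_dir /content/drive/MyDrive/datasets/animals/animals_10/images $MODEL_512_LOWMEM_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

Whether that is enough to fit the 512 model on a 16GB T4 I cannot say; the weight and optimizer state for the full-size model alone is several GB.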
RabJon commented 10 months ago

@Penguin-jpg did you find an answer to your question?

I have a similar issue: on a single GPU with 12GB of VRAM I always get torch.cuda.OutOfMemoryError: CUDA out of memory, even if I set the batch size to 1.

Penguin-jpg commented 10 months ago

> @Penguin-jpg did you find an answer to your question?
>
> I have a similar issue: on a single GPU with 12GB of VRAM I always get torch.cuda.OutOfMemoryError: CUDA out of memory, even if I set the batch size to 1.

Hello, I think the problem is that 12GB is really not enough, so you might need a GPU with more VRAM.
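
Some rough numbers to back that up (and someone correct me if I am misreading the fp16 training code): the trainer keeps fp16 model weights plus an fp32 master copy plus Adam's two fp32 moment buffers, i.e. roughly 2 + 4 + 4 + 4 = 14 bytes per parameter before gradients and activations. The released 256/512 models have on the order of half a billion parameters, so that is already ~7GB of weight and optimizer state, and the activations of a 256x256 or 512x512 UNet then push a 12GB card over the limit even at batch size 1.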

RabJon commented 10 months ago

Thanks for your fast reply @Penguin-jpg. I think you are right for the default configuration, but I got lucky and changed the --num_channels parameter from 256 to 128. Now I can train on my 128x128 images without running into memory issues.
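
For anyone who wants to reproduce this, a flag set along those lines would look roughly like the following (illustrative, not my exact command; note that changing --num_channels means the released 256-channel checkpoints no longer match the architecture, so there is no --resume_checkpoint and training starts from scratch; DIFFUSION_FLAGS as defined earlier in the thread):

MODEL_128_FLAGS = "--attention_resolutions 32,16,8 --class_cond False --image_size 128 --learn_sigma True --num_channels 128 --num_res_blocks 2 --resblock_updown True --use_fp16 True --use_scale_shift_norm True"
TRAIN_FLAGS = "--lr 1e-4 --batch_size 1 --save_interval 10000 --log_interval 1000"

python scripts/image_train.py --data_dir /path/to/your/128px/images $MODEL_128_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS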

sushilkhadkaanon commented 7 months ago

@RabJon @Penguin-jpg did you guys resolve the issue? I have 4 T4 GPUs (16 GB each), but I'm getting a CUDA out-of-memory error even with batch_size = 1.

RabJon commented 7 months ago

> Thanks for your fast reply @Penguin-jpg. I think you are right for the default configuration, but I got lucky and changed the --num_channels parameter from 256 to 128. Now I can train on my 128x128 images without running into memory issues.

@sushilkhadkaanon as I wrote in the comment above, I was able to solve my problem for 128x128 px images by reducing --num_channels. However, I am not sure that helps for bigger images.
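
Also, in case the 4x T4 setup raises hopes: the distributed training in this repo (the mpiexec launch from the README) is data-parallel, so every GPU holds a complete copy of the model, gradients, and optimizer state, and as far as I can tell --batch_size is per process. Four 16GB cards therefore do not behave like one 64GB card; per-GPU memory stays the bottleneck. Something like the following spreads the batch across GPUs but does not reduce memory per GPU (flag variables as defined earlier, path is a placeholder):

mpiexec -n 4 python scripts/image_train.py --data_dir /path/to/images $MODEL_512_FLAGS $DIFFUSION_FLAGS $TRAIN_FLAGS

The options that actually reduce per-GPU memory are the ones mentioned above: --use_checkpoint True, a smaller --num_channels, or --microbatch when the per-process batch size is larger than 1.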