osfa closed this issue 1 year ago.
Same here. 3x 3090, no joy.
One thought: the environment.yaml file specifies CUDA 11.0, but the 3090 requires 11.1 or newer (see https://forums.developer.nvidia.com/t/cuda-10-1-on-rtx-3090/185255/2).
I'm not sure what the consequences are with respect to the supported PyTorch version and the cascading set of dependency changes that follow, but we're going to try 11.1.
The forked master (presser/stable-diffusion) has 11.0, but strangely compvis/stable-diffusion has CUDA 11.3, and its YAML file was updated a month ago with different dependencies pinned. Just an observation, but this repo being at 11.0 is likely going to be an issue for newer CUDA hardware.
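If it helps anyone chasing version mismatches, here's a quick sanity check of which CUDA build your installed torch actually has (these are standard PyTorch calls, nothing repo-specific):

import torch

print(torch.__version__)                     # e.g. 1.12.1+cu113
print(torch.version.cuda)                    # CUDA version this torch build was compiled against
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))         # should list the RTX 3090
print(torch.cuda.get_device_capability(0))   # (8, 6) for Ampere, which needs CUDA >= 11.1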
Hmm, yeah, maybe related? I'm having these issues in an environment with 11.3 anyway.
I'm also unable to run this in an environment with 11.3.
I think the environment.yaml file is definitely outdated, as it mentions pytorch=1.7.0, which is incompatible with the dependencies inside requirements.txt (specifically, kornia==0.6 requires torch > 1.8).
@justinpinkney it would be great if you could specify the torch and torchvision versions in requirements.txt, as well as the CUDA toolkit version.
FWIW, I tried reducing the batch size to 1 and the training and validation image sizes to 128, but still got OOM errors, which makes me think something is fundamentally wrong. I'd be very curious how much memory you see consumed when you run on an A6000 with those params.
Also happy to run any tests needed to help get this running on a 3090. Any help is much appreciated. Thanks!
Yeah, the environment.yml is old and inherited from the base repo; I should get rid of that.
The versions I'm using are:
torch==1.12.1
torchvision==0.13.1
cuda 11.3.1
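For anyone reproducing this setup, those versions are typically installed from the cu113 wheel index, something along these lines (adjust for your environment):

pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 --extra-index-url https://download.pytorch.org/whl/cu113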
Thanks @justinpinkney.
I set up a fresh Docker container with the mentioned requirements via docker pull pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime, then installed deps from requirements.txt. I had to uninstall torchtext to get the script working, but it then failed with the following OOM error:
RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 23.70 GiB total capacity; 21.90 GiB already allocated; 13.62 MiB free; 22.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
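As an aside, the max_split_size_mb hint from that message can be set through an environment variable before launching; as far as I understand it only mitigates fragmentation rather than reducing total memory use, and the value below is just an example:

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128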
I modified image size and batch size as follows:
data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 0
    num_val_workers: 0
    train:
      target: ldm.data.simple.hf_dataset
      params:
        name: lambdalabs/pokemon-blip-captions
        image_transforms:
        - target: torchvision.transforms.Resize
          params:
            size: 64
            interpolation: 3
        - target: torchvision.transforms.RandomCrop
          params:
            size: 64
        - target: torchvision.transforms.RandomHorizontalFlip
    validation:
      target: ldm.data.simple.TextOnly
      params:
        captions:
        - A pokemon with green eyes, large wings, and a hat
        - A cute bunny rabbit
        - Yoda
        - An epic landscape photo of a mountain
        output_size: 128
        n_gpus: 1
But it still failed with the same error. Curious if you're able to replicate this. Any ideas on how to proceed?
@justinpinkney The full stack trace is:
EMA is happening on the GPU, which would explain the OOM issue. Based on my past experience, this is often done on the CPU to save GPU memory (example from timm, another example from HF).
I'm going to attempt moving it to the CPU and will post any noteworthy updates here.
Curious if you have any more insights regarding this. Cheers!
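To make concrete what I mean, here's a minimal sketch of an EMA helper that keeps its shadow copy on the CPU. The class and method names are mine (not the repo's LitEma), and the CPU/GPU copies do add some per-step overhead:

import torch

class CpuEma:
    """Keep an exponential moving average of model parameters on the CPU."""

    def __init__(self, model, decay=0.9999):
        self.decay = decay
        # the shadow copy lives on the CPU, so it costs no GPU memory
        self.shadow = {name: p.detach().float().cpu().clone()
                       for name, p in model.named_parameters() if p.requires_grad}

    @torch.no_grad()
    def update(self, model):
        # blend the current (GPU) weights into the CPU shadow copy
        for name, p in model.named_parameters():
            if name in self.shadow:
                self.shadow[name].mul_(self.decay).add_(
                    p.detach().float().cpu(), alpha=1.0 - self.decay)

    @torch.no_grad()
    def copy_to(self, model):
        # load the averaged weights back into the live model, e.g. for eval or checkpointing
        for name, p in model.named_parameters():
            if name in self.shadow:
                p.copy_(self.shadow[name].to(device=p.device, dtype=p.dtype))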
Hmmm, I think it used around 40 GB of VRAM when training with a batch size of 4.
Yeah, it looks like the EMA is adding some extra weight, which might be the issue.
You could also try reduced-precision training. The repo uses PyTorch Lightning, so it should be straightforward to try mixed precision. Couldn't comment on the stability of that, though...
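Assuming main.py still follows the CompVis pattern of forwarding the lightning.trainer section of the config straight to the Lightning Trainer (worth double-checking), enabling mixed precision should just be a config entry along these lines (untested sketch):

lightning:
  trainer:
    precision: 16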
I did try precision=16, but it failed with some hard-to-diagnose errors. Didn't bother trying to solve that.
40 GB of VRAM when training with a batch size of 4
I see. I think I can confidently say that setting up the model + EMA on the GPU alone takes > 24 GB of VRAM in its current state. Moving the EMA to the CPU would probably let you increase the batch size by a reasonable amount.
I modified the EMA as follows:
But now I fail later in the forward pass with an OOM error. I'm not sure if this can be run on 3090s without heavier modifications which I suspect would be a bit gnarly...
For anyone else trying to run this on 3090s, I was able to run huggingface's WIP script on a single 3090. Couldn't quite get the right accelerate config for multi-GPU yet.
https://github.com/huggingface/diffusers/pull/356
python train_text_to_image.py \
--dataset_name lambdalabs/pokemon-blip-captions \
--use_auth_token \
--mixed_precision fp16 \
--resolution 512 \
--gradient_accumulation_steps 4 \
--train_batch_size 1 \
--num_train_epochs 1 \
--output_dir sd-pokemon
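(One note: --use_auth_token assumes you're already authenticated with the Hugging Face Hub on that machine, i.e. you've run huggingface-cli login beforehand.)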
I haven't tried it yet, but this might be another avenue for the 3090 gang: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion
@rsomani95 To launch on multi-GPU you can use the following command:
accelerate launch --multi_gpu train_text_to_image.py \
--dataset_name lambdalabs/pokemon-blip-captions \
--use_auth_token \
--mixed_precision fp16 \
--resolution 512 \
--gradient_accumulation_steps 4 \
--train_batch_size 1 \
--num_train_epochs 1 \
--output_dir sd-pokemon
I have a machine with 6x 12 GB GPUs. Is it possible to launch fine-tuning on that kind of hardware configuration? It seems even a 24 GB 3090 doesn't work.
I just set use_ema=False. Then the default main.py can run, but the EMA model is lost, so the 3090 can train the model with a bit of VRAM to spare. I also have another idea: precompute the conditioning text embeddings to save GPU VRAM. Crying from a poor lab.
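For what it's worth, a rough sketch of that precomputation idea using the transformers CLIP text encoder (the model name matches what SD v1.x conditions on; the caching scheme is only illustrative, and the saved embeddings would still need to be wired into the training dataloader):

import torch
from transformers import CLIPTokenizer, CLIPTextModel

device = "cuda"
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()

@torch.no_grad()
def embed_captions(captions, max_length=77):
    tokens = tokenizer(captions, truncation=True, max_length=max_length,
                       padding="max_length", return_tensors="pt").to(device)
    # (batch, 77, 768) hidden states, the same tensor the UNet receives as conditioning
    return text_encoder(**tokens).last_hidden_state.cpu()

# cache embeddings once, then free the text encoder before training starts
embeddings = embed_captions(["A pokemon with green eyes, large wings, and a hat"])
torch.save(embeddings, "caption_embeddings.pt")
del text_encoder
torch.cuda.empty_cache()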
Sadly, the CLIP model has little effect on GPU memory.
@patil-suraj Is there any way to train on multiple GPUs without using accelerate launch?
For anyone else trying to run this on 3090s, I was able to run huggingface's WIP script on a single 3090. Couldn't quite get the right accelerate config for multi-GPU yet.
python train_text_to_image.py --dataset_name lambdalabs/pokemon-blip-captions --use_auth_token --mixed_precision fp16 --resolution 512 --gradient_accumulation_steps 4 --train_batch_size 1 --num_train_epochs 1 --output_dir sd-pokemon
@rsomani95 Have you found any way to train on multiple GPUs?
@StrugglingForBetter I did, yes. There's some GPU memory overhead, so it's not possible to train the full model. But if you freeze the params for part of the model, it fits in memory.
Using EMA on multi-GPU: I have a 3x 3090 setup. I hard-coded the EMA model to load on torch.device(2) and train on only the first two GPUs. In addition to the changes in the script, I set up my accelerate config to only use gpu_ids 0,1 for training.
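For a reference point, the relevant part of such an accelerate config looks roughly like this (exact keys vary between accelerate versions, so treat it as a sketch rather than a drop-in file):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 0,1
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false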
@StrugglingForBetter I did, yes. There's some GPU memory overhead, so it's not possible to train the full model. But if you freeze the params for part of the model, it fits in memory.
Using EMA on multi-GPU: I have a 3x 3090 setup. I hard-coded the EMA model to load on torch.device(2) and train on only the first two GPUs. In addition to the changes in the script, I set up my accelerate config to only use gpu_ids 0,1 for training.
Could you share the code for putting the EMA model on a different GPU?
@StrugglingForBetter I did, yes. There's some GPU memory overhead, so it's not possible to train the full model. But if you freeze the params for part of the model, it fits in memory.
Using EMA on multi-GPU: I have a 3x 3090 setup. I hard-coded the EMA model to load on torch.device(2) and train on only the first two GPUs. In addition to the changes in the script, I set up my accelerate config to only use gpu_ids 0,1 for training.
Thanks for your reply! Can you show us how to set up the accelerate config? Is there any way to train without an accelerate config, since I have never used accelerate when training on multiple GPUs before?
BTW, could you give me your email so that I can contact you more conveniently? Thanks!
But now I fail later in the forward pass with an OOM error. I'm not sure if this can be run on 3090s without heavier modifications which I suspect would be a bit gnarly...
I was able to run the code on the 3090 this way, but the results are far from the same. Were you able to get similar results for Pokémon?
For anyone else trying to run this on 3090s, I was able to run huggingface's WIP script on a single 3090. Couldn't quite get the right accelerate config for multi-GPU yet.
python train_text_to_image.py --dataset_name lambdalabs/pokemon-blip-captions --use_auth_token --mixed_precision fp16 --resolution 512 --gradient_accumulation_steps 4 --train_batch_size 1 --num_train_epochs 1 --output_dir sd-pokemon
How do you train on multiple GPUs?
Hi!
The repo and the finetuning tutorial mention >16 GB as the minimum amount of VRAM. Has this been verified somehow? I'm running this on an RTX 3090 (24 GB VRAM) and CUDA is still OOMing.
Given that, running this on Google Colab doesn't seem particularly feasible?
The tutorial mentions reducing batch_size, which I've done (by changing batch_size and num_workers to 1 in the pokemon.yaml config), but I still get CUDA OOM.
Are there any other params one can change to reduce memory use? Were batch_size and num_workers the correct parameters to tweak in the first place?
Thanks!