justinpinkney / stable-diffusion


VRAM issues #15

Closed osfa closed 1 year ago

osfa commented 2 years ago

Hi!

The repo and finetuning tutorial mention >16GB as the minimum amount of VRAM. Has this been verified somehow? I'm running this on an RTX 3090 (24 GB VRAM) and CUDA is still OOMing.

If that's the case, running this on Google Colab doesn't seem particularly feasible?

The tutorial mentions reducing batch_size, which I've done (by setting batch_size and num_workers to 1 in the pokemon.yaml config), but CUDA still OOMs.

Are there any other params one can change to reduce memory use? Were batch_size and num_workers the correct parameters to tweak in the first place?

Thanks!

vade commented 2 years ago

Same here. 3x 3090, no joy.

vade commented 2 years ago

One thought: the environment.yaml file specifies CUDA 11.0, but the 3090 requires 11.1 or better (see https://forums.developer.nvidia.com/t/cuda-10-1-on-rtx-3090/185255/2).

I'm not sure what the consequences are w.r.t. the supported PyTorch version and the cascading set of dependency changes that follow, but we're going to try 11.1.

The forked master (presser/stable-diffusion) has 11.0, but strangely compvis/stable-diffusion has CUDA 11.3, and that YAML file was updated a month ago and seems to have different dependencies pinned. Just an observation, but this repo pinning 11.0 is likely going to be an issue for newer CUDA hardware.
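
A quick sanity check (not something from the environment files, just standard PyTorch calls) to confirm which CUDA build the environment actually resolved to:

```python
# Print the installed torch build, the CUDA toolkit it was compiled against,
# and whether the 3090 is actually visible to PyTorch.
import torch

print(torch.__version__)          # e.g. 1.12.1
print(torch.version.cuda)         # CUDA version torch was built with, e.g. 11.3
print(torch.cuda.is_available())  # False usually means a CPU-only or mismatched build
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
```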

osfa commented 2 years ago

Hmm, yeah, maybe related? I'm having these issues in an environment with 11.3 anyway.

rsomani95 commented 2 years ago

I'm also unable to run this in an environment with 11.3. I think the environment.yaml file is definitely outdated: it pins pytorch=1.7.0, which is incompatible with the dependencies in requirements.txt (specifically, kornia==0.6 requires torch > 1.8).

@justinpinkney it would be great if you could specify the torch and torchvision versions in requirements.txt, as well as the CUDA toolkit version. FWIW, I tried reducing the batch size to 1 and the training and validation image sizes to 128, but still got OOM errors, which makes me think something is fundamentally wrong. I'd be very curious what memory consumption you see when you run on an A6000 with those params.

Also happy to run any tests needed to help get this running on a 3090. Any help is much appreciated. Thanks!

justinpinkney commented 2 years ago

Yeah, the environment.yml is old and inherited from the base repo; I should get rid of that.

The versions I'm using are:

torch==1.12.1
torchvision==0.13.1
cuda 11.3.1
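
For reference, one way to get those pins (a sketch assuming the cu113 wheel index, not the repo's documented install):

pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 \
    --extra-index-url https://download.pytorch.org/whl/cu113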

rsomani95 commented 2 years ago

Thanks @justinpinkney. I set up a fresh Docker container with the mentioned requirements via docker pull pytorch/pytorch:1.12.1-cuda11.3-cudnn8-runtime, then installed deps from requirements.txt. I had to uninstall torchtext to get the script working, but it then failed with the following OOM error:

RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 0; 23.70 GiB total capacity; 21.90 GiB already allocated; 13.62 MiB free; 22.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
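
As an aside, the hint in that message refers to the standard allocator setting shown below; it can ease fragmentation but won't rescue a genuine capacity shortfall like this one (the 128 is just an example value):

export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128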

I modified image size and batch size as follows:

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 1
    num_workers: 0
    num_val_workers: 0
    train:
      target: ldm.data.simple.hf_dataset
      params:
        name: lambdalabs/pokemon-blip-captions
        image_transforms:
        - target: torchvision.transforms.Resize
          params:
            size: 64
            interpolation: 3
        - target: torchvision.transforms.RandomCrop
          params:
            size: 64
        - target: torchvision.transforms.RandomHorizontalFlip
    validation:
      target: ldm.data.simple.TextOnly
      params:
        captions:
        - A pokemon with green eyes, large wings, and a hat
        - A cute bunny rabbit
        - Yoda
        - An epic landscape photo of a mountain
        output_size: 128
        n_gpus: 1

But it still failed with the same error. Curious if you're able to replicate this. Any ideas on how to proceed?

rsomani95 commented 2 years ago

@justinpinkney The full stack trace is:

```
Epoch 0:   0%| | 0/833 [00:00<00:00, 8943.08it/s]
/opt/conda/lib/python3.7/site-packages/torch/autograd/__init__.py:175: UserWarning: Grad strides do not match bucket view strides. This may indicate grad was not created according to the gradient layout contract, or that the param's strides changed since DDP was constructed. This is not an error, but may impair performance.
grad.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1280, 1280]
bucket_view.sizes() = [1280, 1280, 1, 1], strides() = [1280, 1, 1, 1] (Triggered internally at /opt/conda/conda-bld/pytorch_1659484809535/work/torch/csrc/distributed/c10d/reducer.cpp:312.)
  allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
Summoning checkpoint.
Traceback (most recent call last):
  File "main.py", line 893, in <module>
    raise err
  File "main.py", line 875, in <module>
    trainer.fit(model, data)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 150, in advance
    "on_train_batch_end", processed_batch_end_outputs, batch, self.iteration_count, self._dataloader_idx
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1217, in call_hook
    trainer_hook(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/callback_hook.py", line 189, in on_train_batch_end
    callback.on_train_batch_end(self, self.lightning_module, outputs, batch, batch_idx, dataloader_idx)
  File "/stable-diffusion/main.py", line 409, in on_train_batch_end
    self.log_img(pl_module, batch, batch_idx, split="train")
  File "/stable-diffusion/main.py", line 377, in log_img
    images = pl_module.log_images(batch, split=split, **self.log_images_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/stable-diffusion/ldm/models/diffusion/ddpm.py", line 1323, in log_images
    with ema_scope("Sampling"):
  File "/opt/conda/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/stable-diffusion/ldm/models/diffusion/ddpm.py", line 182, in ema_scope
    self.model_ema.store(self.model.parameters())
  File "/stable-diffusion/ldm/modules/ema.py", line 62, in store
    self.collected_params = [param.clone() for param in parameters]
  File "/stable-diffusion/ldm/modules/ema.py", line 62, in <listcomp>
    self.collected_params = [param.clone() for param in parameters]
RuntimeError: CUDA out of memory. Tried to allocate 58.00 MiB (GPU 1; 23.70 GiB total capacity; 21.90 GiB already allocated; 10.56 MiB free; 22.07 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

EMA is happening on the GPU, which would explain the OOM issue. Based on my past experience, this is often done on the CPU to save GPU memory (example from timm, another example from HF). I'm going to attempt making it happen on the CPU and will post noteworthy updates here.

Curious if you have any more insights regarding this. Cheers!

justinpinkney commented 2 years ago

Hmmm, I think it used around 40 GB of VRAM when training with a batch size of 4.

Yeah, looks like the EMA is adding some extra weight, which might be the issue.

You could also try reduced-precision training. The repo uses PyTorch Lightning, so it should be straightforward to try mixed-precision training. Couldn't comment on the stability of that though...
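
An untested sketch of that: since the Trainer kwargs come from the lightning.trainer block of the training config (CompVis-style main.py), mixed precision would presumably be requested roughly like this:

lightning:
  trainer:
    precision: 16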

rsomani95 commented 2 years ago

I did try precision=16 but failed with some hard-to-diagnose errors. Didn't bother trying to solve those.

40 GB of VRAM when training with a batch size of 4

I see. I think I can confidently say that setting up the model + EMA on the GPU alone takes > 24 GB of VRAM in its current state. Moving the EMA to the CPU would probably allow you to increase the batch size by a reasonable amount.

rsomani95 commented 2 years ago

Modified EMA as follows:

```python
import torch
from torch import nn


class LitEma(nn.Module):
    def __init__(self, model, decay=0.9999, use_num_upates=True, device="cpu"):
        super().__init__()
        if decay < 0.0 or decay > 1.0:
            raise ValueError('Decay must be between 0 and 1')

        self.device = device
        self.m_name2s_name = {}
        self.register_buffer('decay', torch.tensor(decay, dtype=torch.float32))
        self.register_buffer('num_updates', torch.tensor(0, dtype=torch.int) if use_num_upates
                             else torch.tensor(-1, dtype=torch.int))

        for name, p in model.named_parameters():
            if p.requires_grad:
                # remove as '.'-character is not allowed in buffers
                s_name = name.replace('.', '')
                self.m_name2s_name.update({name: s_name})
                self.register_buffer(s_name, p.clone().detach().data)
                # Move above buffer to device
                setattr(self, s_name, getattr(self, s_name).to(self.device))

        self.collected_params = []

    def forward(self, model):
        decay = self.decay

        if self.num_updates >= 0:
            self.num_updates += 1
            decay = min(self.decay, (1 + self.num_updates) / (10 + self.num_updates))

        one_minus_decay = 1.0 - decay

        with torch.no_grad():
            m_param = dict(model.named_parameters())
            shadow_params = dict(self.named_buffers())

            for key in m_param:
                if m_param[key].requires_grad:
                    sname = self.m_name2s_name[key]
                    shadow_params[sname] = shadow_params[sname].type_as(m_param[key]).to(self.device)
                    shadow_params[sname].sub_(one_minus_decay * (shadow_params[sname] - m_param[key]))
                else:
                    assert not key in self.m_name2s_name

    def copy_to(self, model):
        m_param = dict(model.named_parameters())
        shadow_params = dict(self.named_buffers())
        for key in m_param:
            if m_param[key].requires_grad:
                m_param[key].data.copy_(shadow_params[self.m_name2s_name[key]].data)
            else:
                assert not key in self.m_name2s_name

    def store(self, parameters):
        """
        Save the current parameters for restoring later.
        Args:
            parameters: Iterable of `torch.nn.Parameter`; the parameters to be
                temporarily stored.
        """
        # breakpoint()
        self.collected_params = [param.clone().to(self.device) for param in parameters]

    def restore(self, parameters):
        """
        Restore the parameters stored with the `store` method.
        Useful to validate the model with EMA parameters without affecting the
        original optimization process. Store the parameters before the `copy_to` method.
        After validation (or model saving), use this to restore the former parameters.
        Args:
            parameters: Iterable of `torch.nn.Parameter`; the parameters to be
                updated with the stored parameters.
        """
        for c_param, param in zip(self.collected_params, parameters):
            param.data.copy_(c_param.data)
```

But now I fail later in the forward pass with an OOM error. I'm not sure if this can be run on 3090s without heavier modifications which I suspect would be a bit gnarly...

rsomani95 commented 2 years ago

For anyone else trying to run this on 3090s, I was able to run huggingface's WIP script on a single 3090. Couldn't quite get the right accelerate config for multi-GPU yet.

https://github.com/huggingface/diffusers/pull/356

python train_text_to_image.py \
    --dataset_name lambdalabs/pokemon-blip-captions \
    --use_auth_token \
    --mixed_precision fp16 \
    --resolution 512 \
    --gradient_accumulation_steps 4 \
    --train_batch_size 1 \
    --num_train_epochs 1 \
    --output_dir sd-pokemon

osfa commented 2 years ago

Haven't tried it yet, but this might be another avenue for the 3090 gang: https://github.com/XavierXiao/Dreambooth-Stable-Diffusion

patil-suraj commented 2 years ago

@rsomani95

To launch on multi-gpu you can use the following command:

accelerate launch --multi_gpu train_text_to_image.py \
    --dataset_name lambdalabs/pokemon-blip-captions \
    --use_auth_token \
    --mixed_precision fp16 \
    --resolution 512 \
    --gradient_accumulation_steps 4 \
    --train_batch_size 1 \
    --num_train_epochs 1 \
    --output_dir sd-pokemon

eeyrw commented 2 years ago

I have a machine with 6x GPUs with 12 GB of VRAM each. Is it possible to launch finetuning on that kind of hardware configuration? It seems even a 24 GB 3090 doesn't work.

lioo717 commented 2 years ago

I just set use_ema=False. Then the default main.py can run, but the EMA model is lost. So the 3090 can train the model with a bit of VRAM to spare. I also have another idea: precalculate the text conditioning into embedding space to save GPU VRAM. Crying from a poor lab.
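
For anyone looking for where that flag lives: in the CompVis-style configs use_ema is an argument of the diffusion model, so the change would be roughly (a sketch, not verified against this repo's pokemon config):

model:
  params:
    use_ema: false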

lioo717 commented 2 years ago

I just set use_ema=False. Then the default main.py can run, but the EMA model is lost. So the 3090 can train the model with a bit of VRAM to spare. I also have another idea: precalculate the text conditioning into embedding space to save GPU VRAM. Crying from a poor lab.

Sadly, the CLIP model has little effect on GPU memory.
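
For completeness, the "precalculate the text conditioning" idea would look roughly like this; the HF CLIP classes are used here as a stand-in for the repo's frozen text encoder, and the captions and output path are illustrative (as noted above, it doesn't buy much VRAM in practice):

```python
# Sketch of precomputing text-conditioning embeddings once, so the text encoder
# can be dropped from the training graph afterwards.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14").eval()

captions = ["A pokemon with green eyes, large wings, and a hat", "A cute bunny rabbit"]
with torch.no_grad():
    tokens = tokenizer(captions, padding="max_length", max_length=77,
                       truncation=True, return_tensors="pt")
    cond = text_encoder(input_ids=tokens.input_ids).last_hidden_state  # (B, 77, 768)
torch.save(cond, "cached_text_embeddings.pt")
```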

StrugglingForBetter commented 2 years ago

@patil-suraj Is there any way to train on multiple GPUs without using accelerate launch?

StrugglingForBetter commented 2 years ago

For anyone else trying to run this on 3090s, I was able to run huggingface's WIP script on a single 3090. Couldn't quite get the right accelerate config for multi-GPU yet.

huggingface/diffusers#356

python train_text_to_image.py \
    --dataset_name lambdalabs/pokemon-blip-captions \
    --use_auth_token \
    --mixed_precision fp16 \
    --resolution 512 \
    --gradient_accumulation_steps 4 \
    --train_batch_size 1 \
    --num_train_epochs 1 \
    --output_dir sd-pokemon

@rsomani95 Have you found any way to train on multiple GPUs?

rsomani95 commented 2 years ago

@StrugglingForBetter I did, yes. There's some GPU memory overhead, so it's not possible to train the full model. But if you freeze the params for part of the model, it fits in memory.
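
The freezing itself is just the standard requires_grad switch; a minimal sketch (the module names are placeholders, not the script's actual variables):

```python
# Freeze part of the model so those weights need no gradients or optimizer state;
# `text_encoder`, `vae`, `unet` below are placeholders for whatever you choose to freeze.
import torch
from torch import nn

def freeze(module: nn.Module) -> None:
    for p in module.parameters():
        p.requires_grad_(False)

# e.g. freeze(text_encoder); freeze(vae)
# then build the optimizer only over the parameters that still require grad:
# optimizer = torch.optim.AdamW((p for p in unet.parameters() if p.requires_grad), lr=1e-5)
```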


Using EMA on multi-GPU

I have a 3x 3090 setup, and I hard-coded the EMA model to load on torch.device(2) and train on only the first two GPUs. In addition to the changes in the script, I set up my accelerate config to only use gpu_ids 0,1 for training.
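
Roughly what such an accelerate config might look like (a sketch; exact field names vary between accelerate versions, so treat it as a guide rather than a drop-in file):

compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 0,1
machine_rank: 0
main_training_function: main
mixed_precision: fp16
num_machines: 1
num_processes: 2
use_cpu: false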

lioo717 commented 2 years ago

@StrugglingForBetter I did, yes. There's some GPU memory overhead, so it's not possible to train the full model. But if you freeze the params for part of the model, it fits in memory.

Using EMA on multi-GPU

I have a 3x 3090 setup, and I hard-coded the EMA model to load on torch.device(2) and train on only the first two GPUs. In addition to the changes in the script, I set up my accelerate config to only use gpu_ids 0,1 for training.

Could you share the code for putting the EMA model on a different GPU?

StrugglingForBetter commented 2 years ago

@StrugglingForBetter I did, yes. There's some GPU memory overhead, so it's not possible to train the full model. But if you freeze the params for part of the model, it fits in memory.

Using EMA on multi-GPU

I have a 3x 3090 setup, and I hard-coded the EMA model to load on torch.device(2) and train on only the first two GPUs. In addition to the changes in the script, I set up my accelerate config to only use gpu_ids 0,1 for training.

Thanks for your reply! Can you show us how to set up the accelerate config? Is there any way to train without an accelerate config, since I have never used accelerate when training on multiple GPUs before?

BTW, can you give me your email so that I can contact you more conveniently? Thx!

AIXiaoBaiDemon commented 2 years ago

But now I fail later in the forward pass with an OOM error. I'm not sure if this can be run on 3090s without heavier modifications which I suspect would be a bit gnarly...

I was able to run the code on the 3090 this way, but the results are far from the same. Can you get similar results for Pokémon?

StrugglingForBetter commented 1 year ago

For anyone else trying to run this on 3090s, I was able to run huggingface's WIP script on a single 3090. Couldn't quite get the right accelerate config for multi-GPU yet.

huggingface/diffusers#356

python train_text_to_image.py \
    --dataset_name lambdalabs/pokemon-blip-captions \
    --use_auth_token \
    --mixed_precision fp16 \
    --resolution 512 \
    --gradient_accumulation_steps 4 \
    --train_batch_size 1 \
    --num_train_epochs 1 \
    --output_dir sd-pokemon

How do you train on multiple GPUs?