googlecolab / colabtools

Python libraries for Google Colaboratory

Issue when training Stable Diffusion model using diffusers-based training script with accelerate library #3348

Closed Linaqruf closed 1 year ago

Linaqruf commented 1 year ago

Describe the current behavior

I can't train a Stable Diffusion model with my diffusers-based training script since the Ubuntu update this morning:

steps:   0% 0/5000 [00:00<?, ?it/s]epoch 1/3
Traceback (most recent call last):
  File "/usr/local/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/accelerate_cli.py", line 45, in main
    args.func(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 1104, in launch_command
    simple_launcher(args)
  File "/usr/local/lib/python3.8/dist-packages/accelerate/commands/launch.py", line 567, in simple_launcher
    raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/usr/bin/python3', '/content/kohya-trainer/fine_tune.py', '--pretrained_model_name_or_path=/content/pre_trained_model/Anything-v3-better-vae.ckpt', '--train_data_dir=/content/fine_tune/train_data', '--in_json=/content/fine_tune/meta_lat.json', '--output_dir=/content/fine_tune/output', '--output_name=hito_komoru', '--mixed_precision=fp16', '--save_precision=float', '--save_state', '--save_model_as=ckpt', '--resolution=512', '--train_batch_size=1', '--max_token_length=225', '--use_8bit_adam', '--shuffle_caption', '--xformers', '--learning_rate=2e-06', '--lr_scheduler=constant', '--lr_warmup_steps=0', '--dataset_repeats=1', '--max_train_steps=5000', '--seed=1123', '--gradient_accumulation_steps=1', '--logging_dir=/content/fine_tune/logs', '--log_prefix=hito_komoru']' died with <Signals.SIGKILL: 9>.
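
A SIGKILL from the launcher like this usually means the kernel's OOM killer ended the training process because system RAM ran out. One way to check (a minimal sketch, not part of the original notebook; it only assumes psutil, which Colab pre-installs) is to log system RAM from a background thread while the training command runs:

  import threading, time
  import psutil  # pre-installed on Colab

  def log_ram(stop_event, interval=5):
      # Print overall system RAM usage until stop_event is set.
      while not stop_event.is_set():
          vm = psutil.virtual_memory()
          print(f"RAM: {vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB ({vm.percent}%)")
          time.sleep(interval)

  stop = threading.Event()
  threading.Thread(target=log_ram, args=(stop,), daemon=True).start()
  # ...run the accelerate launch cell, then call stop.set()

If the last few readings before the crash show RAM near 100%, the SIGKILL is an out-of-memory kill rather than a problem in the training script itself.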

I have already tried everything I could think of, like downgrading dependencies and torch, building new xformers, and rewriting the training cell, but it still fails and reproduces the same error.

My requirements.txt:

accelerate==0.15.0
transformers==4.25.1
ftfy
albumentations
opencv-python
einops
diffusers[torch]==0.10.2
pytorch_lightning
bitsandbytes==0.35.0
tensorboard
safetensors==0.2.6
# for BLIP captioning
requests
timm==0.4.12
fairscale==0.4.4
# for WD14 captioning
tensorflow<2.11
huggingface-hub
# for kohya_ss library
.

And the other dependencies I install in the notebook:

  !pip -qqqq install --upgrade -r requirements.txt
  !pip -qqqq install --upgrade gallery-dl
  !pip -qqqq install --upgrade --no-cache-dir gdown
  !apt -qqqq install liblz4-tool aria2

  if Install_xformers:
    !pip -qqqq install -U -I --no-deps https://github.com/camenduru/stable-diffusion-webui-colab/releases/download/0.0.15/xformers-0.0.15.dev0+189828c.d20221207-cp38-cp38-linux_x86_64.whl
  else:
    pass

Describe the expected behavior

Training runs well and I can see the training steps counting up. I have never faced this error before, so I think it is caused by the Ubuntu update this morning. I can also train without any error if I use the command palette's "Use fallback runtime version" option.

Also, why was the Python version downgraded to 3.8.10? I remember it was 3.8.16 before the Ubuntu update.
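
For reference, the training script runs under /usr/bin/python3 (as shown in the traceback), so both it and the notebook kernel can be checked from a cell (a trivial sketch, nothing notebook-specific assumed):

  import sys
  print(sys.version)   # Python used by the notebook kernel
  !python3 --version   # /usr/bin/python3, the interpreter that accelerate launches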

What web browser you are using: Google Chrome

Additional context

metrizable commented 1 year ago

@Linaqruf Thanks for using Colab. At first pass, I see that one of your steps may install a pre-compiled wheel from a release in https://github.com/camenduru/stable-diffusion-webui-colab/releases (depending on Install_xformers, which the notebook you shared has set to True). It may be worth reaching out to the upstream maintainer to understand whether their wheel is compatible with the current version of Colab.

Fannovel16 commented 1 year ago

I just tested that notebook. The wheel is still compatible with the current version of Colab, so I don't think it is the problem here. I found that RAM usage goes unusually high when the program loads the Stable Diffusion checkpoint, even though it is only 2 GB in size. This bug also only appears when I use the current Colab runtime version.
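
One way to quantify this (a minimal sketch, reusing the checkpoint path from the command in the OP; psutil is pre-installed) is to compare the process RSS before and after torch.load of the checkpoint, since the whole state dict is staged in host RAM before anything reaches the GPU:

  import os
  import psutil, torch

  proc = psutil.Process(os.getpid())
  rss_before = proc.memory_info().rss
  state = torch.load(
      "/content/pre_trained_model/Anything-v3-better-vae.ckpt",
      map_location="cpu",
  )
  rss_after = proc.memory_info().rss
  print(f"RSS grew by {(rss_after - rss_before) / 1e9:.1f} GB loading a ~2 GB checkpoint")
  del state

On a healthy runtime the growth should be roughly in the ballpark of the checkpoint size; if it is several times larger, that matches the behavior described here.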

TheLastBen commented 1 year ago

@metrizable Over the last few days, RAM usage has been inexplicably high in Colab, especially when using Stable Diffusion. What changed?

geocine commented 1 year ago

> @metrizable Over the last few days, RAM usage has been inexplicably high in Colab, especially when using Stable Diffusion. What changed?

I agree with this. RAM usage gets so high that it is never freed up; I noticed this especially during the inference stage in my Colabs.

metrizable commented 1 year ago

@Linaqruf We've recently released a change to how we use tcmalloc in the latest Colab runtime. I've added content to: https://github.com/googlecolab/colabtools/issues/3363#issuecomment-1421405493
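
For anyone checking whether their runtime is affected, here is a minimal sketch (plain Linux introspection, not an official Colab diagnostic) to see whether tcmalloc is preloaded into the notebook's Python process:

  import os

  print("LD_PRELOAD =", os.environ.get("LD_PRELOAD"))
  with open(f"/proc/{os.getpid()}/maps") as f:
      mapped = f.read()
  print("tcmalloc mapped into this process:", "tcmalloc" in mapped)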

Daviljoe193 commented 1 year ago

It did something, I guess...

It's still not working the way it used to, though, as there's a large amount of system RAM that it could be freeing but isn't, which wasn't an issue before the 20.04 rollout. This was in one of the Stable Diffusion WebUI notebooks, though the exact same thing happens in @aadnk's notebook for OpenAI Whisper.

metrizable commented 1 year ago

@Daviljoe193 Are you able to provide a minimal repro notebook for the behavior from your comment?

@Linaqruf I just re-ran the notebook from your OP ("kohya-trainer.ipynb") successfully. However, I did make one modification and ran the non-optional cells. Below is a transcript of the activity:

1. Execute cell 1.1.

2. Update cell 1.2 to specify the latest xformers, and execute it.

There is a newer xformers pre-compiled wheel for Colab T4, released after the Ubuntu 20.04 upgrade. Use this release to eliminate a possible contributing factor.

3. Execute non-optional cells 1.3.1 (specifying Huggingface token), 2.1, 3.1 and 3.2.

4. Execute non-optional cells 4.3, 4.4.1, 4.4.2, 4.5, and 4.6.

5. Execute cells 5.1 and 5.2.

6. Execute cell 6.1.

I did not see the errors reported in the OP. Also, throughout execution, I observed that system RAM usage stayed below GPU RAM usage, except briefly in the final inference step (cell 6.1 above).

With the somewhat lengthy notebook provided, and the various pre-compiled wheels, models, and scripts that it downloads and executes, there may be some confounding factors involved. If it's possible to provide a minimal repro notebook, it may help in isolating and root-causing where you are experiencing issues.
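
For reproducing the system RAM vs GPU RAM comparison above, a one-shot snapshot like the following (a sketch assuming torch and psutil, both pre-installed on Colab) can be dropped between cells:

  import psutil, torch

  vm = psutil.virtual_memory()
  print(f"System RAM in use: {(vm.total - vm.available) / 1e9:.1f} of {vm.total / 1e9:.1f} GB")
  if torch.cuda.is_available():
      print(f"GPU memory allocated: {torch.cuda.memory_allocated() / 1e9:.1f} GB")
      print(f"GPU memory reserved:  {torch.cuda.memory_reserved() / 1e9:.1f} GB")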

Daviljoe193 commented 1 year ago

@metrizable For Stable Diffusion's webui, I used a slightly modified version of @Camenduru's 2.1 notebook that uses the upstream webui rather than his fork and doesn't apply his workaround, since this is much closer to what I used before the 20.04 rollout and the workaround complicates troubleshooting.

As for OpenAI Whisper, I used this notebook. Both notebooks exhibit the same behavior. Especially with the Whisper notebook (it's doable with the SD notebook, but you need more than one model, which Camenduru's notebooks aren't intended for), if you use the Large-V2 model for one transcription, then Large-V1, then Large-V2 again, you'll eventually drain all of the system RAM. The way both of these used to work is that they would load the model, which would take a lot of system RAM, but after it was loaded onto the GPU, that system RAM would be almost entirely freed.
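
The pattern being described here, loading weights on the CPU, moving them to the GPU, then releasing the host copy, looks roughly like this (a small self-contained sketch, not taken from either notebook, assuming a GPU runtime; whether the freed pages actually return to the OS is up to the allocator, which is what this thread is about):

  import gc
  import torch

  # Stand-in for a real checkpoint: a CPU copy of some model weights.
  model = torch.nn.Linear(1024, 1024)
  cpu_state = {k: v.clone() for k, v in model.state_dict().items()}

  model.load_state_dict(cpu_state)
  model.to("cuda")   # the weights now live in VRAM

  del cpu_state      # drop the host-RAM copy of the weights
  gc.collect()       # frees the Python objects; the allocator decides when to
                     # hand the pages back to the OS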

Linaqruf commented 1 year ago

Thank you for your reply, sir.

Just want to report that I am unable to reproduce the bug now, which is strange. I am now able to load the full EMA model without any crashes, whereas previously I was unable to load even a 2 GB model.

For the past two weeks, I have been loading the model into VRAM instead of CPU RAM, as training would always crash when loading the model into system RAM. I have made several changes to the notebook and to the backend script from kohya-ss/sd-scripts, so it's possible something has changed since then. I believe the issue is not with training but with loading the Stable Diffusion model, because training uses the GPU rather than the CPU.
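
For reference, loading straight into VRAM instead of staging in host RAM is just a map_location change at load time (a sketch using the checkpoint path from the OP; the actual kohya-ss scripts may handle this differently):

  import torch

  state_dict = torch.load(
      "/content/pre_trained_model/Anything-v3-better-vae.ckpt",
      map_location="cuda",  # tensors are placed on the GPU as they are deserialized
  )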

I think I'll close this issue for now; however, some users are still facing it, as seen in #33 and #60.