Closed DarkAlchy closed 1 year ago
bitsandbytes might not work correctly. Could you try replacing the .dll file according to the following comment?
https://github.com/kohya-ss/sd-scripts/issues/44#issuecomment-1375690372
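(For reference, a rough sketch of what that kind of workaround usually amounts to, assuming the linked comment's approach of dropping a CUDA 11.6 build of the bitsandbytes library into the installed package; the exact file to download and any accompanying edits come from that comment, and the source path below is only a placeholder:

D:\kohya_ss>copy /Y "D:\downloads\libbitsandbytes_cuda116.dll" ".\venv\Lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll"

The training log's "CUDA SETUP: Loading binary ..." line shows which DLL actually gets picked up.)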
Already did that, and it loaded. I can tell you that on a 4090 it is (for 1 image) 40 s/it. That is beyond hideous AND, strangely enough, my 1060 was 45 s. The person I just set up is also using bitsandbytes on their 4090, and the first thing they said to me was how ungodly slow this was, as if it were using the CPU. Fact is, he gets 35 it/s using Automatic1111's TI embedding trainer, while on yours his 4090 gets 40 SECONDS/it. Something is terribly wrong, and it affects us both.
1 train images with repeating.
use template for training captions. is object: {args.use_object_template}
loading image sizes.
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 142.73it/s]
prepare dataset
Replace CrossAttention.forward to use xformers
prepare optimizer, data loader etc.
CUDA SETUP: Loading binary D:\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cudaall.dll...
use 8-bit Adam optimizer
49416 tensor(49408)
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 1
num epochs / epoch数: 10
batch size per device / バッチサイズ: 1
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
gradient ccumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 10
steps: 0%| | 0/10 [00:00<?, ?it/s]epoch 1/10
steps: 10%|██████▎ | 1/10 [00:46<07:02, 46.98s/it, loss=0.0188]torch.Size([8, 768]) torch.Size([8, 768]) tensor(0., device='cuda:0') tensor(0., device='cuda:0')
epoch 2/10
steps: 20%|████████████▊ | 2/10 [01:32<06:10, 46.34s/it, loss=0.176]torch.Size([8, 768]) torch.Size([8, 768]) tensor(7.9423e-06, device='cuda:0') tensor(-0.0010, device='cuda:0')
epoch 3/10
steps: 30%|███████████████████▏ | 3/10 [02:18<05:22, 46.02s/it, loss=0.701]torch.Size([8, 768]) torch.Size([8, 768]) tensor(5.8421e-06, device='cuda:0') tensor(-0.0010, device='cuda:0')
epoch 4/10
steps: 40%|█████████████████████████▌ | 4/10 [03:04<04:36, 46.02s/it, loss=0.136]torch.Size([8, 768]) torch.Size([8, 768]) tensor(-1.2490e-08, device='cuda:0') tensor(-0.0009, device='cuda:0')
If you've installed CUDA 12, please uninstall it and re-install CUDA 11.x (11.6 and 11.8 are working in my environment).
If you are already using CUDA 11, something else seems to be wrong. Could you please copy and paste the command line you use to run the training and the result of pip list?
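(For what it's worth, the version check and the requested output can be gathered from the activated venv with standard tools; none of these commands are specific to this repo, and nvcc only works if the CUDA toolkit is on PATH:

D:\kohya_ss>".\venv\scripts\activate"
(venv) D:\kohya_ss>nvcc --version
(venv) D:\kohya_ss>nvidia-smi
(venv) D:\kohya_ss>pip list

nvcc --version reports the installed CUDA toolkit, while nvidia-smi reports the CUDA version the driver supports, which can differ.)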
D:\kohya_ss>".\venv\scripts\activate"
(venv) D:\kohya_ss>pip list Package Version
absl-py 1.4.0 accelerate 0.15.0 aiohttp 3.8.3 aiosignal 1.3.1 albumentations 1.3.0 altair 4.2.1 anyio 3.6.2 astunparse 1.6.3 async-timeout 4.0.2 attrs 22.2.0 bitsandbytes 0.35.0 cachetools 5.3.0 certifi 2022.12.7 charset-normalizer 2.1.1 click 8.1.3 colorama 0.4.6 contourpy 1.0.7 cycler 0.11.0 diffusers 0.10.2 easygui 0.98.3 einops 0.6.0 entrypoints 0.4 fairscale 0.4.13 fastapi 0.89.1 ffmpy 0.3.0 filelock 3.9.0 flatbuffers 23.1.21 fonttools 4.38.0 frozenlist 1.3.3 fsspec 2023.1.0 ftfy 6.1.1 gast 0.4.0 google-auth 2.16.0 google-auth-oauthlib 0.4.6 google-pasta 0.2.0 gradio 3.15.0 grpcio 1.51.1 h11 0.14.0 h5py 3.8.0 httpcore 0.16.3 httpx 0.23.3 huggingface-hub 0.12.0 idna 3.4 imageio 2.25.0 importlib-metadata 6.0.0 Jinja2 3.1.2 joblib 1.2.0 jsonschema 4.17.3 keras 2.10.0 Keras-Preprocessing 1.1.2 kiwisolver 1.4.4 libclang 15.0.6.1 library 1.0.2 lightning-utilities 0.6.0.post0 linkify-it-py 1.0.3 Markdown 3.4.1 markdown-it-py 2.1.0 MarkupSafe 2.1.2 matplotlib 3.6.3 mdit-py-plugins 0.3.3 mdurl 0.1.2 multidict 6.0.4 networkx 3.0 numpy 1.24.1 oauthlib 3.2.2 opencv-python 4.7.0.68 opencv-python-headless 4.7.0.68 opt-einsum 3.3.0 orjson 3.8.5 packaging 23.0 pandas 1.5.3 Pillow 9.4.0 pip 22.3.1 protobuf 3.19.6 psutil 5.9.4 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycryptodome 3.16.0 pydantic 1.10.4 pydub 0.25.1 pyparsing 3.0.9 pyrsistent 0.19.3 python-dateutil 2.8.2 python-multipart 0.0.5 pytorch-lightning 1.9.0 pytz 2022.7.1 PyWavelets 1.4.1 PyYAML 6.0 qudida 0.0.4 regex 2022.10.31 requests 2.28.2 requests-oauthlib 1.3.1 rfc3986 1.5.0 rsa 4.9 safetensors 0.2.6 scikit-image 0.19.3 scikit-learn 1.2.1 scipy 1.10.0 setuptools 63.2.0 six 1.16.0 sniffio 1.3.0 starlette 0.22.0 tensorboard 2.10.1 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorflow 2.10.1 tensorflow-estimator 2.10.0 tensorflow-io-gcs-filesystem 0.30.0 termcolor 2.2.0 threadpoolctl 3.1.0 tifffile 2023.1.23.1 timm 0.6.12 tk 0.1.0 tokenizers 0.13.2 toolz 0.12.0 torch 1.12.1+cu116 torchmetrics 0.11.0 torchvision 0.13.1+cu116 tqdm 4.64.1 transformers 4.25.1 typing_extensions 4.4.0 uc-micro-py 1.0.1 urllib3 1.26.14 uvicorn 0.20.0 wcwidth 0.2.6 websockets 10.4 Werkzeug 2.2.2 wheel 0.38.4 wrapt 1.14.1 xformers 0.0.14.dev0 yarl 1.8.2 zipp 3.11.0
Folder 1_testing person: 1 steps
max_train_steps = 100
stop_text_encoder_training = 0
lr_warmup_steps = 10
accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="E:\dest\img" --resolution=512,512 --output_dir="E:\dest\model" --logging_dir="E:\dest\log" --save_model_as=ckpt --output_name="test" --learning_rate="1e-3" --lr_scheduler="cosine" --lr_warmup_steps="10" --train_batch_size="1" --max_train_steps="100" --save_every_n_epochs="10000" --mixed_precision="fp16" --save_precision="fp16" --seed="1234" --xformers --use_8bit_adam --token_string=test --init_word=* --num_vectors_per_token=8 --use_object_template
I've tested with the same packages and the same command on my RTX 4090, and it seems to work fine.
(I modified token_string because test is already in the tokenizer, and the cuDNN DLLs in venv\Lib\site-packages\torch\lib were updated to the DLLs from cuDNN 8.6.)
Perhaps there is not enough main memory. Adding --max_data_loader_n_workers=1 to the command might reduce the memory usage.
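(As a concrete sketch of that suggestion, the flag is simply appended to the same accelerate launch command quoted above; the "..." stands for the rest of the original, unchanged arguments:

accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --num_vectors_per_token=8 --use_object_template --max_data_loader_n_workers=1

Everything else stays exactly as before.)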
CUDA SETUP: Loading binary C:\Tmp\issue125\sd-scripts\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit Adam optimizer
49416 tensor(49408)
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 150
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 150
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
gradient ccumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 100
steps: 0%| | 0/100 [00:00<?, ?it/s]epoch 1/1
steps: 100%|████████████████████████████████████████████████████████████| 100/100 [00:37<00:00, 2.69it/s, loss=0.0468]torch.Size([8, 768]) torch.Size([8, 768]) tensor(3.7472e-05, device='cuda:0') tensor(-0.0315, device='cuda:0')
save trained model to R:\dest\model\test.ckpt
model saved.
steps: 100%|████████████████████████████████████████████████████████████| 100/100 [00:37<00:00, 2.63it/s, loss=0.0468]
Tue Jan 31 08:20:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 527.56 Driver Version: 527.56 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:01:00.0 Off | Off |
| 0% 37C P2 171W / 337W | 5175MiB / 24564MiB | 85% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 12216 C ...thon\Python310\python.exe N/A |
+-----------------------------------------------------------------------------+
I can tell you a 4090 user (the one I mentioned earlier, which you may have skipped over) is having the EXACT same issue I am, so between what you're saying and what I saw from him, something needs to get fixed somewhere. This is already using 1 GB less VRAM than Automatic1111, so VRAM isn't the issue here.
Hi,
I definitely have enough RAM (64 GB system RAM and 24 GB VRAM on the 4090).
CUDA 12 is uninstalled, and I installed CUDA 11.7.0.
That is the friend I was helping with the 4090, and he got the same results as I did. You seriously have an issue with this software, so let's figure out where it is.
I am using CUDA 11.8.0. Could you please ask your friend to check the CUDA version?
As he said above, 11.7. He just went to bed, but it is 11.7. My CUDA version is also 11.7.
I think CUDA 11.7 may work. How much main RAM (not VRAM) is available during script execution? If swapping occurs, the training speed will be greatly reduced.
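(One quick way to check this on Windows while training is running; both commands are stock Windows tools, not part of this repo:

systeminfo | find "Available Physical Memory"
wmic OS get FreePhysicalMemory

Task Manager's Performance tab shows the same figure; if available memory drops near zero during training, swapping is likely.)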
He has 64 GB; I have 32 GB.
By the way, I think you are overlooking the simple fact that TI training works in Automatic1111 at 10x the speed of yours, and it uses 1 GB more VRAM.
Unfortunately, the script implementations are so different that it seems difficult to make a simple comparison.
Can you try removing the --use_8bit_adam option and adding the --cache_latents option? If there is something wrong with bitsandbytes, that might help.
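(Sketched against the command quoted earlier, that change would look like the following; the "..." stands for the unchanged arguments:

accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --xformers --cache_latents --token_string=test --init_word=* --num_vectors_per_token=8 --use_object_template

i.e. --use_8bit_adam dropped and --cache_latents added.)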
I tried that (I tried EVERYTHING before I posted this ticket) and there was no difference in speed.
Removing --xformers and adding --mem_eff_attn also did not work?
Just tried that, and it made it only twice as slow as Automatic1111 (down from ~46 s to 7.5 s).
Ok, thanks! There seems to be a problem with xformers. Unfortunately, I don't have a GTX 10xx environment, so I can't test it, but the official xformers wheels seem to be on the following page; windows-2019-py3.10-torch1.12.1+cu116.zip might be suitable:
https://github.com/facebookresearch/xformers/actions/runs/3696288469
Uninstalling the current xformers and installing that one may work.
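(A rough sketch of that swap from inside the venv; the wheel filename is a placeholder, since the actual name comes from the extracted CI artifact:

(venv) D:\kohya_ss>pip uninstall -y xformers
(venv) D:\kohya_ss>pip install D:\downloads\xformers-<version>-cp310-cp310-win_amd64.whl

The zip from the Actions run has to be extracted first, and the wheel must match Python 3.10 and torch 1.12.1+cu116.)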
That doesn't explain the same issue on his 4090, though, and from everything I have read, the memory-efficient cross-attention option isn't supposed to be needed with a 24 GB card. Absolutely amazing that Automatic1111 got this to work with the same xformers version you are using, though.
It certainly does not explain his issue, but the version of xformers in README.md seems to work on my 4090...
Apparently xformers is very sensitive to the environment, so there is some environment-dependent problem.
I believe so too.
Just one further piece of information here:
what we tried was training a v1 model, not a v2. Does that work on your 4090, too?
@kohya-ss have you seen https://github.com/bmaltais/kohya_ss/issues/36, which is about the long pauses between epochs?
I tried without xformers and it makes no difference with regard to this issue.
I was able to improve performance with this config setup
"pretrained_model_name_or_path": "stabilityai/stable-diffusion-2-1", "v2": true, "v_parameterization": true, "logging_dir": "D:\AI\References\IndianWoman\Training\log", "train_data_dir": "D:\AI\References\IndianWoman\Training\img", "reg_data_dir": "", "output_dir": "D:\AI\References\IndianWoman\Training\model", "max_resolution": "768,768", "learning_rate": ".005", "lr_scheduler": "constant", "lr_warmup": "0", "train_batch_size": 1, "epoch": "2", "save_every_n_epochs": "1", "mixed_precision": "bf16", "save_precision": "bf16", "seed": "42069", "num_cpu_threads_per_process": 2, "cache_latents": true, "caption_extension": ".txt", "enable_bucket": true, "gradient_checkpointing": false, "full_fp16": false, "no_token_padding": false, "stop_text_encoder_training": 0, "use_8bit_adam": true, "xformers": true, "save_model_as": "ckpt", "shuffle_caption": false, "save_state": false, "resume": "", "prior_loss_weight": 1.0, "color_aug": false, "flip_aug": false, "clip_skip": 2, "vae": "", "output_name": "IndianWoman", "max_token_length": "225", "max_train_epochs": "", "max_data_loader_n_workers": "1", "mem_eff_attn": false, "gradient_accumulation_steps": 1.0, "model_list": "stabilityai/stable-diffusion-2-1", "token_string": "IndiWoman", "init_word": "woman", "num_vectors_per_token": 8, "max_train_steps": "", "weights": "", "template": "caption", "keep_tokens": "0" }
I used 16 images at 100 repeats with this setup and got 1.07 it/s on a 3090 with these settings.
Hope this helps anyone else.
The attached screenshot shows the same settings (apart from the epochs) for people who are more visual.
--persistent_data_loader_workers should solve the long-pause issue. Please try that option.
The root cause is pytorch/pytorch#12831. On Windows the multiprocess dataloader is unusable: there is an extremely large overhead when starting each new worker.
Hence, as @kohya-ss mentioned, the problem can be solved either by setting --persistent_data_loader_workers, so the large overhead is paid only once at the start of training, or by setting --max_data_loader_n_workers 0, so multiprocess data loading is never triggered.
This problem only occurs on Windows. I wonder if that's the case for you, @DarkAlchy.
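(Concretely, either of these variants of the launch command sidesteps the Windows worker-spawn overhead; the "..." stands for whatever arguments are already in use:

accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --persistent_data_loader_workers
accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --max_data_loader_n_workers 0

The first keeps the dataloader worker processes alive between epochs; the second does all data loading in the main process.)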
@BowenBao Not sure, to be honest, but I moved on from embeddings to LoRA. I tried the worker options Kohya suggested to no avail and gave up on this.
Can someone explain: for an RTX 4090, do I need to enable "use xformers" or not?
xformers is not mandatory. If you have xformers enabled and training does not work, turn it off.
Over twice as slow using 512x512, versus Auto's 768x768. My GPU is barely being touched, while it is at 100% in Automatic1111.
Edit: the same exact training is TEN times slower with kohya_ss than in Automatic1111, but why?
[Screenshots: Automatic1111 while training / kohya_ss while training]
Looking at the CUDA graphs, I have no idea what it is sitting there doing in comparison to Automatic1111.