Closed DarkAlchy closed 1 year ago
bitsandbytes might not work correctly. Could you try replacing the .dll file according to the following comment?
https://github.com/kohya-ss/sd-scripts/issues/44#issuecomment-1375690372
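(For reference, a rough sketch of what that kind of workaround usually amounts to, assuming the linked comment's approach of dropping a CUDA 11.6 build of the bitsandbytes library into the installed package; the exact file to download and any accompanying edits come from that comment, and the source path below is only a placeholder:

D:\kohya_ss>copy /Y "D:\downloads\libbitsandbytes_cuda116.dll" ".\venv\Lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll"

The training log's "CUDA SETUP: Loading binary ..." line shows which DLL actually gets picked up.)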
Already did that, and it loaded. I can tell you that on a 4090 it is (for 1 image) 40 s/it. That is beyond hideous AND, strangely enough, my 1060 was 45 s. The person I just set up is also using bitsandbytes on their 4090, and the first thing they said to me was how ungodly slow this was, as if it were using the CPU. Fact is, he gets 35 it/s using Automatic1111's TI embedding trainer, while on yours his 4090 gets 40 SECONDS/it. Something is terribly wrong, and it affects us both.
1 train images with repeating.
use template for training captions. is object: {args.use_object_template}
loading image sizes.
100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 142.73it/s]
prepare dataset
Replace CrossAttention.forward to use xformers
prepare optimizer, data loader etc.
CUDA SETUP: Loading binary D:\kohya_ss\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cudaall.dll...
use 8-bit Adam optimizer
49416 tensor(49408)
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 1
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 1
num epochs / epoch数: 10
batch size per device / バッチサイズ: 1
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
gradient ccumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 10
steps: 0%| | 0/10 [00:00<?, ?it/s]epoch 1/10
steps: 10%|██████▎ | 1/10 [00:46<07:02, 46.98s/it, loss=0.0188]torch.Size([8, 768]) torch.Size([8, 768]) tensor(0., device='cuda:0') tensor(0., device='cuda:0')
epoch 2/10
steps: 20%|████████████▊ | 2/10 [01:32<06:10, 46.34s/it, loss=0.176]torch.Size([8, 768]) torch.Size([8, 768]) tensor(7.9423e-06, device='cuda:0') tensor(-0.0010, device='cuda:0')
epoch 3/10
steps: 30%|███████████████████▏ | 3/10 [02:18<05:22, 46.02s/it, loss=0.701]torch.Size([8, 768]) torch.Size([8, 768]) tensor(5.8421e-06, device='cuda:0') tensor(-0.0010, device='cuda:0')
epoch 4/10
steps: 40%|█████████████████████████▌ | 4/10 [03:04<04:36, 46.02s/it, loss=0.136]torch.Size([8, 768]) torch.Size([8, 768]) tensor(-1.2490e-08, device='cuda:0') tensor(-0.0009, device='cuda:0')
If you've installed CUDA 12, please uninstall it and re-install CUDA 11.x (11.6 and 11.8 are working in my environment).
If you are already using CUDA 11, something else seems to be wrong. Could you please copy and paste the command line you use to run the training and the result of pip list?
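(For what it's worth, the version check and the requested output can be gathered from the activated venv with standard tools; none of these commands are specific to this repo, and nvcc only works if the CUDA toolkit is on PATH:

D:\kohya_ss>".\venv\scripts\activate"
(venv) D:\kohya_ss>nvcc --version
(venv) D:\kohya_ss>nvidia-smi
(venv) D:\kohya_ss>pip list

nvcc --version reports the installed CUDA toolkit, while nvidia-smi reports the CUDA version the driver supports, which can differ.)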
D:\kohya_ss>".\venv\scripts\activate"
(venv) D:\kohya_ss>pip list Package Version
absl-py 1.4.0 accelerate 0.15.0 aiohttp 3.8.3 aiosignal 1.3.1 albumentations 1.3.0 altair 4.2.1 anyio 3.6.2 astunparse 1.6.3 async-timeout 4.0.2 attrs 22.2.0 bitsandbytes 0.35.0 cachetools 5.3.0 certifi 2022.12.7 charset-normalizer 2.1.1 click 8.1.3 colorama 0.4.6 contourpy 1.0.7 cycler 0.11.0 diffusers 0.10.2 easygui 0.98.3 einops 0.6.0 entrypoints 0.4 fairscale 0.4.13 fastapi 0.89.1 ffmpy 0.3.0 filelock 3.9.0 flatbuffers 23.1.21 fonttools 4.38.0 frozenlist 1.3.3 fsspec 2023.1.0 ftfy 6.1.1 gast 0.4.0 google-auth 2.16.0 google-auth-oauthlib 0.4.6 google-pasta 0.2.0 gradio 3.15.0 grpcio 1.51.1 h11 0.14.0 h5py 3.8.0 httpcore 0.16.3 httpx 0.23.3 huggingface-hub 0.12.0 idna 3.4 imageio 2.25.0 importlib-metadata 6.0.0 Jinja2 3.1.2 joblib 1.2.0 jsonschema 4.17.3 keras 2.10.0 Keras-Preprocessing 1.1.2 kiwisolver 1.4.4 libclang 15.0.6.1 library 1.0.2 lightning-utilities 0.6.0.post0 linkify-it-py 1.0.3 Markdown 3.4.1 markdown-it-py 2.1.0 MarkupSafe 2.1.2 matplotlib 3.6.3 mdit-py-plugins 0.3.3 mdurl 0.1.2 multidict 6.0.4 networkx 3.0 numpy 1.24.1 oauthlib 3.2.2 opencv-python 4.7.0.68 opencv-python-headless 4.7.0.68 opt-einsum 3.3.0 orjson 3.8.5 packaging 23.0 pandas 1.5.3 Pillow 9.4.0 pip 22.3.1 protobuf 3.19.6 psutil 5.9.4 pyasn1 0.4.8 pyasn1-modules 0.2.8 pycryptodome 3.16.0 pydantic 1.10.4 pydub 0.25.1 pyparsing 3.0.9 pyrsistent 0.19.3 python-dateutil 2.8.2 python-multipart 0.0.5 pytorch-lightning 1.9.0 pytz 2022.7.1 PyWavelets 1.4.1 PyYAML 6.0 qudida 0.0.4 regex 2022.10.31 requests 2.28.2 requests-oauthlib 1.3.1 rfc3986 1.5.0 rsa 4.9 safetensors 0.2.6 scikit-image 0.19.3 scikit-learn 1.2.1 scipy 1.10.0 setuptools 63.2.0 six 1.16.0 sniffio 1.3.0 starlette 0.22.0 tensorboard 2.10.1 tensorboard-data-server 0.6.1 tensorboard-plugin-wit 1.8.1 tensorflow 2.10.1 tensorflow-estimator 2.10.0 tensorflow-io-gcs-filesystem 0.30.0 termcolor 2.2.0 threadpoolctl 3.1.0 tifffile 2023.1.23.1 timm 0.6.12 tk 0.1.0 tokenizers 0.13.2 toolz 0.12.0 torch 1.12.1+cu116 torchmetrics 0.11.0 torchvision 0.13.1+cu116 tqdm 4.64.1 transformers 4.25.1 typing_extensions 4.4.0 uc-micro-py 1.0.1 urllib3 1.26.14 uvicorn 0.20.0 wcwidth 0.2.6 websockets 10.4 Werkzeug 2.2.2 wheel 0.38.4 wrapt 1.14.1 xformers 0.0.14.dev0 yarl 1.8.2 zipp 3.11.0
Folder 1_testing person: 1 steps
max_train_steps = 100
stop_text_encoder_training = 0
lr_warmup_steps = 10
accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" --pretrained_model_name_or_path="runwayml/stable-diffusion-v1-5" --train_data_dir="E:\dest\img" --resolution=512,512 --output_dir="E:\dest\model" --logging_dir="E:\dest\log" --save_model_as=ckpt --output_name="test" --learning_rate="1e-3" --lr_scheduler="cosine" --lr_warmup_steps="10" --train_batch_size="1" --max_train_steps="100" --save_every_n_epochs="10000" --mixed_precision="fp16" --save_precision="fp16" --seed="1234" --xformers --use_8bit_adam --token_string=test --init_word=* --num_vectors_per_token=8 --use_object_template
I've tested with the same packages and the same command on my RTX 4090, and it seems to work fine.
(I modified token_string because test is already in the tokenizer, and the cuDNN DLLs in venv\Lib\site-packages\torch\lib were updated to the DLLs from cuDNN 8.6.)
Perhaps there is not enough main memory. Adding --max_data_loader_n_workers=1 to the command might reduce the memory usage.
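(As a concrete sketch of that suggestion, the flag is simply appended to the same accelerate launch command quoted above; the "..." stands for the rest of the original, unchanged arguments:

accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --num_vectors_per_token=8 --use_object_template --max_data_loader_n_workers=1

Everything else stays exactly as before.)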
CUDA SETUP: Loading binary C:\Tmp\issue125\sd-scripts\venv\lib\site-packages\bitsandbytes\libbitsandbytes_cuda116.dll...
use 8-bit Adam optimizer
49416 tensor(49408)
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 150
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 150
num epochs / epoch数: 1
batch size per device / バッチサイズ: 1
total train batch size (with parallel & distributed & accumulation) / 総バッチサイズ(並列学習、勾配合計含む): 1
gradient ccumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 100
steps: 0%| | 0/100 [00:00<?, ?it/s]epoch 1/1
steps: 100%|████████████████████████████████████████████████████████████| 100/100 [00:37<00:00, 2.69it/s, loss=0.0468]torch.Size([8, 768]) torch.Size([8, 768]) tensor(3.7472e-05, device='cuda:0') tensor(-0.0315, device='cuda:0')
save trained model to R:\dest\model\test.ckpt
model saved.
steps: 100%|████████████████████████████████████████████████████████████| 100/100 [00:37<00:00, 2.63it/s, loss=0.0468]
Tue Jan 31 08:20:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 527.56 Driver Version: 527.56 CUDA Version: 12.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name TCC/WDDM | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... WDDM | 00000000:01:00.0 Off | Off |
| 0% 37C P2 171W / 337W | 5175MiB / 24564MiB | 85% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 12216 C ...thon\Python310\python.exe N/A |
+-----------------------------------------------------------------------------+
I can tell you a 4090 user (the one I mentioned earlier, which you may have skipped over) is having the EXACT same issue I am, so between what you're saying and what I saw from him, something needs to get fixed somewhere. This is already using 1 GB less VRAM than Automatic1111, so VRAM isn't the issue here.
Hi,
I definitely have enough RAM (64 GB system RAM and 24 GB VRAM on the 4090).
CUDA 12 is uninstalled, and I installed CUDA 11.7.0.
That is the friend I was helping with the 4090, and he got the same results as I did. You seriously have an issue with this software, so let's figure out where it is.
I am using CUDA 11.8.0. Could you please ask your friend to check the CUDA version?
As he said above, 11.7. He just went to bed, but it is 11.7. My CUDA version is also 11.7.
I think CUDA 11.7 may work. How much main RAM (not VRAM) is available during script execution? If swapping occurs, the training speed will be greatly reduced.
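(One quick way to check this on Windows while training is running; both commands are stock Windows tools, not part of this repo:

systeminfo | find "Available Physical Memory"
wmic OS get FreePhysicalMemory

Task Manager's Performance tab shows the same figure; if available memory drops near zero during training, swapping is likely.)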
He has 64 GB; I have 32 GB.
By the way, I think you are overlooking the simple fact that TI training works in Automatic1111 at 10x the speed of yours, and it uses 1 GB more VRAM.
Unfortunately, the script implementations are so different that it seems difficult to make a simple comparison.
Can you try removing the --use_8bit_adam option and adding the --cache_latents option? If there is something wrong with bitsandbytes, that might help.
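(Sketched against the command quoted earlier, that change would look like the following; the "..." stands for the unchanged arguments:

accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --xformers --cache_latents --token_string=test --init_word=* --num_vectors_per_token=8 --use_object_template

i.e. --use_8bit_adam dropped and --cache_latents added.)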
I tried that (I tried EVERYTHING before I posted this ticket) and there was no difference in speed.
Removing --xformers and adding --mem_eff_attn also did not work?
Just tried that, and it made it only twice as slow as Automatic1111 (down from ~46 s to 7.5 s).
Ok, thanks! There seems to be a problem with xformers. Unfortunately, I don't have a GTX 10xx environment, so I can't test it, but the official xformers wheels seem to be on the following page; windows-2019-py3.10-torch1.12.1+cu116.zip might be suitable:
https://github.com/facebookresearch/xformers/actions/runs/3696288469
Uninstalling the current xformers and installing that one may work.
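(A rough sketch of that swap from inside the venv; the wheel filename is a placeholder, since the actual name comes from the extracted CI artifact:

(venv) D:\kohya_ss>pip uninstall -y xformers
(venv) D:\kohya_ss>pip install D:\downloads\xformers-<version>-cp310-cp310-win_amd64.whl

The zip from the Actions run has to be extracted first, and the wheel must match Python 3.10 and torch 1.12.1+cu116.)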
That doesn't explain the same issue on his 4090, though, and from everything I have read, the memory-efficient cross-attention option isn't supposed to be needed with a 24 GB card. Absolutely amazing that Automatic1111 got this to work with the same xformers version you are using, though.
It certainly does not explain his issue, but the version of xformers in README.md seems to work on my 4090...
Apparently xformers is very sensitive to the environment, so there is some environment-dependent problem.
I believe so too.
Just one further piece of information here:
what we tried was training a v1 model, not a v2. Does that work on your 4090, too?
@kohya-ss have you seen https://github.com/bmaltais/kohya_ss/issues/36, which is about the long pauses between epochs?
I tried without xformers and it makes no difference with regard to this issue.
I was able to improve performance with this config setup
"pretrained_model_name_or_path": "stabilityai/stable-diffusion-2-1", "v2": true, "v_parameterization": true, "logging_dir": "D:\AI\References\IndianWoman\Training\log", "train_data_dir": "D:\AI\References\IndianWoman\Training\img", "reg_data_dir": "", "output_dir": "D:\AI\References\IndianWoman\Training\model", "max_resolution": "768,768", "learning_rate": ".005", "lr_scheduler": "constant", "lr_warmup": "0", "train_batch_size": 1, "epoch": "2", "save_every_n_epochs": "1", "mixed_precision": "bf16", "save_precision": "bf16", "seed": "42069", "num_cpu_threads_per_process": 2, "cache_latents": true, "caption_extension": ".txt", "enable_bucket": true, "gradient_checkpointing": false, "full_fp16": false, "no_token_padding": false, "stop_text_encoder_training": 0, "use_8bit_adam": true, "xformers": true, "save_model_as": "ckpt", "shuffle_caption": false, "save_state": false, "resume": "", "prior_loss_weight": 1.0, "color_aug": false, "flip_aug": false, "clip_skip": 2, "vae": "", "output_name": "IndianWoman", "max_token_length": "225", "max_train_epochs": "", "max_data_loader_n_workers": "1", "mem_eff_attn": false, "gradient_accumulation_steps": 1.0, "model_list": "stabilityai/stable-diffusion-2-1", "token_string": "IndiWoman", "init_word": "woman", "num_vectors_per_token": 8, "max_train_steps": "", "weights": "", "template": "caption", "keep_tokens": "0" }
I used 16 images at 100 repeats with this setup and got 1.07 it/s on a 3090 with these settings.
Hope this helps anyone else.
The attached screenshot shows the same settings (apart from the epochs) for people who are more visual.
--persistent_data_loader_workers should solve the long-pause issue. Please try that option.
The root cause is pytorch/pytorch#12831. On Windows the multiprocess dataloader is unusable: there is an extremely large overhead when starting each new worker.
Hence, as @kohya-ss mentioned, the problem can be solved either by setting --persistent_data_loader_workers, so the large overhead is paid only once at the start of training, or by setting --max_data_loader_n_workers 0, so multiprocess data loading is never triggered.
This problem only occurs on Windows. I wonder if that's the case for you, @DarkAlchy.
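(Concretely, either of these variants of the launch command sidesteps the Windows worker-spawn overhead; the "..." stands for whatever arguments are already in use:

accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --persistent_data_loader_workers
accelerate launch --num_cpu_threads_per_process=2 "train_textual_inversion.py" ... --max_data_loader_n_workers 0

The first keeps the dataloader worker processes alive between epochs; the second does all data loading in the main process.)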
@BowenBao Not sure, to be honest, but I moved on from embeddings to LoRA. I tried the worker options Kohya suggested to no avail and gave up on this.
Can someone explain: for an RTX 4090, do I need to enable "use xformers" or not?
xformers is not mandatory. If you have xformers enabled and training does not work, turn it off.
Over twice as slow using 512x512, versus Auto's 768x768. My GPU is barely being touched, while it is at 100% in Automatic1111.
Edit: the same exact training is TEN times slower with kohya_ss than in Automatic1111, but why?
[Screenshots: Automatic1111 while training / kohya_ss while training]
Looking at the CUDA graphs, I have no idea what it is sitting there doing in comparison to Automatic1111.