Open sdbds opened 1 year ago
Hmm, I just tried it on Windows 10 with a 3090 Ti and I see a slight improvement (ca. 1.43x).
torch 2.0.0+cu118, cuda 11.8, cudnn 8700
epoch 1/2
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
D:\AI\sd-scripts\venv\lib\site-packages\xformers\ops\fmha\flash.py:338: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 150/300 [03:11<03:11, 1.28s/it, loss=0.134]
epoch 2/2
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [06:07<00:00, 1.22s/it, loss=0.121]
torch 1.12.1+cu116, cuda 11.6, cudnn 8302:
epoch 1/2
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 150/300 [04:28<04:28, 1.79s/it, loss=0.134]
epoch 2/2
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [08:45<00:00, 1.75s/it, loss=0.121]
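The "ca. 1.43x" figure can be checked from the two logs above. This is only a quick arithmetic sketch using the final progress-bar line of each run; the helper name `to_seconds` is made up for illustration.

```python
def to_seconds(mm_ss: str) -> int:
    """Convert a tqdm elapsed time such as '06:07' to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)

# Values copied from the final progress-bar line of each run above.
new_total, new_rate = to_seconds("06:07"), 1.22  # torch 2.0.0+cu118, s/it
old_total, old_rate = to_seconds("08:45"), 1.75  # torch 1.12.1+cu116, s/it

speedup_by_time = old_total / new_total  # ratio of total wall time
speedup_by_rate = old_rate / new_rate    # ratio of seconds per iteration

print(f"by wall time: {speedup_by_time:.2f}x, by s/it: {speedup_by_rate:.2f}x")
```

Both ratios round to 1.43, so the wall-time and per-step numbers agree.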
If anyone wants to try it on Windows:
CD /D "D:\AI\sd-scripts"
git pull
.\venv\Scripts\activate
pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install --use-pep517 --upgrade -r requirements.txt
pip install -U -I --no-deps https://files.pythonhosted.org/packages/d6/f7/02662286419a2652c899e2b3d1913c47723fc164b4ac06a85f769c291013/xformers-0.0.17rc482-cp310-cp310-win_amd64.whl
As you can see above, the new torch prints an error about the triton module, but the script and the training still work. If you try to install triton, you get an error:
(venv) D:\AI\sd-scripts>pip install triton
ERROR: Could not find a version that satisfies the requirement triton (from versions: none)
ERROR: No matching distribution found for triton
So it looks like triton is not available for Windows. I guess one has to ignore the triton errors for now.
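If you want to confirm from Python whether triton is present in the venv (rather than relying on the warning), a minimal sketch could look like this. It only uses the standard library; the function name `triton_available` is hypothetical and not part of xformers or sd-scripts.

```python
import importlib.util
import platform


def triton_available() -> bool:
    """Return True if a top-level 'triton' package can be found in this environment."""
    return importlib.util.find_spec("triton") is not None


if triton_available():
    print("triton found")
else:
    # On the Windows setup described above this branch is expected.
    print(f"triton not found on {platform.system()}; the xformers warning can be ignored")
```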
I recommend setting batch_size to an even number. I used a 3070 Ti with adam8bit for 12000 steps and got 3 it/s at the highest speed.
So you had 1.5 it/s with the same settings and the same dataset before switching to torch 2.0? To be clear, are we really talking about "it/s" here and not "s/it"? (Kohya's script shows "s/it" when running, so the lower the number, the better. You also have to test with the same dataset, settings, and batch size to draw any conclusions about speed, because a lower s/it value alone means nothing if a different batch size was used.)
The LoRA training test that I posted above was done with the same dataset (num train images * repeats / 学習画像の数×繰り返し回数: 1500) and train_batch_size=5. I also used DAdaptation for that test, with:
optimizer_type = "DAdaptation"
resolution = "768,768"
cache_latents = true
enable_bucket = true
save_precision = "fp16"
save_every_n_epochs = 1
train_batch_size = 5
xformers = true
max_train_epochs = 2
max_data_loader_n_workers = 4
persistent_data_loader_workers = true
mixed_precision = "fp16"
learning_rate = 1.0
lr_scheduler = "cosine"
unet_lr = 1.0
text_encoder_lr = 1.0
network_module = "networks.lora"
network_dim = 128
network_alpha = 128.0
I tested the other batch sizes on the same dataset with torch 2.0.0+cu118, and the fastest run to finish was with batch size 5:
train_batch_size=2 (DAdaptation)
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 375/750 [04:37<04:37, 1.35it/s, loss=0.133]
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 750/750 [08:56<00:00, 1.40it/s, loss=0.133]
train_batch_size=3 (DAdaptation)
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 253/506 [03:48<03:48, 1.11it/s, loss=0.134]
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 506/506 [07:24<00:00, 1.14it/s, loss=0.131]
train_batch_size=4 (DAdaptation)
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 189/378 [03:24<03:24, 1.08s/it, loss=0.138]
epoch 2/2
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 378/378 [06:33<00:00, 1.04s/it, loss=0.119]
train_batch_size=5 (DAdaptation)
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 150/300 [03:10<03:10, 1.27s/it, loss=0.134]
epoch 2/2
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [06:05<00:00, 1.22s/it, loss=0.121]
train_batch_size=6 (DAdaptation)
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 128/256 [04:30<04:30, 2.11s/it, loss=0.135]
epoch 2/2
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 256/256 [08:39<00:00, 2.03s/it, loss=0.127]
With AdamW8bit (r.768,768#o.AdamW8bit#s.cosine#d.128#a.128#l.3e-4#u.3e-4#t.4.5e-5) and batch size 5, training runs a bit faster than with DAdaptation. I have stopped using AdamW8bit for LoRA training, though, since I get better results with DAdaptation. Batch size 5 looks like the fastest here as well:
train_batch_size=4 (AdamW8bit)
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 189/378 [02:52<02:52, 1.10it/s, loss=0.137]
epoch 2/2
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 378/378 [05:28<00:00, 1.15it/s, loss=0.116]
train_batch_size=5 (AdamW8bit)
steps: 50%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▌ | 150/300 [02:44<02:44, 1.09s/it, loss=0.132]
epoch 2/2
steps: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [05:12<00:00, 1.04s/it, loss=0.116]
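Since s/it and it/s are not comparable across batch sizes, one way to compare the runs above is samples per second. The sketch below uses the step counts and elapsed times from the final progress-bar lines. Treating steps × batch size as the number of samples is an approximation (with bucketing, some batches are not full), and the helper names are made up for illustration.

```python
# (optimizer, batch size) -> (total steps, elapsed time), copied from the logs above.
runs = {
    ("DAdaptation", 2): (750, "08:56"),
    ("DAdaptation", 3): (506, "07:24"),
    ("DAdaptation", 4): (378, "06:33"),
    ("DAdaptation", 5): (300, "06:05"),
    ("DAdaptation", 6): (256, "08:39"),
    ("AdamW8bit", 4): (378, "05:28"),
    ("AdamW8bit", 5): (300, "05:12"),
}


def to_seconds(mm_ss: str) -> int:
    """Convert a tqdm elapsed time such as '06:05' to seconds."""
    minutes, seconds = mm_ss.split(":")
    return int(minutes) * 60 + int(seconds)


def samples_per_second(steps: int, batch_size: int, elapsed: str) -> float:
    """Approximate throughput, assuming every step processes a full batch."""
    return steps * batch_size / to_seconds(elapsed)


throughput = {
    key: samples_per_second(steps, key[1], elapsed)
    for key, (steps, elapsed) in runs.items()
}

for (optimizer, batch_size), rate in throughput.items():
    print(f"{optimizer:12s} batch {batch_size}: {rate:.2f} samples/s")

# Fastest DAdaptation run by approximate throughput.
fastest_dadapt = max(
    (key for key in throughput if key[0] == "DAdaptation"), key=throughput.get
)
print("fastest DAdaptation batch size:", fastest_dadapt[1])
```

By this measure batch size 5 comes out ahead for DAdaptation, which matches the elapsed times.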
To be clear, it switches to s/it (seconds per iteration) when one iteration takes more than a second. When one iteration takes less than a second, it switches to it/s.
So when you see s/it, your speed is very slow, and the higher the number, the worse. For example, 2 s/it is actually 0.5 it/s.
When you see it/s, your speed is faster, and the higher the number, the better.
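To compare numbers posted in either unit, they can be normalized to it/s. This is a small sketch, not part of tqdm or Kohya's script; the function name `to_it_per_s` is hypothetical.

```python
import re


def to_it_per_s(rate: str) -> float:
    """Convert a progress-bar rate such as '1.22s/it' or '1.40it/s' to iterations per second."""
    match = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*(s/it|it/s)\s*", rate)
    if match is None:
        raise ValueError(f"unrecognised rate: {rate!r}")
    value, unit = float(match.group(1)), match.group(2)
    if unit == "it/s":
        return value
    if value == 0:
        raise ValueError("0 s/it has no finite it/s equivalent")
    return 1.0 / value  # s/it is the reciprocal of it/s


print(to_it_per_s("2s/it"))     # the example from the post above
print(to_it_per_s("1.40it/s"))  # already in it/s, returned unchanged
```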
Of course, I'm sure it's it/s. However, the dataset I'm using is 512x512, which is faster than 768x768.
I recommend CUDA 12 and cuDNN 8800, as they speed things up by 10%. My friend tested it on a 4090 and also saw a doubling of speed. He uses 1024x1024; before, he got 1.6-2 s/it (0.5-0.625 it/s), and after updating to the latest torch and xformers he gets 1.25 it/s on the same dataset.
It's worth noting that the maximum speed I observed here occurred at high epochs, such as around epoch 20. The maximum speed was achieved around epochs 5-10. Speed increases may not be as significant in low epoch situations.
Oh, I never saw it switch during my training runs, so I thought "s/it" was always displayed by default. Also, IMHO it should stick to one unit (preferably "it/s"), since the switching is just confusing (as this case demonstrates).
I recommend CUDA 12 and cuDNN 8800 as they will speed up by 10%. My friend tested it on a 4090 and also saw a doubling of speed.
I'll try it, thanks
EDIT: Tested it. I see no real gains. It looks like only owners of 4090 cards are getting those crazy x2 speedups from torch 2 and the new CUDA. But the x1.4 speedup I got on my 3090 is not bad either :)
You need to copy those cudXX.dll files from cuda/bin and cudnn/bin to \venv\Lib\site-packages\torch\lib so that they take effect.
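The copy step could be scripted roughly as below. This is only a sketch: the function name, the `cud*.dll` pattern, and the example source paths are assumptions and need to be adjusted to the actual CUDA/cuDNN install. It overwrites files of the same name in the target folder, so back that folder up first.

```python
import shutil
from pathlib import Path


def copy_cuda_dlls(source_dirs, torch_lib_dir, pattern="cud*.dll"):
    """Copy DLLs matching `pattern` from each source dir into torch's lib dir.

    Returns the names of the copied files. Existing files are overwritten.
    """
    dest = Path(torch_lib_dir)
    if not dest.is_dir():
        raise FileNotFoundError(f"torch lib directory not found: {dest}")
    copied = []
    for src in map(Path, source_dirs):
        for dll in sorted(src.glob(pattern)):
            shutil.copy2(dll, dest / dll.name)
            copied.append(dll.name)
    return copied


# Example call; both source paths are hypothetical:
# copy_cuda_dlls([r"C:\cuda\bin", r"C:\cudnn\bin"],
#                r".\venv\Lib\site-packages\torch\lib")
```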
I know, I did that. https://developer.download.nvidia.com/compute/redist/cudnn/v8.8.0/local_installers/
cudnn_adv_infer64_8.dll
cudnn_adv_train64_8.dll
cudnn_cnn_infer64_8.dll
cudnn_cnn_train64_8.dll
cudnn_ops_infer64_8.dll
cudnn_ops_train64_8.dll
cudnn64_8.dll
I hope someone else with a 3090 can test and post their findings here.
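For anyone posting results, it helps to include the versions PyTorch actually loaded, in the same form as the "torch 2.0.0+cu118, cuda 11.8, cudnn 8700" lines above. A minimal sketch, assuming it is run inside the training venv; the function name `report_versions` is made up.

```python
def report_versions():
    """Return torch/CUDA/cuDNN versions as seen by PyTorch, or None if torch cannot be loaded."""
    try:
        import torch
    except (ImportError, OSError):
        # torch is not installed, or its DLLs failed to load.
        return None
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,               # None on CPU-only builds
        "cudnn": torch.backends.cudnn.version(),  # e.g. 8700; None if cuDNN is unavailable
        "gpu": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }


print(report_versions())
```

If the cudnn value does not change after copying the DLLs, the new files are probably not the ones being loaded.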
@sdbds besides the installation, shouldn't you also make changes in the code, like calling torch.compile(model)?
No, I just updated the versions; that alone is useful.
Help me:
epoch 1/2
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
D:\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
Ignore it, it doesn't matter.
I tried this and got a speedup of almost 2x.