kohya-ss / sd-scripts

[Enhancement] install torch==2.0.0+cu118 torchvision==0.15.1+cu118 xformers==0.0.17rc482 #326

Open sdbds opened 1 year ago

sdbds commented 1 year ago

I tried this and the speed went up to almost 2x.

neojam commented 1 year ago

Hmm.. just tried it on Windows 10 with a 3090 Ti and I see a slight improvement (ca. 1.43x).

torch 2.0.0+cu118, cuda 11.8, cudnn 8700

epoch 1/2
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
(the two lines above are printed four times)
D:\AI\sd-scripts\venv\lib\site-packages\xformers\ops\fmha\flash.py:338: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly.  To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()
steps:  50%| 150/300 [03:11<03:11,  1.28s/it, loss=0.134]
epoch 2/2
steps: 100%| 300/300 [06:07<00:00,  1.22s/it, loss=0.121]

torch 1.12.1+cu116, cuda 11.6, cudnn 8302:

epoch 1/2
steps:  50%| 150/300 [04:28<04:28,  1.79s/it, loss=0.134]
epoch 2/2
steps: 100%| 300/300 [08:45<00:00,  1.75s/it, loss=0.121]

If anyone wants to try it on Windows:

CD /D "D:\AI\sd-scripts"
git pull

.\venv\Scripts\activate

pip install torch==2.0.0+cu118 torchvision==0.15.1+cu118 --extra-index-url https://download.pytorch.org/whl/cu118
pip install --use-pep517 --upgrade -r requirements.txt
pip install -U -I --no-deps https://files.pythonhosted.org/packages/d6/f7/02662286419a2652c899e2b3d1913c47723fc164b4ac06a85f769c291013/xformers-0.0.17rc482-cp310-cp310-win_amd64.whl
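
For a quick sanity check that the new versions are actually the ones active inside the venv, something like this minimal Python sketch can be run (the exact cuDNN build number depends on which DLLs torch picks up):

import torch
import xformers

print("torch   :", torch.__version__)               # expect 2.0.0+cu118
print("cuda    :", torch.version.cuda)              # expect 11.8
print("cudnn   :", torch.backends.cudnn.version())  # e.g. 8700 (or 8800 after swapping in newer DLLs)
print("xformers:", xformers.__version__)            # expect 0.0.17rc482
print("gpu     :", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "no CUDA device")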

As you can see above, the new torch produces an error about the triton module, but the script/training still works. If you try to install triton, you'll get an error:

(venv) D:\AI\sd-scripts>pip install triton
ERROR: Could not find a version that satisfies the requirement triton (from versions: none)
ERROR: No matching distribution found for triton

So it looks like triton is not available for Windows. I guess one has to ignore the triton errors for now.

sdbds commented 1 year ago

Hmm.. just tried it on Windows 10 with a 3090 Ti and I see a slight improvement (ca. 1.43x). […]

I recommend setting batch_size to an even number. I used a 3070 Ti with AdamW8bit and 12,000 steps and got 3 it/s at the highest speed.

neojam commented 1 year ago

I recommend setting batch_size to an even number. I used a 3070 Ti with AdamW8bit and 12,000 steps and got 3 it/s at the highest speed.

So you had 1.5 it/s with the same settings and the same dataset before using torch 2.0? To be clear, are we really talking about "it/s" and not "s/it"? (Kohya's script shows "s/it" when running, so the lower the number, the better. Also, you have to test on the same dataset, with the same settings and the same batch size, to be able to draw any conclusions about speed, because a lower s/it value alone means nothing if a different batch size was used.)

The LoRA training test that I posted above was done with the same dataset (num train images * repeats / 学習画像の数×繰り返し回数: 1500) and train_batch_size=5. I also used DAdaptation for the above test, with:

optimizer_type = "DAdaptation"
resolution = "768,768"
cache_latents = true
enable_bucket = true
save_precision = "fp16"
save_every_n_epochs = 1
train_batch_size = 5
xformers = true
max_train_epochs = 2
max_data_loader_n_workers = 4
persistent_data_loader_workers = true
mixed_precision = "fp16"
learning_rate = 1.0
lr_scheduler = "cosine"
unet_lr = 1.0
text_encoder_lr = 1.0
network_module = "networks.lora"
network_dim = 128
network_alpha = 128.0

I tested the other batch sizes on the same dataset with torch 2.0.0+cu118, and the fastest run to finish was with batch size 5:

train_batch_size=2 (DAdaptation)

steps:  50%| 375/750 [04:37<04:37,  1.35it/s, loss=0.133]
steps: 100%| 750/750 [08:56<00:00,  1.40it/s, loss=0.133]

train_batch_size=3 (DAdaptation)

steps:  50%| 253/506 [03:48<03:48,  1.11it/s, loss=0.134]
steps: 100%| 506/506 [07:24<00:00,  1.14it/s, loss=0.131]

train_batch_size=4 (DAdaptation)

steps:  50%| 189/378 [03:24<03:24,  1.08s/it, loss=0.138]
epoch 2/2
steps: 100%| 378/378 [06:33<00:00,  1.04s/it, loss=0.119]

train_batch_size=5 (DAdaptation)

steps:  50%| 150/300 [03:10<03:10,  1.27s/it, loss=0.134]
epoch 2/2
steps: 100%| 300/300 [06:05<00:00,  1.22s/it, loss=0.121]

train_batch_size=6 (DAdaptation)

steps:  50%| 128/256 [04:30<04:30,  2.11s/it, loss=0.135]
epoch 2/2
steps: 100%| 256/256 [08:39<00:00,  2.03s/it, loss=0.127]

With AdamW8bit (r.768,768#o.AdamW8bit#s.cosine#d.128#a.128#l.3e-4#u.3e-4#t.4.5e-5) and batch size 5, the training runs a bit faster than with DAdaptation, though I stopped using AdamW8bit for LoRA training since I get better results with DAdaptation. Batch size 5 looks like the fastest here as well:

train_batch_size=4 (AdamW8bit)

steps:  50%| 189/378 [02:52<02:52,  1.10it/s, loss=0.137]
epoch 2/2
steps: 100%| 378/378 [05:28<00:00,  1.15it/s, loss=0.116]

train_batch_size=5 (AdamW8bit)

steps:  50%| 150/300 [02:44<02:44,  1.09s/it, loss=0.132]
epoch 2/2
steps: 100%| 300/300 [05:12<00:00,  1.04s/it, loss=0.116]

ghost commented 1 year ago

To be clear, it switches to s/it (seconds per iteration) when one iteration takes more than a second. When one iteration takes less than a second, it switches to it/s.

So when you see s/it, your speed is relatively slow, and the higher the number, the worse. For example, 2 s/it is actually 0.5 it/s.

When you see it/s, your speed is faster, and the higher the number, the better.
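
Since the two units are just inverses of each other, and since batch size also matters when comparing runs, the fairest comparison is images per second rather than the raw rate. A minimal Python sketch of that conversion (the helper names are made up for illustration):

def rate_to_it_per_s(rate: str) -> float:
    """Convert a tqdm rate string such as '2.03s/it' or '1.15it/s' to iterations per second."""
    if rate.endswith("s/it"):
        return 1.0 / float(rate[:-4])   # seconds per iteration -> invert
    if rate.endswith("it/s"):
        return float(rate[:-4])
    raise ValueError(f"unrecognized rate: {rate}")

def images_per_second(rate: str, batch_size: int) -> float:
    """Effective throughput: images processed per second at a given batch size."""
    return batch_size * rate_to_it_per_s(rate)

# With the AdamW8bit numbers posted above:
print(images_per_second("1.15it/s", 4))  # batch 4 -> 4.6 images/s
print(images_per_second("1.04s/it", 5))  # batch 5 -> ~4.81 images/s, so batch 5 still comes out ahead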

sdbds commented 1 year ago

So you had 1.5 it/s with the same settings and the same dataset before using torch 2.0? To be clear, are we really talking about "it/s" and not "s/it"? […]

Of course, I'm sure it's it/s. However, the dataset I'm using is 512x512, which is faster than 768x768.

I recommend CUDA 12 and cuDNN 8800, as they give about a 10% speedup. My friend tested it on a 4090 and also saw a doubling of speed: he uses 1024x1024 and previously got 1.6-2 s/it (0.5-0.625 it/s); after switching to the latest torch and xformers he gets 1.25 it/s on the same datasets.

It's worth noting that the maximum speed I observed here only appears at higher epoch counts, such as around epoch 20; it was reached somewhere around epochs 5-10. The speed increase may not be as significant in low-epoch runs.

neojam commented 1 year ago

To be clear, it switches to s/it (seconds per iteration) when one iteration takes more than a second. […]

Oh.. I never saw it switch during my trainings, so I thought "s/it" was always displayed by default... Also, IMHO it should stick to one unit (preferably "it/s"), since the switching is just irritating (as this case demonstrates).

I recommend CUDA 12 and cuDNN 8800, as they give about a 10% speedup. My friend tested it on a 4090 and also saw a doubling of speed.

I'll try it, thanks

EDIT: Tested it. I see no real gains. Looks like only the owners of 4090 cards are getting those crazy 2x speedups from torch 2 and the new CUDA. But the 1.4x speedup I got on my 3090 is not bad as well :)

sdbds commented 1 year ago

EDIT: Tested it. I see no real gains. […]

You need to copy those cudXX.dll files from cuda/bin and cudnn/bin to \venv\Lib\site-packages\torch\lib so they can work.
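
For anyone scripting that copy step, a minimal Python sketch is below; the cuDNN extraction path and the venv path are assumptions, so adjust both to your own setup:

import shutil
from pathlib import Path

# Assumed locations -- change these to where you extracted cuDNN and where your venv lives.
cudnn_bin = Path(r"C:\tools\cudnn\bin")
torch_lib = Path(r"D:\AI\sd-scripts\venv\Lib\site-packages\torch\lib")

# Overwrite the cuDNN DLLs bundled with the torch wheel with the newer ones.
for dll in cudnn_bin.glob("cudnn*.dll"):
    shutil.copy2(dll, torch_lib / dll.name)
    print("copied", dll.name)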

neojam commented 1 year ago

You need to copy those cudXX.dll files from cuda/bin and cudnn/bin to \venv\Lib\site-packages\torch\lib so they can work.

I know, I did that: https://developer.download.nvidia.com/compute/redist/cudnn/v8.8.0/local_installers/ (cudnn_adv_infer64_8.dll, cudnn_adv_train64_8.dll, cudnn_cnn_infer64_8.dll, cudnn_cnn_train64_8.dll, cudnn_ops_infer64_8.dll, cudnn_ops_train64_8.dll, cudnn64_8.dll)

I hope someone else with a 3090 can test and post their findings here.

kgonia commented 1 year ago

@sdbds besides installation, shouldn't you also make changes in the code, like calling torch.compile(model)?

sdbds commented 1 year ago

@sdbds besides installation, shouldn't you also make changes in the code, like calling torch.compile(model)?

No, I just updated the versions; that alone is enough to get the benefit.
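
For reference, opting into the new compiler would look roughly like the sketch below. It is not where the speedup in this thread comes from (no torch.compile call was added to sd-scripts), and the default GPU backend (TorchInductor) relies on Triton, which as noted above could not be installed on Windows at the time.

import torch

# Illustration only -- the gains reported here come purely from upgrading torch/xformers.
model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())  # stand-in model

if hasattr(torch, "compile"):        # available from torch 2.0 onwards
    model = torch.compile(model)     # GPU compilation needs Triton; the CPU path needs a C++ toolchain

out = model(torch.randn(8, 64))      # first call triggers compilation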

buhtigntt commented 1 year ago

Help me:

epoch 1/2
A matching Triton is not available, some optimizations will not be enabled.
Error caught was: No module named 'triton'
D:\kohya_ss\venv\lib\site-packages\xformers\ops\fmha\flash.py:339: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
  and inp.query.storage().data_ptr() == inp.key.storage().data_ptr()

sdbds commented 1 year ago

Help me: epoch 1/2 A matching Triton is not available, some optimizations will not be enabled. Error caught was: No module named 'triton' […]

Ignore it, it doesn't matter.