kohya-ss / sd-scripts

Apache License 2.0

Still very poor RTX 4090 performance #834

Open etha302 opened 1 year ago

etha302 commented 1 year ago

As the title says, we are still experiencing very slow performance on RTX 4090 cards (and probably some other cards too). We would really like to find a solution to this, which is why I am opening this topic. Any discussion, tips, etc. are welcome.

etha302 commented 1 year ago

> Which NVidia driver are you using?
>
> There was a performance issue in versions after 531 for SD stuff, although that may have been fixed at this point.
>
> If you're on a driver version post-531, have you tried rolling back to an earlier version to see if that fixes it?
>
> I'm also on a 4090 and don't have any speed issues right now.
>
> (Re: https://old.reddit.com/r/StableDiffusion/comments/15mv5ju/psa_avoid_updating_to_nvidias_53699_drivers/)

Hi, I tried all the drivers I could possibly think of. My speed on SDXL LoRA training is currently 1.45s/it, which some people may consider fast. But it's really not for a 4090, because a 3090 easily gets around 1.2s/it. May I ask what speed you are getting?

DarkAlchy commented 1 year ago

See, this is where I have a lethally serious issue when people say they have no speed issues with a 4090, because they really do, though they don't know it. This is the problem when you know things and most don't. Fact is, a 4090 is 57% faster than a 3090 hardware for hardware, BUT my friend's 3090 is slightly faster than my 4090.

For me, if I go past 531.79, forget it. Nvidia has acknowledged the issue, which affects Ada-based cards, for a future update. It has to do with their memory management.

Then there is the speed issue with the Kohya scripts themselves that I showed. I'm not going back through all that again, as this subject is annoying me beyond belief, because people don't know they have issues, or don't care because it is fast enough for them. Yeah, if I purchase a Bugatti and it does 100mph, that is fast, but not what I paid for.

etha302 commented 1 year ago

> See, this is where I have a lethally serious issue when people say they have no speed issues with a 4090 […]

Nothing to add here! Exactly what you just said. Edit: also, if you train daily, like me and probably you as well, it gets really slow, and with 57% left on the table, I'm definitely not having this.

DarkAlchy commented 1 year ago

> See, this is where I have a lethally serious issue when people say they have no speed issues with a 4090 […]
>
> Nothing to add here! Exactly what you just said […]

I train for hours on end, so exactly. Pretty ticked I spent this amount of money to get the Jensen rub down.

etha302 commented 1 year ago

> See, this is where I have a lethally serious issue when people say they have no speed issues with a 4090 […]
>
> Nothing to add here! Exactly what you just said […]
>
> I train for hours on end, so exactly. […]

Same here. But I hope that if enough people become aware of the poor performance, maybe we can find a solution. I was already thinking about selling the 4090 and just getting a 3090 instead, but I'll wait and hopefully something happens. Sadly, I'm not experienced enough with Python or coding in general to help here, but I'll definitely keep trying any tips, etc.

DarkAlchy commented 1 year ago

Same, but I will not sell the card; I'm just not happy being taken, as we all were.

Wonderflex commented 1 year ago

Just throwing this comment out here so I can get replies in case we ever come to a solution on how to get things going faster. I rolled back to 531.79 and it moved me to 1.21s/it instead of the 1.13 it was doing with the latest drivers. This put me at 53 minutes instead of 56 minutes for 2600 steps, so not really a change worth noting.

I've been told this could have to do with the network rank being set to 256, but I was just following the tutorial from Aitrepreneur that was yielding them good results. I'll try a lower rank sometime to see what the difference in speed is, but the video said lowering the rank decreases quality while reducing file size.

Lesani commented 9 months ago

On my 4090 I get 2.26s/it, which I think is very slow compared to what I have seen around... No difference between the driver versions I tried; currently I am on the latest (as of today) Studio driver, 546.33, on Win11.

```
accelerate launch --num_cpu_threads_per_process=2 "./sdxl_train_network.py" --enable_bucket --min_bucket_reso=256 --max_bucket_reso=2048 --pretrained_model_name_or_path="F:/SDModels/Stable-diffusion/sdxl/sd_xl_base_1.0.safetensors" --train_data_dir="F:\stable-diffusion\kohya\training\subject_1\img" --reg_data_dir="F:\stable-diffusion\kohya\training\subject_1\reg" --resolution="1024,1024" --output_dir="F:\stable-diffusion\kohya\training\subject_1\model" --logging_dir="F:\stable-diffusion\kohya\training\subject_1\log" --network_alpha="2" --save_model_as=safetensors --network_module=networks.lora --text_encoder_lr=0.0001 --unet_lr=0.0001 --network_dim=32 --output_name="subject-32" --lr_scheduler_num_cycles="8" --no_half_vae --learning_rate="0.0001" --lr_scheduler="constant" --train_batch_size="1" --max_train_steps="7680" --save_every_n_epochs="1" --mixed_precision="bf16" --save_precision="bf16" --seed="1" --caption_extension=".txt" --cache_latents --cache_latents_to_disk --optimizer_type="Adafactor" --optimizer_args scale_parameter=False relative_step=False warmup_init=False --max_grad_norm="1" --max_data_loader_n_workers="1" --bucket_reso_steps=64 --xformers --bucket_no_upscale --noise_offset=0.0 --sample_sampler=dpmsolver++ --sample_prompts="F:\stable-diffusion\kohya\training\subject_1\model\sample\prompt.txt" --sample_every_n_epochs="1"
```

Training is putting barely any load on the GPU, with GPU load hovering at ~15% and spiking up to 50% every few seconds, and memory controller load fluctuating between 1% and 15%.


```
14:31:37-745940 INFO Start training LoRA Standard ...
14:31:37-747940 INFO Checking for duplicate image filenames in training data directory...
14:31:37-749941 INFO Valid image folder names found in: F:\stable-diffusion\kohya\training\subject_1\img
14:31:37-750940 INFO Valid image folder names found in: F:\stable-diffusion\kohya\training\subject_1\reg
14:31:37-751940 INFO Folder {subject folder}: 12 images found
14:31:37-752941 INFO Folder {subject folder}: 480 steps
14:31:37-753940 WARNING Regularisation images are used... Will double the number of steps required...
14:31:37-754940 INFO Total steps: 480
14:31:37-755944 INFO Train batch size: 1
14:31:37-755944 INFO Gradient accumulation steps: 1
14:31:37-756944 INFO Epoch: 8
14:31:37-757944 INFO Regulatization factor: 2
14:31:37-758944 INFO max_train_steps (480 / 1 / 1 * 8 * 2) = 7680
14:31:37-759944 INFO stop_text_encoder_training = 0
14:31:37-760943 INFO lr_warmup_steps = 0
steps: 25%|███████████████ | 1894/7680 [1:11:28<3:38:20, 2.26s/it, avr_loss=0.103]
```
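The `max_train_steps` line in the log is plain arithmetic, and it is worth reproducing when sanity-checking a run. A minimal sketch; the function name and signature are mine, not sd-scripts internals, and the 40 repeats per image is inferred from 480 steps over 12 images:

```python
import math

def max_train_steps(steps_per_epoch: int, batch_size: int,
                    grad_accum: int, epochs: int, reg_factor: int) -> int:
    """max_train_steps = (steps / batch / grad_accum) * epochs * reg_factor."""
    return math.ceil(steps_per_epoch / batch_size / grad_accum) * epochs * reg_factor

# 12 images x 40 repeats = 480 steps per epoch; regularization doubles the total.
print(max_train_steps(480, 1, 1, 8, 2))  # 7680, matching the log line
```

At 2.26s/it, 7680 steps is about 4.8 hours, which matches the ETA in the progress bar.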

DarkAlchy commented 9 months ago

@Lesani I can say you have a terrible bottleneck happening in your system, as the 4090, even on my old B450-based 5600, sees far better than that. Your 4090 should sit at no less than 85%, with a downward spike showing it finished and is waiting for more data from the PC, up to 100% (the PC is waiting on the GPU).

What are your system specs?
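One concrete item in the launch command above that can produce exactly this under-utilization pattern is the data loader configuration: `--max_data_loader_n_workers="1"` leaves a single process feeding the GPU. A hedged suggestion, not a guaranteed fix: raising the worker count and keeping the workers alive between epochs is worth trying. Both flags exist in sd-scripts; the value 4 is illustrative.

```shell
# Possible tweak to the flags in the launch command above:
# more dataloader workers, kept alive across epochs instead of respawned.
--max_data_loader_n_workers="4" --persistent_data_loader_workers
```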

daszzzpg commented 8 months ago

Any update, guys? 4070 Ti here; I can't even get past caching latents. Impossible to train now.

DarkAlchy commented 8 months ago

> Any update, guys? 4070 Ti here; I can't even get past caching latents. Impossible to train now.

Yeah, we got sold a lemon while Jensen is laughing at us. I will say this: I am done with Nvidia until sanity returns, even if that means I can no longer do diffusion stuff. I have 8 years to go, so a lot can change between now and 2032 (as long as this 4090 works until then, and I do have my doubts).

littleyeson commented 6 months ago

> On my 4090 I get 2.26s/it, which I think is very slow compared to what I have seen around […]

Did you fix it?

stepahin commented 6 months ago

Hey, any news here? I get only 2.50s/it on a 4090: Win11, batch size 5, xformers, gradient checkpointing on, bucketing on, default Kohya settings :/

DarkAlchy commented 6 months ago

Nope. That is about all you are going to get with a 4090. Whether a 5090 will be faster, I really don't know. Hardware-wise, the 4090 should have been 35-57% faster in training, but look what we received, and the 5090 is supposedly this magical 1.5x faster than a 4090. I'll believe it when I see it/s, not s/it.

littleyeson commented 6 months ago

Have you resolved this issue? I'm experiencing the same problem: training is slow on a V100, and my system is running Win11. Training puts barely any load on the GPU, with GPU load hovering at 40-50%.

stepahin commented 6 months ago

I would follow this thread as well. Only 2.50s/it on a 4090 batch size 5 :(

feffy380 commented 6 months ago

Have you already tried disabling the sysmem fallback?

DarkAlchy commented 6 months ago

> Have you already tried disabling the sysmem fallback?

That has not one thing to do with this issue. Not a single thing. By the way, it is now part of the global settings and is set to not use system memory, because when it did, it was 200 seconds, or more, per iteration. This is a known issue for the 4090 that makes us shake our heads and want to gut-punch Jensen, since it should have been 57% faster than a 3090 on hardware alone. Something isn't right, and I don't think they will make it right until they are secure enough to know that fully utilizing the extra 57% will not cut into the profits of their business lineup.

kohya-ss commented 6 months ago

I got about 1.7s/it for 1024x1024, xformers, batch size=5, U-Net and Text Encoder trained, dim (rank) 32, no conv dim. GPU utilization fluctuates a little, but the average is around 85%. The driver version is 528.24 (I haven't updated since I confirmed the driver worked well).

I'm not sure why GPU utilization remains low...
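Note that the s/it figures traded in this thread are only comparable at the same batch size. Normalizing to images per second (my own arithmetic, using the numbers reported above) shows how large the gap actually is:

```python
# Normalize reported training speeds to images/second so runs with
# different batch sizes can be compared. Figures come from this thread.
def imgs_per_sec(sec_per_it: float, batch_size: int) -> float:
    """Throughput in images per second for a given step time and batch size."""
    return batch_size / sec_per_it

kohya_run = imgs_per_sec(1.7, 5)    # ~2.94 img/s (1.7 s/it at batch 5)
lesani_run = imgs_per_sec(2.26, 1)  # ~0.44 img/s (2.26 s/it at batch 1)
print(f"{kohya_run:.2f} vs {lesani_run:.2f} img/s, {kohya_run / lesani_run:.1f}x apart")
```

So the batch-5 run above has roughly 6-7x the throughput of the batch-1 run, even though the raw s/it numbers look close.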

DarkAlchy commented 6 months ago

It is faster in Linux but that is due to how Linux works.

feffy380 commented 6 months ago

> > Have you already tried disabling the sysmem fallback?
>
> That has not one thing to do with this issue. Not a single thing.

It's a well-documented issue that slows down SD due to overly aggressive memory swapping, and I was just trying to rule it out. But since you apparently know so much better, have fun solving this on your own.

DarkAlchy commented 6 months ago

> > Have you already tried disabling the sysmem fallback?
> >
> > That has not one thing to do with this issue. Not a single thing.
>
> It's a well-documented issue that slows down SD due to overly aggressive memory swapping, and I was just trying to rule it out. […]

I think that was brought up previously, which is why I said what I said. I see so many people respond to threads on the internet without taking the time to check whether what they are about to mention was already ruled out. The fact is, this is as fast as we are going to be allowed to go, and there is absolutely nothing we can do to change our speed, unless you hack the Nvidia drivers or manage to get Jensen to open things up. Hardware-wise, the 4090 is way underutilized.

As to your second response: I actually do know, and I have said so on this thread numerous times while people are still chasing the dragon's tail. The best we can hope for is that drivers come along that let us use 100% of the hardware the 4090 has, but that will come either via a hack or from Nvidia themselves.

If you have a 4090, turn that setting on and off, then watch the it/s when it spills over to system RAM. We aren't talking 1-2 s more per iteration; we are talking magnitudes slower (100-200+ seconds per iteration). Part of the slowness, as I mentioned before, is Windows, which accounts for about 10-20%, so if you want to speed things up some, use Linux. The card is still limited, but at least the OS is not limiting it even more.

AiEzra commented 6 months ago

Same issue here: 4.61s/it training in Kohya_SS. NVIDIA driver = Game Ready 551.86.

SYSTEM SPEC: RTX 4090 FE, Ryzen 9 3900X, MSI B450 motherboard, 128GB DDR4 3600MT/s, 1TB NVMe boot, 4TB NVMe storage.

Not the performance I hoped to get from such an expensive card.

I get much faster results generating images in Stable Diffusion, but training in Kohya_ss is so slow!

DarkAlchy commented 6 months ago

> Same issue here: 4.61s/it training in Kohya_SS […]

Don't blame Kohya, though, and training is always way slower than generating, no matter what. I agree we got ripped off, but if this helps you feel any better: I had a 3090 Ti owner use my .json yesterday. Their speed was ~6.25s/it, my speed about 4.28s/it, and both of us are on Windows.

AiEzra commented 6 months ago

Yeah that's a good point.

I'm in Stable Diffusion now generating images from my LoRA, and I'm getting 7.96it/s on average, which is better. Although I've read online of people getting 40+ it/s on an RTX 4090, which leads me to believe I'm doing something wrong.

What sort of speeds are you guys getting?

Thanks for the words of encouragement though @DarkAlchy!

AiEzra commented 6 months ago

Quick update: I've just set Kohya_SS off to train overnight and... well, it's training significantly faster than last time.

BEFORE = 4.61s/it AFTER = 1.78it/s

This is a significantly faster result. I'm using the exact same settings (double-checked) and the same number of images at the same resolution (1024,1024); everything is identical!

Really strange behavior, if I get to the bottom of this performance increase I'll update this post below.

DarkAlchy commented 6 months ago

On W10, with a 5600 and 48GB of 3200MHz DDR4, I'm at 7.5-ish it/s in Comfy and about 1 it/s slower in A1111.

kohya-ss commented 6 months ago

> AFTER = 1.78it/s

It's very similar to mine. I don't think I have made any updates recently that would affect the training speed. I'm sure there is a cause somewhere, but it is very strange.

DarkAlchy commented 6 months ago

> > AFTER = 1.78it/s
>
> It's very similar to mine. I don't think I have made any updates recently that would affect the training speed. […]

It is in the pipeline across all trainers. Some of this is that we are starving the 4090 (the reason Linux is a bit faster). I know this because an Intel CPU will generate faster than an AMD Ryzen, because Python is not truly multicore (the GIL). Python is doing away with the GIL, but that is another 4 to 5 years per the Python org. When that happens, AMD will be king, because AMD is faster at multithreading whereas Intel is faster single-core. Since Python is effectively single-core (you can watch it dance around the cores with HWiNFO etc. as it gens/trains), Intel is the best CPU to pair with a 4090, but it really isn't enough to warrant going all-Intel, for me at least.

AiEzra commented 6 months ago

> On W10, with a 5600 and 48GB of 3200MHz DDR4, I'm at 7.5-ish it/s in Comfy and about 1 it/s slower in A1111.

Wow that's a really great speed for your hardware, congrats!

I wonder why I'm getting almost the same image-generation speeds in Stable Diffusion as you, despite having a significantly more powerful computer. Any guesses?

Thanks for sharing that info, much appreciated!

AiEzra commented 6 months ago

> > AFTER = 1.78it/s
>
> It's very similar to mine. […]

Thanks for the reply!

I found one change I made which MUST have been causing the drastic difference in speed I described before, and that was the number of repeats I was using while training: 100 repeats = 4.61s/it, 20 repeats = 1.78it/s.

Does this make any sense to you? Is there any logic behind it? My system seems to be stressed the same either way, but one is much faster.

Does it make sense that a smaller number of repeats = faster it/s?
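As an aside, the two reported numbers are in different units, which makes the gap easy to misjudge. Converting both to seconds per iteration (my arithmetic, using the figures above) shows a per-step speedup that repeats alone should not cause, since repeats multiply the number of steps rather than the cost of each step; something else (caching, bucketing, another setting) likely changed too, though that is a guess:

```python
# Compare the two reported speeds in the same unit: seconds per iteration.
before_s_per_it = 4.61        # reported as 4.61 s/it (100 repeats)
after_s_per_it = 1 / 1.78     # reported as 1.78 it/s (20 repeats), ~0.56 s/it
speedup = before_s_per_it / after_s_per_it
print(f"after = {after_s_per_it:.2f} s/it, {speedup:.1f}x faster per step")
```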

AiEzra commented 6 months ago

> > AFTER = 1.78it/s
>
> It is in the pipeline across all trainers. Some of this is that we are starving the 4090 […]

Oooh, interesting, so a top-of-the-food-chain fast Intel CPU would result in faster generation and training compared to AMD, at least for now. Cool, thanks for the info!

AiEzra commented 6 months ago

I'm thinking of dual-booting my system with Linux to help speed things up a bit; has anyone here had any experience with doing this / can share some results?

I'll do some research and update this thread once I've some concrete evidence.

DarkAlchy commented 6 months ago

> I'm thinking of dual-booting my system with Linux to help speed things up a bit […]

I went to Zorin OS; before that it was Ubuntu. Zorin is just better in many respects (way less bloat) and is based on Ubuntu. A 20-minute training run on W10 was about 15-17 minutes on Linux.

AiEzra commented 6 months ago

> I went to Zorin OS; before that it was Ubuntu. […]

Awesome, thanks for the info - really helpful!