lllyasviel / stable-diffusion-webui-forge

GNU Affero General Public License v3.0

Much longer generation time on a 2nd time #1634

Closed SeBL4RD closed 1 month ago

SeBL4RD commented 1 month ago

I don't understand why; maybe it's due to a bad RAM/VRAM purge? When I use flux1-dev-Q8_0.gguf + t5-v1_1-xxl-encoder-Q8_0.gguf, the first generation runs at about 3.75s/it, but when I run a second generation with the same settings, without changing anything, it goes to 15.35s/it.

It's a pity, because the first generation, at 1920x1080, gives good results that are very close to Flux1-dev + T5-XXL fp16.

Ryzen 3700X, 32 GB RAM, RTX 3080 Ti, 970 EVO Plus, Windows 10. Latest version of Forge.

blakejrobinson commented 1 month ago

You can see your VRAM usage in the Task Manager (Performance tab). That will show whether it's related to running out of memory.

SeBL4RD commented 1 month ago

You can see your VRAM usage in the Task Manager (Performance tab). That will show whether it's related to running out of memory.

I constantly monitor my RAM/VRAM, and nothing changes between the two runs: consumption stays the same, and half of my RAM is free, as you can see. Screenshot_257

SeBL4RD commented 1 month ago

I can't use Hires Fix because of this: going from 1152x896 at x1.55 to 1785x1388 takes 35s/it, whereas when it doesn't bug out I can generate at 1080p at 3.7s/it. It makes no sense.
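To put numbers on this, here is a quick sketch of the pixel arithmetic. It assumes the upscaled size is the base size times 1.55, truncated to whole pixels (the truncation is an assumption that happens to match the 1785x1388 quoted above), and `upscaled_size` is just an illustrative helper name:

```python
def upscaled_size(width, height, scale):
    """Scale a resolution and truncate to whole pixels (assumed rounding)."""
    return int(width * scale), int(height * scale)


hires_w, hires_h = upscaled_size(1152, 896, 1.55)
hires_pixels = hires_w * hires_h
full_hd_pixels = 1920 * 1080

# Compare the ratio of pixel counts with the ratio of reported iteration times.
pixel_ratio = hires_pixels / full_hd_pixels
time_ratio = 35 / 3.7

print(f"Hires Fix target: {hires_w}x{hires_h} ({hires_pixels} pixels)")
print(f"Pixels vs 1080p: x{pixel_ratio:.2f}, time per iteration: x{time_ratio:.1f}")
```

The Hires Fix target has only about 20% more pixels than 1080p, yet it takes roughly nine times longer per iteration, so resolution alone does not seem to explain the slowdown.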

blakejrobinson commented 1 month ago

Does that say 13 GB of VRAM used? If so, that looks like far too little VRAM to have the Q8 Flux, Q8 T5, CLIP and VAE loaded in VRAM. Mine, for example, takes up 18-19 GB.

If Forge detects that you don't have enough VRAM, it will swap things in and out of it, which takes time.

SeBL4RD commented 1 month ago

Sometimes, after I force-stop one or more generations, the it/s goes back to normal... ???

Screenshot_258

blakejrobinson commented 1 month ago

How much VRAM does the 3080ti have? Flux Q8 and T5 Q8 need around 18GB free or it will swap things back and forth between VRAM/RAM between generations which can slow things. You might need the NF4 if you have less VRAM than that for consistent speed/memory.
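As a rough illustration of that budget check, here is a minimal sketch. It uses the ~18 GB estimate above and a 12 GB card as example inputs, and it ignores everything else that occupies VRAM (the desktop, activations, LoRAs), so treat it as back-of-the-envelope arithmetic and not as how Forge actually decides what to offload. `vram_shortfall_gb` is a hypothetical helper name:

```python
def vram_shortfall_gb(required_gb, available_gb):
    """Return how many GB cannot stay in VRAM (0 if everything fits)."""
    return max(0.0, required_gb - available_gb)


required_gb = 18.0   # rough figure quoted above for Flux Q8 + T5 Q8
available_gb = 12.0  # example card size; replace with your GPU's VRAM

shortfall = vram_shortfall_gb(required_gb, available_gb)
if shortfall > 0:
    print(f"About {shortfall:.0f} GB will not fit in VRAM and has to be swapped with system RAM")
else:
    print("Everything should fit in VRAM")
```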

SeBL4RD commented 1 month ago

How much VRAM does the 3080ti have? Flux Q8 and T5 Q8 need around 18GB free or it will swap things back and forth between VRAM/RAM between generations which can slow things. You might need the NF4 if you have less VRAM than that for consistent speed/memory.

12GB, but then why does it sometimes work so well and sometimes not?

SeBL4RD commented 1 month ago

It's exactly the same with Q4_K_S and t5 Q4_K_S, there's got to be a problem somewhere.

wardensc2 commented 1 month ago

That's because Forge does not free all the VRAM after the first generation, or does not reuse the VRAM it has freed. I'm also experiencing this with the latest update of Forge. For lack of VRAM, the second generation runs out of memory and falls back to RAM, which makes generation 3-4 times slower.

lllyasviel commented 1 month ago

Hi, do you have the full console logs?

SeBL4RD commented 1 month ago

Hi, do you have the full console logs?

I could make you one. I'll do it as soon as possible.

SeBL4RD commented 1 month ago

Hi, do you have the full console logs?

Full log: https://pastebin.com/JaF3wGa5

As you can see, the first image generates normally (3.6s/it). The second one is slower and stabilizes at 9s/it, so I interrupt it. The third image comes back to 3.6s/it.

Etc. Screenshot_259

lllyasviel commented 1 month ago

  1. Are you using a PC with both HDD and SSD?
  2. What happens if you do not use a LoRA?

SeBL4RD commented 1 month ago

  1. Are you using a PC with both HDD and SSD?
  2. What happens if you do not use a LoRA?

Yeah, I have 3 SSDs and 2 HDDs: 970 EVO Plus 1 TB, 860 EVO 500 GB, 870 EVO 1 TB, Barracuda 1 TB, WD Blue 2 TB. I don't do sh*t like putting pagefile.sys or other paging files on an HDD, if that's the question. Only C: (an SSD) has one.

I will try without a LoRA.

lllyasviel commented 1 month ago

I added a possible fix. Update, try again, and post the full console log, if possible without a LoRA.

SeBL4RD commented 1 month ago

I added a possible fix. Update, try again, and post the full console log, if possible without a LoRA.

(Before your update): Hmm, actually, without a LoRA, I don't have this problem. I've done 5 in a row at 3.6s/it.

I will update and let you know.

lllyasviel commented 1 month ago

By the way, another person's problem was solved in https://github.com/lllyasviel/stable-diffusion-webui-forge/issues/1630; hopefully you will have some luck too.

lllyasviel commented 1 month ago

Oh wait, if you do not have the problem when not using LoRAs, then it is a completely different problem.

You should try lowering "GPU weights" a bit and trying again. You should be able to find a value for that LoRA that works in 100% of cases; then tell me the number that works.
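The trial-and-error described here can be sketched as a simple step-down search. Everything in this snippet is hypothetical: the starting value, the step, and especially `fake_is_stable`, which stands in for "run several generations at this setting and check that the speed stays normal". It is not part of Forge.

```python
def find_stable_gpu_weights(start_mb, step_mb, min_mb, is_stable):
    """Lower the value step by step until is_stable(value) returns True.

    Returns the first stable value, or None if none is found above min_mb.
    """
    value = start_mb
    while value >= min_mb:
        if is_stable(value):
            return value
        value -= step_mb
    return None


def fake_is_stable(value_mb):
    """Hypothetical stand-in for a manual test of several generations."""
    return value_mb <= 10800  # pretend anything at or below 10800 MB is stable


result = find_stable_gpu_weights(
    start_mb=11000, step_mb=100, min_mb=8000, is_stable=fake_is_stable
)
print(result)
```

In practice each `is_stable` check is a manual run of a few images, so a coarse step (for example, roughly the size of the LoRA) keeps the number of tries small.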

SeBL4RD commented 1 month ago

Oh wait, if you do not have the problem when not using LoRAs, then it is a completely different problem.

You should try lowering "GPU weights" a bit and trying again. You should be able to find a value for that LoRA that works in 100% of cases; then tell me the number that works.

It seems to have worked. I know my LoRA is less than 200 MB, so I lowered GPU weights by 200 MB and it's fine: I get 3.6s/it all the time. Thanks :) !

lllyasviel commented 1 month ago

So the number is 200 MB?

SeBL4RD commented 1 month ago

So the number is 200 MB?

I can only say that in my case, it worked x)

lllyasviel commented 1 month ago

Update and try again; this time you should not need to drop that 200 MB.

SeBL4RD commented 1 month ago

Update and try again; this time you should not need to drop that 200 MB.

Seems to be working well.

lllyasviel commented 1 month ago

Good

SeBL4RD commented 1 month ago

Good

Thanks again! I'll be able to do without nf4 :)