dummyapps opened 1 day ago
On what GPU?
FasterCache may be the culprit as it makes compiling more complex due to the input changing during sampling.
Just ran a test on a 4090 after all these recent updates to compare, using 1.5 I2V, sdpa, no FasterCache: Sampling 53 frames in 13 latent frames at 720x480 with 25 inference steps
Uncompiled:
100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [01:01<00:00, 2.47s/it]
Allocated memory: memory=0.011 GB
Max allocated memory: max_memory=11.855 GB
Max reserved memory: max_reserved=12.625 GB
Prompt executed in 72.98 seconds
Compiled:
Sampling 53 frames in 13 latent frames at 720x480 with 25 inference steps
100%|██████████████████████████████████████████████████████████████████████████████████| 25/25 [00:50<00:00, 2.00s/it]
Allocated memory: memory=0.011 GB
Max allocated memory: max_memory=11.209 GB
Max reserved memory: max_reserved=11.719 GB
Prompt executed in 61.90 seconds
So the sampling speed increase is about 20%.
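For reference, the ~20% figure can be sanity-checked against the logs above. The per-step and end-to-end gains differ a bit, since the total "Prompt executed" time also includes non-sampling work (model load, VAE decode, etc.):

```python
# Numbers copied from the benchmark logs in this thread (4090, 1.5 I2V, sdpa).
uncompiled_s_per_it = 2.47
compiled_s_per_it = 2.00

# Per-step time reduction and throughput gain from compiling:
step_time_reduction = (uncompiled_s_per_it - compiled_s_per_it) / uncompiled_s_per_it
throughput_gain = uncompiled_s_per_it / compiled_s_per_it - 1
print(f"per-step time reduction: {step_time_reduction:.1%}")  # 19.0%
print(f"throughput gain:         {throughput_gain:.1%}")      # 23.5%

# End-to-end reduction is smaller because of the fixed non-sampling overhead:
end_to_end_reduction = (72.98 - 61.90) / 72.98
print(f"end-to-end reduction:    {end_to_end_reduction:.1%}") # 15.2%
```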
Kijai, you are right, FasterCache is the culprit.
My GPU is a 4060 Ti 16G. With FasterCache disabled, enabling Triton increases speed by about 20%, from 10.x to 7.x s/it. With FasterCache enabled, there is no speed difference whether Triton is on or not.
Even so, FasterCache seems better for me: it's faster by about 200% and gives better video output.
I need to look into getting it to work with FasterCache; there's probably some way. I think I had it working at one point, but it requires increasing the cache size limit a lot, and compilation becomes really slow.
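In case it helps anyone experimenting: the "cache size limit" here refers to how many specialized graphs `torch.compile` keeps per function before it gives up and falls back to eager (`torch._dynamo.config.cache_size_limit`, default 8 in most torch 2.x releases). Since FasterCache changes the model's inputs during sampling, each variation can trigger a recompile. A rough sketch of the workaround, not tested with this repo's nodes, with the limit value being a guess to tune:

```python
import torch

# FasterCache feeds the model varying inputs across sampling steps, so each
# variation can trigger a dynamo recompile; with the default limit (8) the
# compiler silently falls back to eager. Raise it before compiling:
torch._dynamo.config.cache_size_limit = 64  # assumed value; tune per model

model = ...  # the transformer from the loader node
model = torch.compile(model)  # slow first steps while each graph compiles
```

This matches the observed tradeoff: more cached graphs means more memory and a much longer warm-up, which is presumably why compilation "becomes really slow".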
Look forward to your progress.
Kijai, do you have Alipay or WeChat? I would like to donate to your project; you contribute really great projects, but I have no PayPal account.
Not to jump in and sidetrack, but just an FYI: there are rumors that using FasterCache (I use it too, it shaves 2+ minutes off the gens!) forces fp8 on some of the effects it later calculates and ruins the videos.
I'm still researching this, but from the looks of it, it's a double-edged sword. It does make things faster, but it supposedly breaks fp16/fp32 math at some point, and people have been saying not to use it for full quality and to eat the extra time instead.
Not sure if the speedup is worth the effort because of this. I'm still deep into researching it, since everything is brand new to all of us, and NVIDIA's driver changes actually modify the way this is handled too.
Just a heads up in case you have better places to spend your coding time! Will update if I find anything crazy about this.
Rumors? That's simply untrue; it does not change the precision. It caches at the main precision you have selected. The quality loss comes naturally from using cached hidden_states instead of actually calculating them, which is also where the speed increase comes from, and why it uses more memory.
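To illustrate the tradeoff being described, here's a toy sketch (my own illustration, not the actual FasterCache code): on reuse steps the cached output is returned as-is, at whatever precision it was computed in, so the speedup comes from skipped compute and the quality loss from staleness, not from any dtype change.

```python
# Toy sketch of output caching across sampling steps (not FasterCache itself).
def expensive_block(x):
    return x * 2 + 1  # stand-in for a transformer block's forward pass

cache = {}

def cached_block(x, step, reuse_every=2):
    if step % reuse_every == 0 or "out" not in cache:
        cache["out"] = expensive_block(x)  # full compute, stored as-is
    return cache["out"]  # reuse steps return the stale value and skip compute

# Input drifts each step, but odd steps reuse the previous step's output:
outs = [cached_block(float(s), s) for s in range(4)]
print(outs)  # [1.0, 1.0, 5.0, 5.0] -- reused values lag the true ones
```

The reused values are approximations of what a full recompute would give, which is exactly where the quality difference comes from.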
Oh, is that what it is? Hmm, well, this is different from what I read. Thank you for the extra info, I'll research this!
I saw it compiled; it can give a 20% performance increase on Flux, but it seems to have no effect on CogVideo 1.5. The quantization is fp8 and FasterCache is enabled.