andreasjansson opened this PR 3 weeks ago
Very excited for this PR. Thanks for doing this! I was looking for an H100 LoRA inference provider and this seems like it would do the trick. I'm also curious whether pricing for fast generations would differ from the per-image rate, since the GPU usage time is much lower?
Hi, is work continuing on this PR? I believe this should make Flux dev LoRA inference fast enough that you'd win customers from Fal: they're currently a few seconds faster, but from my benchmarking this approach should end up faster. Happy to take on any tasks as well.
Had to fix some bugs in the original LoRA loading code.
Outputs are here: https://replicate.com/replicate-internal/test-flux-dev-lora
Fusing and unfusing is also slightly lossy, so the model slowly degrades over repeated LoRA swaps. We could do what PEFT does and add a new node that applies the LoRA matmul on the fly instead of fusing, but that would slow down inference. Curious if you have ideas @daanelson
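For reference, a minimal sketch of the unfused approach, in the style of PEFT's non-merged LoRA layers (the class and parameter names below are illustrative, not from this PR):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen Linear and applies the LoRA delta on the fly,
    so the base weights are never mutated by fuse/unfuse round-trips."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.requires_grad_(False)  # base weights stay pristine
        self.lora_a = nn.Parameter(torch.zeros(rank, base.in_features))
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The extra low-rank matmul per call is the inference-speed cost
        # mentioned above, traded for avoiding accumulated fusing error.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)
```

Swapping adapters would then just mean replacing `lora_a`/`lora_b`, with no subtraction from fused weights and no degradation over time; the downside is the extra matmul on every forward pass.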