Open shomerYu opened 2 days ago
The node should have a choice for the load device. Merging is slow on CPU but should not take long at all on GPU (main_device); it takes about 1-2 seconds for me on a 4090.
It's not necessary to fuse it unless you also want to use torch.compile. In that case you'd set the strength to (1 / rank), which would be 0.0039 for rank 256.
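For reference, the strength value is just this arithmetic. It's a minimal sketch, and `fuse_strength` is a hypothetical helper name for illustration, not something from the node or the workflow:

```python
def fuse_strength(rank: int) -> float:
    """Return 1 / rank, the strength suggested above when fusing for torch.compile.

    Hypothetical helper for illustration only; not part of any node's API.
    """
    if rank <= 0:
        raise ValueError("rank must be a positive integer")
    return 1.0 / rank


# Rank 256 gives 1 / 256 = 0.00390625, which rounds to the 0.0039 quoted above.
print(round(fuse_strength(256), 4))
```

For other ranks, pass that rank instead, e.g. `fuse_strength(128)` gives 0.0078125.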
Hi!
I'm using your workflow with LoRA dimension (orbit left). The "Merging rank 256 LoRA weights" step takes forever (I'm on an A100). Is there a way to speed up the process?