maszhongming / Multi-LoRA-Composition

Repository for the Paper "Multi-LoRA Composition for Image Generation"

Worse results? #7

Open catboxanon opened 5 months ago

catboxanon commented 5 months ago

Hi, I'm a maintainer of the Stable Diffusion webui. I tried implementing the composite method outlined in the paper and this repo, but it seems to produce worse results in all cases I've tested. See https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/15037#issuecomment-2002655419 for reference, which includes sample images and the code I used.

This also seems to introduce some major performance issues, as mentioned in https://github.com/maszhongming/Multi-LoRA-Composition/issues/6, but perhaps that's unavoidable?

maszhongming commented 5 months ago

Hi, I truly appreciate your input and the work you've put into integrating our methods. I'm not familiar with the Stable Diffusion webui codebase myself, but from what you've shared, the combine_denoised function appears to be implemented correctly.
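For reference, the core of LoRA Composite is to collect noise predictions with each LoRA active individually, average them, and then apply classifier-free guidance to the averages. A minimal numpy sketch of that combination step (function and variable names are illustrative, not the webui's or this repo's actual code):

```python
import numpy as np

def composite_guidance(eps_cond_per_lora, eps_uncond_per_lora, guidance_scale=7.5):
    """Combine per-LoRA noise predictions as in LoRA Composite.

    eps_cond_per_lora / eps_uncond_per_lora: arrays of shape (k, ...) holding
    the conditional and unconditional noise predictions obtained with each
    of the k LoRAs active on its own.
    """
    # Average the predictions across LoRAs, then apply CFG to the averages.
    eps_cond = np.mean(eps_cond_per_lora, axis=0)
    eps_uncond = np.mean(eps_uncond_per_lora, axis=0)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Each of the k LoRAs contributes its own pair of forward passes per step, which is also where the roughly linear cost in k comes from.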

Regarding the shared samples, do you think LoRA Merge looks better because of the style? From my perspective, LoRA Composite's characters and poses align more closely with the reference image in Concept, but its style isn't pronounced enough; LoRA Merge has a better style, but there's a strange vertical bar on the right side of the curtain.

We have observed in both automatic and human evaluations that LoRA Merge shows superior performance when combining two LoRAs, especially when one of them is a "style" LoRA (i.e., the generated image shows distinct "style" features). However, when combining two other types of LoRAs (e.g., character + clothing), or when the number of LoRAs to be combined increases (3-5 LoRAs), LoRA Merge performs significantly worse than our approach. Would you mind testing these scenarios and sharing your findings? I'm also willing to run the same examples under the diffusers and peft codebases to pinpoint potential issues.

As for the inference speed, it's true that LoRA Composite demands more processing time, since each active LoRA contributes its own forward pass per denoising step. The increase should be linear: combining k LoRAs takes roughly k times longer. Although I haven't found a way to optimize this yet, the slowdown shouldn't be as drastic as you've described, from a few seconds to over a minute.

By the way, have you had a chance to try the LoRA Switch method? Its inference time is on par with LoRA Merge, but our evaluations suggest it surpasses both LoRA Composite and Merge in terms of composition quality (Section 3.2 in our paper, observation 2).
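To make the Switch idea concrete: rather than averaging predictions from all LoRAs, it activates exactly one LoRA at a time and rotates through them across denoising steps, which is why its cost matches LoRA Merge. A toy sketch of the rotation schedule (names are illustrative only):

```python
def switch_schedule(lora_names, num_steps):
    """LoRA Switch scheduling sketch: activate one LoRA per denoising
    step, cycling through the list in round-robin order."""
    return [lora_names[t % len(lora_names)] for t in range(num_steps)]
```

Since only one adapter is active on any given step, each step costs the same as a single-LoRA (or merged) forward pass.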

catboxanon commented 5 months ago

Thank you for the in-depth reply!

I will take more time to conduct tests with LoRAs other than style-focused ones (e.g. character + clothing, as you mentioned). The degraded inference speed may be caused by some webui internals, so I may look into that as well.

As for the Switch method -- this will actually be simpler to test, as an extension already exists that facilitates invoking it: https://github.com/cheald/sd-webui-loractl. I may even make a PR extending its syntax to make invoking the Switch method simpler.

maszhongming commented 5 months ago

Thank you for your efforts and for exploring further tests!

Additionally, I strongly recommend testing with a combination of 3-5 LoRAs, as highlighted in our paper. As the number of combined LoRAs grows, vanilla LoRA merge increasingly destabilizes the generation process.
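The intuition is that vanilla merging sums every LoRA's weight delta into the base weights, so the accumulated shift grows with each added LoRA and can push the model far from its base behavior. A toy numpy sketch of that accumulation (illustrative names, plain arrays standing in for real model weights):

```python
import numpy as np

def merge_lora_deltas(base_weight, deltas, scales):
    """Vanilla LoRA merge sketch: add each LoRA's scaled weight delta
    into the base weight. With many LoRAs, the summed deltas can drift
    the merged weights far from the base."""
    merged = base_weight.copy()
    for delta, scale in zip(deltas, scales):
        merged += scale * delta
    return merged
```

With 3-5 LoRAs, either the summed deltas grow large or each scale must be reduced, weakening every individual LoRA's effect; Composite and Switch avoid this by never modifying the weights.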

Regarding the degraded inference speed you've observed: while I'm not familiar with the webui's internals, comparing against how diffusers manages active adapters might shed light on potential optimizations or differences.

Glad to hear the Switch method is simpler to test in your setup. Your work in extending the syntax for easier invocation is greatly appreciated.