hotshotco / Hotshot-XL

✨ Hotshot-XL: State-of-the-art AI text-to-GIF model trained to work alongside Stable Diffusion XL
https://hotshot.co
Apache License 2.0

How to use SDXL finetunes and LoRAs #21

Closed godimarcovr closed 8 months ago

godimarcovr commented 8 months ago

Hello, thanks for this amazing project. You mention in the README:

Hotshot-XL can generate GIFs with any fine-tuned SDXL model. This means two things:

  1. You’ll be able to make GIFs with any existing or newly fine-tuned SDXL model you may want to use.
  2. If you'd like to make GIFs of personalized subjects, you can load your own SDXL based LORAs, and not have to worry about fine-tuning Hotshot-XL. This is awesome because it’s usually much easier to find suitable images for training data than it is to find videos. It also hopefully fits into everyone's existing LORA usage/workflows :) See more here.

I can't figure out how to do point 1: which argument selects a fine-tuned SDXL model? As for point 2, I have been following the instructions in the relevant section. Thanks to the latest commit, I can load the UNet from stabilityai/stable-diffusion-xl-base-1.0 in safetensors format, but when I try to use a LoRA I get an error: "The following keys have not been correctly be renamed", followed by a list of state_dict keys with prefixes such as "lora_te1_text_model_encoder", "lora_te2_text_model_encoder" and "lora_unet_". Maybe some renaming is needed?

For reference, I am mainly interested in using SDXL finetunes and LoRAs from civitai.com, such as https://civitai.com/models/131243/robocop, which are in .safetensors format.

Thank you for the great work!

aakashs commented 8 months ago

For 1: this is explained in the README: https://github.com/hotshotco/Hotshot-XL#text-to-gif-with-personalized-loras. Use the --spatial_unet_base="path/to/stabilityai/stable-diffusion-xl-base-1.0/unet" parameter. If you are using the base Hotshot-XL model (not fine-tuned at higher resolutions), we'd recommend using a base U-Net that has been trained at or around 512x512 resolution.
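As a rough illustration of the parameter above, here is a minimal sketch that assembles the command line in Python. Only `--spatial_unet_base` is confirmed in this thread; the script name `inference.py` and the `--prompt`, `--output` and `--lora` flags are assumptions based on the README's examples, and `build_inference_command` is a hypothetical helper, not part of Hotshot-XL.

```python
import shlex


def build_inference_command(unet_dir, prompt, output="output.gif", lora=None):
    """Build an argv list for a Hotshot-XL inference run (hypothetical helper).

    Assumed, not confirmed in this thread: the script name `inference.py`
    and the --prompt / --output / --lora flags.
    """
    argv = [
        "python",
        "inference.py",
        f"--prompt={prompt}",
        f"--output={output}",
        # Confirmed in this thread: path to the `unet` folder of an SDXL model.
        f"--spatial_unet_base={unet_dir}",
    ]
    if lora is not None:
        argv.append(f"--lora={lora}")
    return argv


cmd = build_inference_command(
    unet_dir="path/to/stabilityai/stable-diffusion-xl-base-1.0/unet",
    prompt="a robot walking through a city",
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(cmd))
```

To try a fine-tuned SDXL model, one would presumably pass the `unet` directory of that model as `unet_dir` instead of the base model's.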

We've tested LoRA compatibility with diffusers-format LoRAs. It sounds like the keys mismatch because your LoRAs are in safetensors format? We hope to add support for other LoRA formats soon, and would also greatly appreciate any help via PRs!
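Before attempting any renaming, it can help to check which naming scheme a LoRA file uses. The sketch below is only a diagnostic under stated assumptions: it counts keys by the prefixes reported in the error above (assuming the third one is `lora_unet_`), and it assumes that diffusers-format LoRA keys start with `unet.` or `text_encoder`. The function names and the sample keys are hypothetical.

```python
from collections import Counter

# Prefixes as reported in the error message in this thread (assumption:
# the third prefix is "lora_unet_").
REPORTED_PREFIXES = ("lora_te1_", "lora_te2_", "lora_unet_")
# Assumption: diffusers-format LoRA keys start with one of these.
DIFFUSERS_PREFIXES = ("unet.", "text_encoder.", "text_encoder_2.")


def summarize_lora_keys(keys):
    """Count state_dict keys per known prefix; anything else goes to 'other'."""
    counts = Counter()
    for key in keys:
        for prefix in REPORTED_PREFIXES + DIFFUSERS_PREFIXES:
            if key.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            counts["other"] += 1
    return dict(counts)


def needs_key_renaming(keys):
    """True if any key uses one of the prefixes reported in the error."""
    return any(key.startswith(REPORTED_PREFIXES) for key in keys)


# Hypothetical sample keys, for illustration only.
sample_keys = [
    "lora_te1_text_model_encoder_layers_0_mlp_fc1.alpha",
    "lora_te2_text_model_encoder_layers_0_mlp_fc1.alpha",
    "lora_unet_down_blocks_1_attentions_0_proj_in.lora_down.weight",
]
summary = summarize_lora_keys(sample_keys)
print(summary)
print(needs_key_renaming(sample_keys))
```

For a real file, the key list could be read with the `safetensors` library, for example `safe_open(path, framework="pt")` followed by `.keys()`, and passed to these functions.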