lllyasviel / stable-diffusion-webui-forge

GNU Affero General Public License v3.0

Add support for ViT-L-14 clip alternative #1361

Open FreddyFnafbear opened 3 weeks ago

FreddyFnafbear commented 3 weeks ago

Recently I've seen a lot of recommendations for using ViT-L-14-BEST-smooth-GmP-ft instead of clip_l. However, when I try to load this model in Forge, I get an "AssertionError: You do not have CLIP state dict!". Forge does not seem to recognize this model's state dict format and thus can't load it.
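For context, a hedged guess at the mechanics (the key names below are illustrative, not Forge's actual check): loaders typically sniff the state-dict key names, so a checkpoint saved with the original OpenAI-style keys can fail a check that expects HF-style naming.

```python
def is_hf_clip_text_dict(keys):
    # HF's CLIPTextModel keeps all of its weights under the "text_model." prefix;
    # a loader probing for that prefix will reject OpenAI-style checkpoints.
    return any(k.startswith("text_model.") for k in keys)

hf_keys = ["text_model.embeddings.token_embedding.weight"]
openai_keys = ["visual.conv1.weight", "token_embedding.weight",
               "transformer.resblocks.0.attn.in_proj_weight"]
print(is_hf_clip_text_dict(hf_keys))      # True
print(is_hf_clip_text_dict(openai_keys))  # False
```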

zer0int commented 3 weeks ago

Hey everyone,

I'm the author of ViT-L-14-BEST-smooth-GmP-ft.safetensor @ https://huggingface.co/zer0int/CLIP-GmP-ViT-L-14/tree/main and code @ https://github.com/zer0int/CLIP-fine-tune that was used to fine-tune the model.

I am assuming the issue may be due to 1. fine-tuning with the original OpenAI/CLIP code, 2. then just converting the .pt model to .safetensors without renaming the keys to the syntax HF uses for the model, and 3. not "detaching" the vision transformer (the .safetensors contains the full text+vision transformer model).
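Point 3 ("detaching") can be sketched as a simple key filter - a minimal illustration, assuming the original OpenAI convention that all vision-tower weights sit under the `visual.` prefix (the HF key renaming from point 2 is a separate, larger step):

```python
def detach_text_encoder(state_dict):
    """Keep only text-side entries from a full OpenAI-style CLIP state dict.

    Assumes the OpenAI naming convention: every vision-transformer weight is
    prefixed with "visual.", and "logit_scale" belongs to the joint model.
    """
    drop_prefixes = ("visual.",)
    drop_exact = {"logit_scale"}
    return {
        k: v for k, v in state_dict.items()
        if not k.startswith(drop_prefixes) and k not in drop_exact
    }

# Toy example with placeholder values standing in for real tensors:
full = {
    "visual.conv1.weight": 0,
    "logit_scale": 0,
    "token_embedding.weight": 0,
    "transformer.resblocks.0.attn.in_proj_weight": 0,
}
text_only = detach_text_encoder(full)
print(sorted(text_only))
# ['token_embedding.weight', 'transformer.resblocks.0.attn.in_proj_weight']
```

With real tensors, the remaining keys would then still need renaming to the HF scheme before saving (e.g. via `safetensors.torch.save_file`).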

For my previous model, I supplied a text encoder only HF format version, but it didn't seem like there was a particular interest in that - so I omitted the potential "choice confusion" for my latest model.

It seems there's demand for that after all, judging by this thread. So:

If you could either

  1. Point me to a script for the "absolutely right way to convert original OpenAI/CLIP to HF that works for everything" or
  2. Confirm the above older model "TE only, HF format" works correctly (I just pieced it together based on what an HF CLIP-L looks like, so I can't guarantee it's 100% conforming),

then I would be happy to upload a 'proper' HF model / proper text encoder for this model, and in the future.

As there was also an issue about this with city96/ComfyUI-GGUF, I guess I'd prefer option (1) from somebody familiar with all the potential downstream issues of conversion (or a lack thereof).

Kind regards!

FreddyFnafbear commented 3 weeks ago

Hello @zer0int thanks for the response! While I can't help you with that script, I just tried out the TE only HF format of the older model you mentioned and that one does seem to work correctly. So I would greatly appreciate an upload of the same format for the newer model :-)

zer0int commented 3 weeks ago

@FreddyFnafbear

Here you go (please let me know if you encounter any issues!):

Even if it works fine with this repo's code, there are still uncertainties - e.g. the expected dtype (should I explicitly define a dtype for the converted HF model on a per-component basis, or not?). I would nevertheless appreciate it if somebody could point me to an industry-standard conversion script (if one exists). =)
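To make the dtype question concrete: one common mixed-precision convention (assumed here purely for illustration, not a spec) keeps normalization parameters in float32 for numerical stability while casting the large weight matrices to float16. A per-key dtype policy might look like:

```python
def target_dtype(key, base="float16"):
    """Illustrative per-component dtype policy (a common convention, assumed):
    keep normalization parameters in float32, cast everything else to `base`.
    """
    keep_fp32 = ("layer_norm", "layernorm", "ln_", "final_layer_norm")
    if any(tag in key.lower() for tag in keep_fp32):
        return "float32"
    return base

print(target_dtype("text_model.final_layer_norm.weight"))          # float32
print(target_dtype("text_model.encoder.layers.0.mlp.fc1.weight"))  # float16
```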

MatthewK78 commented 3 weeks ago

Thanks zer0int, for all of your work. 😊👍

I'm a little late posting this, but here it is anyway. I put together some quick code earlier to convert it and do a few other things. It can take the Long version and replace or resize/interpolate the positional embedding down to 77 positions instead of 248. It works, though the output ends up being a little different from the normal one. 🤷‍♂️ I haven't tested it much yet.

https://gist.github.com/MatthewK78/6d946ed5736f3222603411fb80108c41
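The resize/interpolate step can be illustrated independently of the gist: a minimal linear-interpolation sketch in plain Python that maps an old position-embedding table onto a new length (the real code would operate on torch tensors, e.g. with `torch.nn.functional.interpolate`; applied to the Long model's 248×768 table it would yield the standard 77×768).

```python
def resize_positions(table, new_len):
    """Linearly interpolate a list of position-embedding rows (each a list
    of floats) from len(table) positions to new_len positions."""
    old_len = len(table)
    if new_len == old_len:
        return [row[:] for row in table]
    out = []
    for i in range(new_len):
        # Map new index i into the old index range [0, old_len - 1].
        pos = i * (old_len - 1) / (new_len - 1)
        lo = int(pos)
        hi = min(lo + 1, old_len - 1)
        frac = pos - lo
        out.append([a * (1 - frac) + b * frac
                    for a, b in zip(table[lo], table[hi])])
    return out

# Toy check: shrinking 5 positions to 3 keeps the endpoints
# and samples the middle.
table = [[0.0], [1.0], [2.0], [3.0], [4.0]]
print(resize_positions(table, 3))  # [[0.0], [2.0], [4.0]]
```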

zer0int commented 3 weeks ago

@MatthewK78 ...and thank you very much for your work, too! 👍 Your 'quick code' is quite sophisticated; especially the resize/interpolate (as well as all the mismatch handling) could indeed come in handy.

...And I think I'll refer to 'merging' e.g. a TE with a ViT as the model receiving 'donor keys' in the future, haha! I like it - cheers! =)