Closed: salahzoubi closed this issue 9 months ago.
I was scratching my head a lot over this issue a couple of days ago too. To make it work, you have to use the normal image encoder, not the one in the sdxl_models subfolder; I guess no one noticed since most people just copy and paste the example.
I don't know how the two image encoders compare in quality, but the expected shape comes directly from the model, so I couldn't use the image encoder in the sdxl_models subdirectory without digging deeper into the model itself.
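Concretely, the "normal image encoder" here means the ViT-H one under the models/ subfolder of the h94/IP-Adapter repo. A minimal sketch of loading it (float16 is just an assumption):

import torch
from transformers import CLIPVisionModelWithProjection

# This encoder works with the vit-h SDXL adapters; subfolder="sdxl_models/image_encoder"
# would instead load the bigG encoder that triggers the shape error.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
)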
Hi,
Thanks for the issue! Can you provide the image inputs you used so I can try to reproduce this on my end?
Thanks!
YiYi
@yiyixuxu you can try ip-adapter-plus_sdxl_vit-h.safetensors; it causes the same problem.
I was testing this more, since I had just started implementing the IP-Adapters in my code when I commented. I think this is not a problem in the code; it just needs to be clarified in the documentation.
ip-adapter_sdxl.safetensors is the only one of the SDXL IP-Adapters trained with OpenCLIP-ViT-bigG-14, hence it needs the bigger image encoder. For the rest they switched to OpenCLIP-ViT-H-14, so those IP-Adapters can just use the smaller image encoder (see the sketch after this paragraph).
If people are not aware of this and try to use all of them, they will eventually hit this error, even though most won't ever use the larger one, since it requires another large download and there's no visual difference in the results.
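To make the pairing explicit, a small hypothetical lookup; the adapter file names come from the sdxl_models folder of the h94/IP-Adapter repo, while the dict itself is just for illustration:

# Hypothetical mapping: which image-encoder subfolder each SDXL adapter expects.
ENCODER_SUBFOLDER = {
    "ip-adapter_sdxl.safetensors": "sdxl_models/image_encoder",        # OpenCLIP-ViT-bigG-14
    "ip-adapter_sdxl_vit-h.safetensors": "models/image_encoder",       # OpenCLIP-ViT-H-14
    "ip-adapter-plus_sdxl_vit-h.safetensors": "models/image_encoder",  # OpenCLIP-ViT-H-14
    "ip-adapter-plus-face_sdxl_vit-h.safetensors": "models/image_encoder",
}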
I tested the IP adapter model of SDXL according to your method, but I found that the performance was very poor compared to SD1.5.
Just to clarify, this is not my method; it's how the models were trained and how the implementation was made by the awesome contributors to the diffusers library. It's also the same in all the UIs and APIs that use it.
I can't really compare against SD 1.5 quality since I don't use it anymore, but I get good results with the demo images from diffusers and the official ones:
[images: diffusers demo source next to the ip-adapter_sdxl_vit-h result]
[images: official demo source next to the ip-adapter_sdxl_vit-h and ip-adapter-plus_sdxl_vit-h results]
I don't use the base model though; I always select the model that works best with the source image.
How do you load ip-adapter_sdxl_vit-h and ip-adapter-plus_sdxl_vit-h with SDXL? The only one that works for me is ip-adapter_sdxl (the one from the docs):
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.safetensors")
This one, however, gives me the shape error:
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
@BEpresent that's exactly the problem discussed here, and the misunderstanding that isn't explained in the docs: to use any of the vit-h models you have to use the image encoder in the "models" subfolder, not the "sdxl_models" one, so you need to load it and pass it to the pipeline.
import torch
from transformers import CLIPVisionModelWithProjection

# Load the ViT-H image encoder from the "models" subfolder.
image_encoder = CLIPVisionModelWithProjection.from_pretrained(
    "h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16
).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors")
Edit: I don't use the diffusers pipeline myself, so I forgot to add the image_encoder loading part.
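To spell out the "pass it to the pipeline" part, a minimal sketch: the encoder is a pipeline component, so it goes to the pipeline constructor, not to load_ip_adapter. The SDXL base checkpoint here is just an assumed example:

import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    image_encoder=image_encoder,  # the ViT-H encoder loaded above
    torch_dtype=torch.float16,
).to("cuda")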
Thanks, is it recommended to use CLIPVisionModelWithProjection from transformers or from diffusers (the latter does not work for me)?
# import works
from transformers import CLIPVisionModelWithProjection
# import does not work
from diffusers import CLIPVisionModelWithProjection
ImportError Traceback (most recent call last)
Cell In[10], line 1
----> 1 from diffusers import CLIPVisionModelWithProjection
ImportError: cannot import name 'CLIPVisionModelWithProjection' from 'diffusers' (/opt/conda/lib/python3.10/site-packages/diffusers/__init__.py)
Now, continuing with the transformers variant for the moment, I assume one has to pass the image encoder to the diffusers pipeline:
image_encoder = CLIPVisionModelWithProjection.from_pretrained("h94/IP-Adapter", subfolder="models/image_encoder", torch_dtype=torch.float16,).to("cuda")
pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter-plus_sdxl_vit-h.safetensors", image_encoder=image_encoder)
However, this also leads to a shape mismatch:
RuntimeError: mat1 and mat2 shapes cannot be multiplied (514x1664 and 1280x1280)
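For what it's worth, a note on those numbers (my reading, not stated in the thread): 1664 is the hidden size of OpenCLIP-ViT-bigG-14 and 1280 that of ViT-H, so the pipeline is still running the bigG encoder and the image_encoder argument to load_ip_adapter is having no effect. A quick way to check which encoder is actually active:

print(pipeline.image_encoder.config.hidden_size)  # 1664 means bigG, 1280 means ViT-H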
diffusers doesn't have a CLIPVisionModelWithProjection; the import should be from the transformers library.
Your problem is that you're passing the image encoder to the IP-Adapter loader, but you have to pass it to the pipeline. Also, I made a repository with just the vit-h models and the corresponding encoder to make it simpler; you can use it directly if that makes things easier for you:
pipeline.load_ip_adapter("ozzygt/sdxl-ip-adapter", "", weight_name="ip-adapter_sdxl_vit-h.safetensors")
Then the docs indeed need to be updated (https://huggingface.co/docs/diffusers/using-diffusers/loading_adapters), as they import it from diffusers there.
Thanks, you're right; it works when passed to the initial pipeline, not in the pipeline.load_ip_adapter part.
This solves it for me. I guess the docs could highlight the need to switch the image encoder between OpenCLIP-ViT-bigG-14 and OpenCLIP-ViT-H-14 when using SDXL with different IP-Adapters. While it is mentioned in the model card (https://huggingface.co/h94/IP-Adapter), it was not intuitively clear to me just from the diffusers docs.
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Describe the bug
When using IP-Adapters with ControlNets and SDXL (whether SDXL-Turbo or SDXL 1.0), you get a shape mismatch when generating images. If you remove the IP-Adapter, things start working again. Not sure what the problem might be here?
Reproduction
Here's what I'm doing:
Logs
System Info
diffusers version: 0.25.0.dev0
Who can help?
No response