Closed · james77777778 closed this 3 weeks ago
Can you please add the presets and the conversion script? It could be in a follow up PR.
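(For context, once the presets land, loading the ported backbone should be a one-liner along these lines; the preset handle `clip_vit_base_patch32` is a guess, pending the conversion script:)

```python
import keras_hub

# Hypothetical preset handle; the real name depends on how the
# follow-up conversion script registers it.
backbone = keras_hub.models.CLIPBackbone.from_preset("clip_vit_base_patch32")
```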
Sure. Will submit a PR soon. However, the questions remain:
- What kind of `Task` should we introduce? The original `FeatureExtractor` seems somewhat ambiguous to me.
- Should we modify `CLIPPreprocessor` to accept image-text pairs for the new `CLIPBackbone` or the new task? If so, we would need to update the SD3 implementation as well. (A rough sketch of one possible shape follows below.)
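For concreteness, here is a minimal sketch of what "accepting image-text pairs" could look like. The function name, dict keys, and `tokenizer` contract are all assumptions for discussion, not the current `CLIPPreprocessor` API:

```python
import numpy as np

def preprocess_image_text_pairs(images, prompts, tokenizer):
    """Hypothetical image-text preprocessing for the new CLIPBackbone.

    Assumes `tokenizer` maps a list of strings to padded token IDs and
    that `images` are already resized; both dict keys are placeholders,
    not the merged API.
    """
    images = np.asarray(images, dtype="float32") / 255.0  # scale to [0, 1]
    return {"images": images, "token_ids": tokenizer(prompts)}
```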
Related to #1752
Colab demonstrating the prediction of the ported backbone: https://colab.research.google.com/drive/1MgrQ1jq8wcICfoSbxp075wfap2qYADGs?usp=sharing
Preset: `openai/clip-vit-base-patch32` (should work for all CLIP models)

Outputs (probability):

- `CLIPModel`: 99.5% vs. 0.5%
- `CLIPBackbone`: 99.7% vs. 0.3%
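For readers who skip the Colab: the `CLIPModel` reference numbers were presumably produced with the standard transformers zero-shot recipe, roughly as below. The image and prompts here are placeholders, not necessarily the exact Colab inputs.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.png")  # placeholder image
texts = ["a photo of a cat", "a photo of a dog"]  # placeholder prompts

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity logits
probs = logits.softmax(dim=-1)  # probabilities like the 99.5% vs. 0.5% above
```

The ported `CLIPBackbone` is scored the same way from the embeddings it returns (see the scoring sketch at the end of this comment).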
There are some questions about the upcoming task definition:

- What kind of `Task` should we introduce? The original `FeatureExtractor` seems somewhat ambiguous to me. (A sketch of the scoring such a task would wrap is at the end of this comment.)
- Should we modify `CLIPPreprocessor` to accept image-text pairs for the new `CLIPBackbone` or the new task? If so, we would need to update the SD3 implementation as well.

cc @divyashreepathihalli
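As a discussion aid for the first question above, here is a minimal sketch of the zero-shot scoring any such task would wrap around `CLIPBackbone`. The function names, and the assumption that the backbone exposes pooled image/text embeddings plus a learned logit scale, are mine, not the PR's:

```python
from keras import ops

def l2_normalize(x, axis=-1, eps=1e-12):
    # Normalize embeddings to unit length before taking cosine similarity.
    return x / (ops.sqrt(ops.sum(x * x, axis=axis, keepdims=True)) + eps)

def zero_shot_probabilities(image_embeddings, text_embeddings, logit_scale):
    """Cosine similarity between image and text embeddings, scaled by
    the learned logit scale, then softmax over the candidate prompts."""
    image_embeddings = l2_normalize(image_embeddings)
    text_embeddings = l2_normalize(text_embeddings)
    logits = logit_scale * ops.matmul(
        image_embeddings, ops.transpose(text_embeddings)
    )
    return ops.softmax(logits, axis=-1)
```

Whatever the task ends up being called, it is essentially this scoring plus the preprocessor question above.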