huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers

Unify and upload hub weights for unclip text to image and image variation #1915

Closed williamberman closed 1 year ago

williamberman commented 1 year ago

Is your feature request related to a problem? Please describe. We don't currently have a canonical hub upload under the kakaobrain org for the image variation weights, i.e. the tests currently use fusing/karlo-image-variations-diffusers.

Describe the solution you'd like Update the https://huggingface.co/kakaobrain/karlo-v1-alpha weights so they can be used for the image variation pipeline.

This will add feature_extractor and image_encoder, and hence require an _ignore_dirs variable on the text-to-image pipeline so errors aren't logged on pipeline load.
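A minimal sketch of what that class variable could look like; _ignore_dirs is only the name floated in this issue, not an existing diffusers attribute:

from diffusers import DiffusionPipeline

class UnCLIPPipeline(DiffusionPipeline):
    # Hypothetical: subdirectories in a shared hub repo that this pipeline
    # doesn't use, so from_pretrained can skip them without logging errors.
    _ignore_dirs = ["feature_extractor", "image_encoder"]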

averad commented 1 year ago

https://huggingface.co/kakaobrain/karlo-v1-alpha/discussions

williamberman commented 1 year ago

Hey @averad, are you linking to a specific discussion on the hub relevant to this issue? I'm not sure I see what in particular you're linking to, thanks!

williamberman commented 1 year ago

Thoughts on cleanly combining multiple pipelines in the same hub repository:

Note that this is different from combining multiple weight-loading formats/datatypes for the same pipeline, i.e. safetensors/fp16. Multiple weight formats/datatypes can be handled by file naming convention, as illustrated below.
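For example, an illustrative layout (following the filename convention diffusers uses for formats and precision variants; the exact tree is mine, not from the thread):

unet/
    diffusion_pytorch_model.bin               # default fp32 PyTorch weights
    diffusion_pytorch_model.safetensors       # same weights in safetensors format
    diffusion_pytorch_model.fp16.safetensors  # fp16 variant, selected by filename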

Tl;dr: Allow model_index.json to specify multiple config dicts for multiple different pipelines.

Our original thought for combining the two Karlo pipelines in the same repository was to upload all the modules to the same repository, combine all module specifications in the same model_index.json file, and add class variables on both pipelines that would specify which modules to ignore in model_index.json and not load when pulling the repo.

I think this is potentially brittle because it requires both pipelines to have knowledge of which modules the other has. Any time we update one pipeline with a new module, we must go look at all other pipelines that share the same hub repo and manually ignore the new module.

This is manageable in the current case where we only have two pipelines, but adding more pipelines to the same repo could make it a headache. Additionally, it introduces ambiguity into model_index.json, both for present and for potential future config. E.g. currently we add both a _class_name and a _diffusers_version config. We could assume the _diffusers_version applies to all pipelines sharing the same repo and that _class_name is the default pipeline to load when loading from the DiffusionPipeline class, but this starts to become rather implicit config and is asking for trouble imo.

Instead, I think we should allow model_index.json to specify that the hub repo can be used to load multiple pipelines. I.e. the Karlo model_index.json becomes …

{
    "_class_name": "UnCLIPPipeline",
    "_diffusers_version": "0.11.0.dev0",
    "decoder": [
        "diffusers",
        "UNet2DConditionModel"
    ],
    "decoder_scheduler": [
        "diffusers",
        "UnCLIPScheduler"
    ],
    "prior": [
        "diffusers",
        "PriorTransformer"
    ],
    "prior_scheduler": [
        "diffusers",
        "UnCLIPScheduler"
    ],
    "super_res_first": [
        "diffusers",
        "UNet2DModel"
    ],
    "super_res_last": [
        "diffusers",
        "UNet2DModel"
    ],
    "super_res_scheduler": [
        "diffusers",
        "UnCLIPScheduler"
    ],
    "text_encoder": [
        "transformers",
        "CLIPTextModelWithProjection"
    ],
    "text_proj": [
        "unclip",
        "UnCLIPTextProjModel"
    ],
    "tokenizer": [
        "transformers",
        "CLIPTokenizer"
    ],

    "_classes": [
        {
            "_class_name": "UnCLIPImageVariationPipeline",
            "_diffusers_version": "0.12.0.dev0",
            "decoder": [
                "diffusers",
                "UNet2DConditionModel"
            ],
            "decoder_scheduler": [
                "diffusers",
                "UnCLIPScheduler"
            ],
            "feature_extractor": [
                "transformers",
                "CLIPImageProcessor"
            ],
            "image_encoder": [
                "transformers",
                "CLIPVisionModelWithProjection"
            ],
            "super_res_first": [
                "diffusers",
                "UNet2DModel"
            ],
            "super_res_last": [
                "diffusers",
                "UNet2DModel"
            ],
            "super_res_scheduler": [
                "diffusers",
                "UnCLIPScheduler"
            ],
            "text_encoder": [
                "transformers",
                "CLIPTextModelWithProjection"
            ],
            "text_proj": [
                "unclip",
                "UnCLIPTextProjModel"
            ],
            "tokenizer": [
                "transformers",
                "CLIPTokenizer"
            ]
        }
    ]
}

We add one new top-level underscore-prefixed key, _classes, to indicate other classes the repository can be used with. Because the rest of the config file remains the same and the new key is underscore-prefixed, this is backwards compatible, and the top-level _class_name will be the pipeline loaded by default when the repo is loaded from DiffusionPipeline.
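Under the proposal, loading would look something like this (illustrative only; the _classes lookup described here was never merged):

from diffusers import DiffusionPipeline, UnCLIPImageVariationPipeline

# The generic entry point resolves to the top-level _class_name, i.e. UnCLIPPipeline.
pipe = DiffusionPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")

# A concrete class that doesn't match _class_name would be looked up in _classes.
variation_pipe = UnCLIPImageVariationPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")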

We only have to make a small code change in DiffusionPipeline.from_pretrained: if _class_name doesn't match the current class and the current class is not DiffusionPipeline itself, we look in _classes for a config that matches the current class (a sketch follows).
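A minimal sketch of that lookup, assuming a hypothetical helper _resolve_pipeline_config and a config_dict already parsed from model_index.json:

def _resolve_pipeline_config(config_dict, requested_class_name):
    # Default behavior: the generic entry point, or a class matching the
    # top-level _class_name, uses the top-level config unchanged.
    if requested_class_name in ("DiffusionPipeline", config_dict["_class_name"]):
        return config_dict

    # Proposed fallback: scan the new _classes list for a config whose
    # _class_name matches the concrete pipeline class being loaded.
    for alt_config in config_dict.get("_classes", []):
        if alt_config["_class_name"] == requested_class_name:
            return alt_config

    raise ValueError(
        f"{requested_class_name} matches neither _class_name nor any entry in _classes"
    )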

The different pipeline classes don't need any new config, and almost all current logic in DiffusionPipeline.from_pretrained can remain the same.

Potential downsides: save_pretrained can write out a pipeline config different from the one loaded with from_pretrained. This might be a bit confusing if a user looks at a pipeline loaded from the hub and sees a different config file than their locally saved copy. This would occur anyway in most cases of combining multiple pipelines in one repo, due to unused modules. I think this is acceptable with some added docs.

Two different pipelines that share the same repo will have to use the same names for their shared modules. I think this is acceptable because we already do this by convention and have one potential mixin that relies on it: https://github.com/huggingface/diffusers/pull/2009#discussion_r1084632619

patrickvonplaten commented 1 year ago

Thanks for the write-up! I'm not in favor of the newly proposed model_index.json design, as it makes the file much harder to read and will explode the complexity of the config.

Overall, I don't really think the case of UnCLIP leads to a "combination" explosion, because UnCLIP arguably left out part of the model that was used for training it. The CLIP image encoder is required when training UnCLIP (also the "text" variant) but not needed during inference. Therefore it's ok not to have it in the "inference" pipeline, but since it's required for training, it's totally fine (probably even cleaner) to have it in the same repo. Also, IMO it's very rare that new fields of model_index.json are added after the integration of a new pipeline. This can happen only for weights that are used for training but aren't necessary for inference.

Upon second reflection, I do understand your point, and we might indeed go down a dangerous path here and allow for too much hacking.

=> I would be in favor of one of the two:

I'm not in favor of the more complex config. Actually, probably leaning more towards 1.) than 2.) upon second thought.

williamberman commented 1 year ago

Ok very good points!

The CLIP image encoder is required when training UnCLIP (also the "text" variant) but not needed during inference. Therefore it's ok not to have it in the "inference" pipeline, but since it's required for training, it's totally fine (probably even cleaner) to have it in the same repo.

I think this makes a lot of sense, and it might mean we should have a general "ignore" capability when loading a pipeline. Currently, we would ignore the image encoder implicitly because unused modules are simply not present in model_index.json.
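Such a capability might look something like the sketch below; ignore_modules is purely hypothetical here, not an existing from_pretrained argument:

from diffusers import UnCLIPPipeline

# Hypothetical usage: skip loading training-only modules that happen to
# live in the same hub repo as the inference pipeline.
pipe = UnCLIPPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha",
    ignore_modules=["image_encoder", "feature_extractor"],
)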

I do think there might be advantages to combining pipelines under a single repo, but we should give it some additional thought and solicit more feedback from the community.

I would recommend we:

  1. For now, upload fusing/karlo-image-variations-diffusers to kakaobrain/karlo-v1-alpha-image-variations-diffusers
  2. Open a new issue soliciting feedback on if/how to merge weights into one repo.

I think there might end up not being much demand for merging weights into one repo, and potentially too many edge cases to consider to make it worthwhile.

williamberman commented 1 year ago

We ended up just uploading the image variation weights under a separate repo: https://huggingface.co/kakaobrain/karlo-v1-alpha-image-variations
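For reference, the two pipelines now load from their own repos along these lines (standard from_pretrained usage; the snippet itself is mine, not from the thread):

from diffusers import UnCLIPPipeline, UnCLIPImageVariationPipeline

# Text-to-image weights stay in the original repo.
text2img = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha")

# Image variation weights live in the new, separate repo.
img_var = UnCLIPImageVariationPipeline.from_pretrained(
    "kakaobrain/karlo-v1-alpha-image-variations"
)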

Holding off on opening a separate issue soliciting feedback on if/how to merge weights into one repo, as I personally haven't seen many asks for it.