kiri-art / docker-diffusers-api

Diffusers / Stable Diffusion in docker with a REST API, supporting various models, pipelines & schedulers.
https://kiri.art/
MIT License

Add VAE to txt2img Inference #32

Open digiphd opened 1 year ago

digiphd commented 1 year ago

Hey hey!

So I am using some models that either have a VAE baked in or require a separate VAE to be defined at inference time, like this:

from diffusers import AutoencoderKL, StableDiffusionPipeline

model = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
pipe = StableDiffusionPipeline.from_pretrained(model, vae=vae)

When I either manually added the VAE or used a model with a VAE baked in as the MODEL_ID, I received the following error, for example with the model dreamlike-art/dreamlike-photoreal-2.0:

'name': 'RuntimeError',
'message': 'Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same',
'stack':
Traceback (most recent call last):
  File "/api/app.py", line 382, in inference
    images = pipeline(**model_inputs).images
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/api/diffusers/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion.py", line 606, in __call__
    noise_pred = self.unet(latent_model_input, t, encoder_hidden_states=prompt_embeds).sample
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/api/diffusers/src/diffusers/models/unet_2d_condition.py", line 475, in forward
    sample = self.conv_in(sample)
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 457, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/opt/conda/envs/xformers/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 453, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (torch.cuda.FloatTensor) and weight type (torch.cuda.HalfTensor) should be the same

Line 382 is in the inference function and looks like this:

images = pipeline(**model_inputs).images

Perhaps we need to add a .half() to the input somewhere, though I'm not sure where.
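
Maybe something like this, in plain diffusers (an untested sketch on my side; the key seems to be loading the VAE in the same precision as the rest of the pipeline):

import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Load the VAE in fp16 so its weights match the fp16 UNet and inputs,
# avoiding the FloatTensor-vs-HalfTensor mismatch above.
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
)
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    revision="fp16",
    torch_dtype=torch.float16,
    vae=vae,
).to("cuda")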

Any help would be greatly appreciated!

It's the last hurdle I am facing to be generating images.

IDEA: It would be awesome if we could define an optional VAE when making an API call, like this:

model_inputs["callInputs"] = {
    "MODEL_ID": "runwayml/stable-diffusion-v1-5",
    "PIPELINE": "StableDiffusionPipeline",
    "SCHEDULER": self.scheduler,
    "VAE": "stabilityai/sd-vae-ft-mse",
}
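
On the server side, I imagine it could be handled with something like this (a purely hypothetical sketch; the "VAE" key doesn't exist in the current API):

import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

def load_pipeline(call_inputs):
    # Hypothetical handler: swap in a custom VAE when one is requested.
    kwargs = {"torch_dtype": torch.float16}
    vae_id = call_inputs.get("VAE")
    if vae_id:
        # Keep the override VAE in the same precision as the pipeline.
        kwargs["vae"] = AutoencoderKL.from_pretrained(
            vae_id, torch_dtype=torch.float16
        )
    return StableDiffusionPipeline.from_pretrained(
        call_inputs["MODEL_ID"], **kwargs
    ).to("cuda")
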
gadicc commented 1 year ago

Hey, @digiphd! Thanks for getting this on my radar. I'll have a chance to take a look during this coming week.

As a preliminary comment, I like the idea of being able to switch the VAE at runtime, although there will be a lot of work involved to adapt how we currently cache models.

P.S. If you're impatient, in the meantime, I think you could probably do the following (a rough Python sketch of steps 1-2 follows the list):

  1. Clone https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/fp16
  2. Replace the vae directory with the contents from https://huggingface.co/stabilityai/sd-vae-ft-mse/tree/main
  3. Upload that "new" model back to HuggingFace and build docker-diffusers-api with that (it's possible without uploading back to huggingface, but a bit more complicated).
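
Roughly, steps 1-2 in Python (an untested sketch; diffusers lets you swap the VAE and re-save the pipeline locally before uploading):

import torch
from diffusers import AutoencoderKL, StableDiffusionPipeline

# Load the fp16 base model, swap in sd-vae-ft-mse, and save the
# combined pipeline so it can be uploaded as a "new" model.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe.vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
)
pipe.save_pretrained("./stable-diffusion-v1-5-ft-mse")  # hypothetical local path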

Alternatively, with your current setup, you might get past that error by setting MODEL_PRECISION="" and MODEL_REVISION="" to use full precision (inference will be slower, but it may be useful in the interim).
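
For example, using the same callInputs style as above (untested; the empty strings select the default full-precision branch):

model_inputs["callInputs"] = {
    "MODEL_ID": "dreamlike-art/dreamlike-photoreal-2.0",
    "MODEL_PRECISION": "",  # empty = full precision
    "MODEL_REVISION": "",   # empty = default (fp32) branch
}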

Anyways, have a great weekend and we'll be in touch next week :grinning:

digiphd commented 1 year ago

Hey @gadicc, great, thanks for your suggestions, I will give them a go! You're a legend!

Another thing I was wondering: does docker-diffusers-api text-to-image support negative prompts?

I did pass one as an argument and it seemed to affect the output images in the intended (negative) way.

gadicc commented 1 year ago

Yup! It's the negative_prompt modelInput, as it seems you worked out.
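
For example, following the callInputs/modelInputs style from earlier in the thread (field names assumed from that example):

model_inputs["modelInputs"] = {
    "prompt": "a photo of an astronaut riding a horse",
    "negative_prompt": "blurry, low quality, watermark",
}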

The modelInputs are passed directly to the relevant diffusers pipeline, so you can use whatever arguments that pipeline supports. I made this a little clearer in the README a few days ago with links to the common diffusers pipelines, as I admit it wasn't so obvious until then :sweat_smile:

There's also a note there now about using the lpw_stable_diffusion pipeline, which supports longer prompts and prompt weights.
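
In plain diffusers, that community pipeline can be loaded like this (a sketch, not the docker-diffusers-api call format):

import torch
from diffusers import DiffusionPipeline

# "Long prompt weighting" community pipeline: handles prompts longer
# than 77 tokens and supports (token:weight) emphasis syntax.
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")
image = pipe("a (masterpiece:1.2) photo of a castle, highly detailed").images[0]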

Thanks for all the kind words! :raised_hands:

gadicc commented 1 year ago

Hey @digiphd, I had a quick moment to try dreamlike-art/dreamlike-photoreal-2.0 and it works out of the box for me, in both full and half precision. What version of docker-diffusers-api are you using?

These worked for me:

$ python test.py txt2img --call-arg MODEL_ID="dreamlike-art/dreamlike-photoreal-2.0" --call-arg MODEL_PRECISION=""
$ python test.py txt2img --call-arg MODEL_ID="dreamlike-art/dreamlike-photoreal-2.0" --call-arg MODEL_PRECISION="fp16"

I just tried in the default "runtime" config. If you have this issue specifically in the -build-download variant, let me know.

gadicc commented 1 year ago

Related: https://github.com/kiri-art/docker-diffusers-api/issues/26