invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry leading WebUI, and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

Inpainting-specific model support #1184

Closed. db3000 closed this issue 2 years ago.

db3000 commented 2 years ago

I find inpainting frustrating because it takes a lot of tries to get something that matches the existing image nicely. It seems like there are inpainting-specific models that are much better at this, for example RunwayML's model. It's also nice that the RunwayML model is trained with extra non-inpainting steps (akin to SD 1.5?), so it would be good to support txt2img with it as well.

Is it possible to add support for this model here? I tried porting in the RunwayML changes directly, but it seems like more is needed.

lstein commented 2 years ago

As long as the model isn't using hypernetworks, you can import it into InvokeAI using the !import command on the CLI. See https://github.com/invoke-ai/InvokeAI/blob/main/docs/features/CLI.md#model-selection-and-importation

I'll give it a try myself later today in the event that more is needed than simply bringing in a new checkpoint file.

db3000 commented 2 years ago

It is not using hypernetworks, but it needs some changes to the samplers and generation code as the model has extra channels which need to be initialized to the init image and mask.
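
For readers following along, here is a minimal sketch (illustrative only; tensor names are hypothetical, not InvokeAI code) of what those extra channels amount to: the noisy latent is concatenated along the channel dimension with a downsampled mask and the VAE-encoded masked init image before it enters the UNet.

import torch

# Hypothetical shapes for a 512x512 image in latent space (64x64, 4 channels).
noisy_latent = torch.randn(1, 4, 64, 64)           # the usual diffusion latent
mask = torch.ones(1, 1, 64, 64)                     # inpainting mask, downsampled to latent size
masked_image_latent = torch.randn(1, 4, 64, 64)     # VAE encoding of the init image with the hole blanked out

# The inpainting UNet expects 4 + 1 + 4 = 9 input channels instead of 4.
unet_input = torch.cat([noisy_latent, mask, masked_image_latent], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64])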

db3000 commented 2 years ago

Just porting in the model and config results in an error about not being able to find the LatentInpaintDiffusion class; when I brought that in, along with some other changes, generation just silently fails.

db3000 commented 2 years ago

For code changes from Runway ML's repo see: https://github.com/runwayml/stable-diffusion/compare/69ae4b3...main

There are some other implementations of this brewing in other forks that might be useful reference too.

Any-Winter-4079 commented 2 years ago

@db3000 On a related note, I'm confused about https://huggingface.co/runwayml/stable-diffusion-v1-5. I'm assuming it is RunwayML's own version (?) labeled 1.5 to be more catchy? I also see they have the stable-diffusion-inpainting ckpt.

lstein commented 2 years ago

Ok, looks like a little bit of research and coding is in order. Thanks for taking it as far as you did. If you could summarize what you understand about the necessary changes that would be very helpful.

Lincoln


lstein commented 2 years ago

Apparently this is the bona fide Stable Diffusion 1.5 model.


Any-Winter-4079 commented 2 years ago

See https://huggingface.co/runwayml/stable-diffusion-v1-5/discussions/1. It's still being investigated, but at least it seems to be the true v1.5:

We confirm there has been no breach of IP as flagged and we thank Stability AI for the compute donation to retrain the original model.

lstein commented 2 years ago

I downloaded and installed the 1.5 model that Hugging Face posted and it works quite well. However, it looks like there is some dispute going on between Stability AI and RunwayML, and it isn't at all clear that RunwayML's 1.5 is the same as StabilityAI's 1.5.

lstein commented 2 years ago

It apparently is the real deal? https://discord.com/channels/1020123559063990373/1020123559831539744/1032755319803215954

db3000 commented 2 years ago

Seems like https://huggingface.co/runwayml/stable-diffusion-v1-5/discussions/1 is closed now and Stability AI has withdrawn their takedown request.

Note (for general context) that the model being discussed there is separate from the inpainting model that this ticket is for. That can be found here: https://huggingface.co/runwayml/stable-diffusion-inpainting

Any-Winter-4079 commented 2 years ago

Preliminary results, but inpainting v1.5 and v1.5 seem to produce the same results for prompt2img inference (?), so maybe having both models around is redundant (?), if the inpainting model can do the best of both worlds.

Update: Also, using inpainting gives me the same result for both models. And both are 4.27GB. Are they the same model? Supposedly, they are not meant to be the same, right?

stable-diffusion-v1-5 Resumed from stable-diffusion-v1-2 - 595,000 steps at resolution 512x512 on "laion-aesthetics v2 5+" and 10 % dropping of the text-conditioning to improve classifier-free guidance sampling.

stable-diffusion-inpainting Resumed from stable-diffusion-v1-5 - then 440,000 steps of inpainting training at resolution 512x512 on “laion-aesthetics v2 5+” and 10% dropping of the text-conditioning. For inpainting, the UNet has 5 additional input channels (4 for the encoded masked-image and 1 for the mask itself) whose weights were zero-initialized after restoring the non-inpainting checkpoint. During training, we generate synthetic masks and in 25% mask everything.
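
As an aside, a rough sketch (not the RunwayML training code) of what that zero-initialization means for the first UNet convolution; it is consistent with the input_blocks.0.0.weight shapes reported later in this thread:

import torch

# Pretend this is the first conv weight of the stock SD UNet: [out=320, in=4, k=3, k=3].
old_weight = torch.randn(320, 4, 3, 3)

# Widen to 9 input channels; the 5 new channels (mask + encoded masked image) start at zero,
# so right after conversion the widened model reacts only to the original 4 channels.
new_weight = torch.zeros(320, 9, 3, 3)
new_weight[:, :4] = old_weight
print(new_weight.shape)  # torch.Size([320, 9, 3, 3])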

Maybe we need to run that specific code @db3000 mentioned to see the different behavior between both models.

db3000 commented 2 years ago

Maybe we need to run that specific code @db3000 mentioned to see the different behavior between both models.

Right, I think you need to initialize the model properly to take advantage of the inpainting.

How did you run against the inpainting model though? It crashes for me when I try.

db3000 commented 2 years ago

They seem to be different files with different checksums and different dimension weights inside (I don't know the exact significance of model.diffusion_model.input_blocks.0.0.weight, but that is what invoke.py complains about having the wrong size if you load the inpainting model with the normal config):

$ sum sd-v1-5-inpainting.ckpt
15650 4165467 sd-v1-5-inpainting.ckpt

$ python3.9 -c "import torch; print(torch.load('sd-v1-5-inpainting.ckpt')['state_dict']['model.diffusion_model.input_blocks.0.0.weight'].size())"
torch.Size([320, 9, 3, 3])

$ sum v1-5-pruned-emaonly.ckpt
40530 4165411 v1-5-pruned-emaonly.ckpt

$ python3.9 -c "import torch; print(torch.load('v1-5-pruned-emaonly.ckpt')['state_dict']['model.diffusion_model.input_blocks.0.0.weight'].size())"
torch.Size([320, 4, 3, 3])

db3000 commented 2 years ago

This is the error I get if I try to load sd-v1-5-inpainting.ckpt:

** model stable-diffusion-inpainting could not be loaded: Error(s) in loading state_dict for LatentDiffusion:
        size mismatch for model.diffusion_model.input_blocks.0.0.weight: copying a param with shape torch.Size([320, 9, 3, 3]) from checkpoint, the shape in current model is torch.Size([320, 4, 3, 3]).

Any-Winter-4079 commented 2 years ago

You're absolutely right. I was so tired I didn't even notice the message at the end :)

Caching model stable-diffusion-1.5 in system RAM
Loading stable-diffusion-1.5-inpaint from models/ldm/stable-diffusion-v1/sd-v1-5-inpainting.ckpt
   | LatentDiffusion: Running in eps-prediction mode
   | DiffusionWrapper has 859.52 M params.
   | Making attention of type 'vanilla' with 512 in_channels
   | Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
   | Making attention of type 'vanilla' with 512 in_channels
** model stable-diffusion-1.5-inpaint could not be loaded: Error(s) in loading state_dict for LatentDiffusion: size mismatch for model.diffusion_model.input_blocks.0.0.weight: copying a param with shape torch.Size([320, 9, 3, 3]) from checkpoint, the shape in current model is torch.Size([320, 4, 3, 3]).

No wonder it was the same result. It was defaulting back to 1.5. I'll try to see how to run torch.Size([320, 9, 3, 3])

db3000 commented 2 years ago

Hah, no worries. I just wanted to make sure I wasn't doing something wrong.

So if you use the RunwayML yaml config file with the model then it clears that error, but it complains about a missing LatentInpaintDiffusion class. And if you paste in that class from the RunwayML codebase into ddpm.py, then generation just silently fails; it seems like a bunch of other changes are needed.

Any-Winter-4079 commented 2 years ago

Yes, I am exactly at that point (silently failing). I tried porting changes from https://github.com/runwayml/stable-diffusion/commits/main (e.g. some of the updated ddim.py code maps to our ddim.py and sampler.py, so I updated the relevant pieces of both files) and then had to add personalization_config to configs/stable-diffusion/v1-inpainting-inference.yaml. The model seemingly loads:

invoke> !switch stable-diffusion-1.5-inpaint
>> Caching model stable-diffusion-1.4 in system RAM
>> Loading stable-diffusion-1.5-inpaint from models/ldm/stable-diffusion-v1/sd-v1-5-inpainting.ckpt
   | LatentInpaintDiffusion: Running in eps-prediction mode
   | DiffusionWrapper has 859.54 M params.
Keeping EMAs of 688.
   | Making attention of type 'vanilla' with 512 in_channels
   | Working with z of shape (1, 4, 32, 32) = 4096 dimensions.
   | Making attention of type 'vanilla' with 512 in_channels
   | Using more accurate float32 precision
>> Model loaded in 6.63s
>> Setting Sampler to k_lms

But fails when used

invoke> "3d pixar dog, disney dog, 3d render dog, unreal engine dog, 3d movie dog character" -s 50 -S 3479165689 -W 512 -H 512 -C 7.5 -I toby.png -A k_lms -f 0.75
>> loaded input image of size 576x576 from toby.png
>> Initial image has transparent areas. Will inpaint in these regions.
>> This input is larger than your defaults. If you run out of memory, please use a smaller image.
>> Using recommended DDIM sampler for inpainting.
>> target t_enc is 37 steps

invoke> 

I also tried the streamlit option (using this repo): streamlit run scripts/inpaint_st.py -- configs/stable-diffusion/v1-inpainting-inference.yaml models/ldm/stable-diffusion-v1/sd-v1-5-inpainting.ckpt, but that has errors of its own: KeyError: '27 is not registered'.

Any-Winter-4079 commented 2 years ago

Okay, so in our repository it's failing in ldm/invoke/generator/inpaint.py, at least for me.

In the RunwayML version, inpaint_st.py defines def inpaint(sampler, image, mask, prompt, seed, scale, ddim_steps, num_samples=1, w=512, h=512), so maybe we can port this function to inpaint.py.
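
Going only by the signature quoted above, a call might look like this (file names and argument values are placeholders, and the sampler comes from the script's initialize_model helper mentioned later in this thread):

from PIL import Image

# sampler = initialize_model('configs/stable-diffusion/v1-inpainting-inference.yaml',
#                            'models/ldm/stable-diffusion-v1/sd-v1-5-inpainting.ckpt')
image = Image.open("source.png")   # hypothetical input image
mask = Image.open("mask.png")      # hypothetical mask image

result = inpaint(sampler, image, mask, "dog face",
                 seed=1, scale=7.5, ddim_steps=50, num_samples=1, w=512, h=512)
result[0].save("out.png", "PNG")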

db3000 commented 2 years ago

I think the RunwayML version also doesn't work with all the samplers; some implementations that claim to, though, are:

Any-Winter-4079 commented 2 years ago

Okay, I think I know where it fails (?). Our inpainting has code in ldm/invoke/generator/inpaint.py but also in ldm/invoke/generator/base.py. It is in this last file where it tries to run with scope(self.model.device.type), self.model.ema_scope(): and fails. I assume it might have to do with the EMA scope, since when loading the 1.5-inpainting model I see Keeping EMAs of 688.


If I change to with scope(self.model.device.type): it gets further than before

xc = torch.cat([x] + c_concat, dim=1)
TypeError: can only concatenate list (not "NoneType") to list

which comes from ldm/models/diffusion/ddpm.py, because the conditioning key is hybrid, set in v1-inpainting-inference.yaml (conditioning_key: hybrid # important), in contrast to our regular inference, which uses self.conditioning_key == 'crossattn'.
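
For context, a small sketch (variable names are hypothetical) of the difference between the two conditioning modes; with hybrid, the model also expects c_concat tensors to stitch onto the latent, which is why a missing c_concat crashes the torch.cat above:

import torch

text_embedding = torch.randn(1, 77, 768)          # CLIP text conditioning (c_crossattn)
mask = torch.ones(1, 1, 64, 64)                    # extra channels for the hybrid model
masked_image_latent = torch.randn(1, 4, 64, 64)

# crossattn-only conditioning, as used by the regular SD checkpoints:
cond_crossattn = {"c_crossattn": [text_embedding]}

# hybrid conditioning, as the inpainting checkpoint wants it; if c_concat is None,
# torch.cat([x] + c_concat, dim=1) fails with exactly the error shown above.
cond_hybrid = {
    "c_crossattn": [text_embedding],
    "c_concat": [torch.cat([mask, masked_image_latent], dim=1)],
}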

Any-Winter-4079 commented 2 years ago

Back after a break, I may have been able to run it. Source and mask:

Screenshot 2022-10-21 at 17 57 56

Result:

Screenshot 2022-10-21 at 17 57 23

There are a few question marks, though, so I'll leave it for @lstein to do properly (I don't even know if I did it right).

What I did (apart from what I mentioned in https://github.com/invoke-ai/InvokeAI/issues/1184#issuecomment-1286884623) was to run python3 scripts/inpaint_st.py, removing the streamlit parts and hardcoding the values, e.g.

sampler = initialize_model('configs/stable-diffusion/v1-inpainting-inference.yaml', 'models/ldm/stable-diffusion-v1/sd-v1-5-inpainting.ckpt')
image = Image.open("test4.png")
seed = 1
num_samples = 1
scale = 7.5
ddim_steps = 50

etc. and in the end, saving the result

result[0].save('outputs/new/test.png', 'PNG')

PS: Not sure how the mask works. Do I use the same image with transparent parts for the mask? I may try later, to see if I can set a proper mask.
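
If it helps, here is a small PIL sketch (illustrative, not code from the script) of turning an RGBA image's transparency into a binary mask and inverting it, which is exactly the ambiguity above:

from PIL import Image, ImageOps

rgba = Image.open("test4.png").convert("RGBA")   # hypothetical image with transparent regions
alpha = rgba.split()[-1]                         # alpha channel: 0 = transparent, 255 = opaque

# One convention: white = keep, black = inpaint (transparent areas become the hole).
mask_keep = alpha.point(lambda a: 255 if a > 0 else 0)

# The opposite convention (white = inpaint) is just the inversion.
mask_inpaint = ImageOps.invert(mask_keep)
mask_inpaint.save("mask.png")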

Any-Winter-4079 commented 2 years ago

A bit better, changing only the face. (prompt 'dog face')

image

Also, with prompt 'shiba inu':

Screenshot 2022-10-21 at 23 44 52

Any-Winter-4079 commented 2 years ago

@db3000 This is the code, if you want to give it a look / try: inpaint_st.py.zip. My mask is the only weird thing: everything is transparent except the part to inpaint (but I think we could invert it in the code to be more intuitive): test8.png.zip

Screenshot 2022-10-22 at 00 01 17

A few more results:

image

Any-Winter-4079 commented 2 years ago

Yep, it's more intuitive like this: inpaint_st.py.zip. Initial:

Screenshot 2022-10-22 at 00 08 36

Mask (note it's a screenshot of it)

Screenshot 2022-10-22 at 00 08 03

Prompt: taylor swift. Result:

Screenshot 2022-10-22 at 00 10 59

I have a Mac, so I can't really compare with the CUDA results other people put out there, and I'm not sure if I am running this correctly. I'll wait for some other people to run this :) But at least it looks decent!

db3000 commented 2 years ago

Thanks! I will try this out later

Neosettler commented 2 years ago

Based on the description, it seems that the 1.5 inpainting model would behave the same as the standard 1.5 (if not using the inpainting feature).

lstein commented 2 years ago

This is very cool indeed. Based on this it will not be hard at all to port it over to InvokeAI, and the code will get much simpler because all the masking work is already done in the model. I also notice that there is cross-attention built into this model.

I'm a little confused about where you are running the inpaint_st.py script, however. On line 6 there is this:

from main import instantiate_from_config

However, if this is referring to the main.py file, then instantiate_from_config isn't defined there. It is defined in ldm.util. So how does this work?

Any-Winter-4079 commented 2 years ago

I run it with python scripts/inpaint_st.py, so inpaint_st.py is inside the scripts directory. In main.py we have from ldm.util import instantiate_from_config, so it imports:

def instantiate_from_config(config, **kwargs):
    if not 'target' in config:
        if config == '__is_first_stage__':
            return None
        elif config == '__is_unconditional__':
            return None
        raise KeyError('Expected key `target` to instantiate.')
    return get_obj_from_str(config['target'])(
        **config.get('params', dict()), **kwargs
    )

lstein commented 2 years ago

Before I get into the inpainting functionality, I'm trying to modify our code base so that the inpainting model can be used for vanilla txt2img and img2img. I've gotten it to the point where the inference loop in ldm.invoke.generator.base actually runs (by removing self.model.ema_scope()). After this, I added a check for c_concat being None in ddpm.py and fell back to using crossattn, which got past this step.
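
Roughly, the guard described above amounts to something like this (a hypothetical sketch of the change, not the actual diff):

import torch

def wrapper_forward(diffusion_model, x, t, c_concat, c_crossattn, conditioning_key):
    # Hypothetical stand-in for DiffusionWrapper.forward with the fallback described above.
    cc = torch.cat(c_crossattn, 1)
    if conditioning_key == 'hybrid' and c_concat is not None:
        # inpainting path: stitch the mask / masked-image channels onto the noisy latent
        xc = torch.cat([x] + c_concat, dim=1)
        return diffusion_model(xc, t, context=cc)
    # fallback: behave like a plain crossattn model when no c_concat is supplied
    return diffusion_model(x, t, context=cc)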

However, running a txt2image prompt now gives:

File "/u/lstein/projects/SD/InvokeAI/ldm/models/diffusion/ddim.py", line 45, in p_sample
    e_t_uncond, e_t = self.model.apply_model(x_in, t_in, c_in).chunk(2)
  File "/u/lstein/projects/SD/InvokeAI/ldm/models/diffusion/ddpm.py", line 1441, in apply_model
    x_recon = self.model(x_noisy, t, **cond)
  File "/u/lstein/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/u/lstein/projects/SD/InvokeAI/ldm/models/diffusion/ddpm.py", line 2174, in forward
    out = self.diffusion_model(x, t, context=cc)
  File "/u/lstein/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/u/lstein/projects/SD/InvokeAI/ldm/modules/diffusionmodules/openaimodel.py", line 806, in forward
    h = module(h, emb, context)
  File "/u/lstein/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/u/lstein/projects/SD/InvokeAI/ldm/modules/diffusionmodules/openaimodel.py", line 90, in forward
    x = layer(x)
  File "/u/lstein/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/u/lstein/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/u/lstein/.conda/envs/invokeai/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 443, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Given groups=1, weight of size [320, 9, 3, 3], expected input[2, 4, 64, 64] to have 9 channels, but got 4 channels instead

I think this is because the model requires the additional channels delivered by c_concat, and indeed inpaint_st.py has a loop in which it adds several channels to c_concat. I have to determine whether you can dummy up these channels when there is no initial image or mask.

Do we know for sure that the inpainting model can be used for txt2image on its own? It will be messy to keep two models around and swap them back and forth depending on what the user is asking it to do.

lstein commented 2 years ago

@Any-Winter-4079 could you share with me the changes you made to sampler.py and ddim.py? inpaint_st.py works perfectly for me, with the exception that the masked region is simply replaced with a nice reconstructed background rather than with the prompt concept, as if the conditional_conditioning were not being provided.

lstein commented 2 years ago

Folks, thinking more about how to integrate the custom runwayML inpainting model into InvokeAI. I'm going to assume that it's technically infeasible to get the inpainting model to do pure txt2img and won't pursue this line of inquiry unless I hear otherwise. So here's what we can do instead.

  1. models.yaml will have new inpainting_weights and inpainting_config keys, as in:
    stable-diffusion-1.5:
      description: Stable Diffusion inference model version 1.5
      config: configs/stable-diffusion/v1-inference.yaml
      weights: models/ldm/stable-diffusion-v1/v1-5-pruned-emaonly.ckpt
      vae: models/ldm/stable-diffusion-v1/vae-ft-mse-840000-ema-pruned.ckpt
      inpainting_weights: sd-v1-5-inpainting.ckpt
      inpainting_config: configs/stable-diffusion/v1-inpainting-inference.yaml
      width: 512
      height: 512
  2. When the user provides a mask to img2img, the code will look for these two keys in the current model's config, and if they are present it will load the inpainting model (or switch from the CPU-cached version as appropriate) and run the new inpainting code.
  3. When any other operation is requested, the inpainting model will be swapped out so that it isn't used.

The main problem here is that there will be a delay for loading and a slight delay for switching. Also both models are pretty big and will eat CPU RAM.

Improvements on this idea most welcome.
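
A rough sketch of the dispatch in step 2 (function and key names other than inpainting_weights / inpainting_config are hypothetical, not the actual InvokeAI API):

def pick_weights_for_request(model_entry: dict, has_mask: bool) -> tuple:
    """Choose (weights, config) for a request, per the proposal above (sketch only)."""
    if has_mask and "inpainting_weights" in model_entry and "inpainting_config" in model_entry:
        # mask present and the model entry declares an inpainting variant: use it
        return model_entry["inpainting_weights"], model_entry["inpainting_config"]
    # everything else runs on the regular checkpoint
    return model_entry["weights"], model_entry["config"]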

Any-Winter-4079 commented 2 years ago

Do we know for sure that the inpainting model can be used for txt2image on its own? It will be messy to keep two models around and swap them back and forth depending on what the user is asking it to do.

I haven't tried inpainting for regular txt2img (nor read anything about it on Reddit). I think if it worked, results would be different given the additional training steps of that model, but maybe it could work. I got to the same errors as you (self.model.ema_scope(), etc.) and eventually left it when it said RuntimeError: Given groups=1, weight of size [320, 9, 3, 3], expected input[2, 4, 64, 64] to have 9 channels, but got 4 channels instead. At that point I switched my strategy to trying to run inpaint_st.py. But maybe adding 5 more channels would produce coherent images. It's a matter of trying.

Any-Winter-4079 commented 2 years ago

@Any-Winter-4079 could you share with me the changes you made to sampler.py and ddim.py? inpaint_st.py works perfectly for me, with the exception that the masked region is simply replaced with a nice reconstructed background rather than with the prompt concept, as if the conditional_conditioning were not being provided.

Changes to sampler.py: Right before # check to see if make_schedule() has run, and if not, run it

if conditioning is not None:
    if isinstance(conditioning, dict):
        ctmp = conditioning[list(conditioning.keys())[0]]
        while isinstance(ctmp, list):
            ctmp = ctmp[0]
        cbs = ctmp.shape[0]
        if cbs != batch_size:
            print(f"Warning: Got {cbs} conditionings but batch-size is {batch_size}")
    else:
        if conditioning.shape[0] != batch_size:
            print(f"Warning: Got {conditioning.shape[0]} conditionings but batch-size is {batch_size}")

Changes to ddim.py: After t_in = torch.cat([t] * 2)

if isinstance(c, dict):
    assert isinstance(unconditional_conditioning, dict)
    c_in = dict()
    for k in c:
        if isinstance(c[k], list):
            c_in[k] = [
                torch.cat([unconditional_conditioning[k][i], c[k][i]])
                for i in range(len(c[k]))
            ]
        else:
            c_in[k] = torch.cat([unconditional_conditioning[k], c[k]])
else:
    c_in = torch.cat([unconditional_conditioning, c])

Changes to ddpm.py:

from omegaconf import ListConfig

In class LatentDiffusion:

    @torch.no_grad()
    def get_unconditional_conditioning(self, batch_size, null_label=None):
        if null_label is not None:
            xc = null_label
            if isinstance(xc, ListConfig):
                xc = list(xc)
            if isinstance(xc, dict) or isinstance(xc, list):
                c = self.get_learned_conditioning(xc)
            else:
                if hasattr(xc, "to"):
                    xc = xc.to(self.device)
                c = self.get_learned_conditioning(xc)
        else:
            # todo: get null label from cond_stage_model
            raise NotImplementedError()
        c = repeat(c, "1 ... -> b ...", b=batch_size).to(self.device)
        return c

And then add another class:

class LatentInpaintDiffusion(LatentDiffusion):
    def __init__(
        self,
        concat_keys=("mask", "masked_image"),
        masked_image_key="masked_image",
        finetune_keys=None,
        *args,
        **kwargs,
    ):
        super().__init__(*args, **kwargs)
        self.masked_image_key = masked_image_key
        assert self.masked_image_key in concat_keys
        self.concat_keys = concat_keys
    @torch.no_grad()
    def get_input(
        self, batch, k, cond_key=None, bs=None, return_first_stage_outputs=False
    ):
        # note: restricted to non-trainable encoders currently
        assert (
            not self.cond_stage_trainable
        ), "trainable cond stages not yet supported for inpainting"
        z, c, x, xrec, xc = super().get_input(
            batch,
            self.first_stage_key,
            return_first_stage_outputs=True,
            force_c_encode=True,
            return_original_cond=True,
            bs=bs,
        )
        assert exists(self.concat_keys)
        c_cat = list()
        for ck in self.concat_keys:
            cc = (
                rearrange(batch[ck], "b h w c -> b c h w")
                .to(memory_format=torch.contiguous_format)
                .float()
            )
            if bs is not None:
                cc = cc[:bs]
                cc = cc.to(self.device)
            bchw = z.shape
            if ck != self.masked_image_key:
                cc = torch.nn.functional.interpolate(cc, size=bchw[-2:])
            else:
                cc = self.get_first_stage_encoding(self.encode_first_stage(cc))
            c_cat.append(cc)
        c_cat = torch.cat(c_cat, dim=1)
        all_conds = {"c_concat": [c_cat], "c_crossattn": [c]}
        if return_first_stage_outputs:
            return z, all_conds, x, xrec, xc
        return z, all_conds

Any-Winter-4079 commented 2 years ago

@lstein This might also be easier for you. https://github.com/runwayml/stable-diffusion/commits/main It is the list of commits in the runwayml repo.

lstein commented 2 years ago

I missed the sampler.py change because it didn't seem to be doing anything except error checking. The cbs variable is never used again in the code. I still don't see what it is doing.

This might also be easier for you. https://github.com/runwayml/stable-diffusion/commits/main

In fact I had cloned this repository and did a git diff to get the changes.

lstein commented 2 years ago

Nope, that didn't fix it. I'm passing the prompt "macaw" to an image of myself with a masked parrot on the shoulder. It very nicely removes the parrot, but doesn't change it to a macaw:

Lincoln-and-Parrot-512

Lincoln-and-Parrot-512-transparent

test-3

I'm using the inpaint_st.py script you'd zipped. Any idea of what I might be doing wrong?

lstein commented 2 years ago

Do we know for sure that the inpainting model can be used for txt2image on its own? It will be messy to keep two models around and swap them back and forth depending on what the user is asking it to do.

I haven't tried inpainting for regular txt2img (nor read anything about it on Reddit). I think if it worked, results would be different given the additional training steps of that model, but maybe it could work. I got to the same errors as you (self.model.ema_scope(), etc.) and eventually left it when it said RuntimeError: Given groups=1, weight of size [320, 9, 3, 3], expected input[2, 4, 64, 64] to have 9 channels, but got 4 channels instead. At that point I switched my strategy to trying to run inpaint_st.py. But maybe adding 5 more channels would produce coherent images. It's a matter of trying.

I doubt it will work and don't want to spend time pursuing it. I'm going to use the strategy I outlined in the later comment... I just need to debug whatever's wrong in the inpaint_st.py script and library and then I can port.

db3000 commented 2 years ago

Folks, thinking more about how to integrate the custom runwayML inpainting model into InvokeAI. I'm going to assume that it's technically infeasible to get the inpainting model to do pure txt2img and won't pursue this line of inquiry unless I hear otherwise. So here's what we can do instead.

  1. models.yaml will have new inpainting_weights and inpainting_config keys, as in:
stable-diffusion-1.5:
  description: Stable Diffusion inference model version 1.5
  config: configs/stable-diffusion/v1-inference.yaml
  weights: models/ldm/stable-diffusion-v1/v1-5-pruned-emaonly.ckpt
  vae: models/ldm/stable-diffusion-v1/vae-ft-mse-840000-ema-pruned.ckpt
  inpainting_weights: sd-v1-5-inpainting.ckpt
  inpainting_config: configs/stable-diffusion/v1-inpainting-inference.yaml
  width: 512
  height: 512
  2. When the user provides a mask to img2img, the code will look for these two keys in the current model's config, and if they are present it will load the inpainting model (or switch from the CPU-cached version as appropriate) and run the new inpainting code.
  3. When any other operation is requested, the inpainting model will be swapped out so that it isn't used.

The main problem here is that there will be a delay for loading and a slight delay for switching. Also both models are pretty big and will eat CPU RAM.

Improvements on this idea most welcome.

My understanding is txt2img works with this model by passing in a fully masked dummy image for the extra channels, for example https://github.com/AUTOMATIC1111/stable-diffusion-webui/blob/6cbb04f7a5e675cf1f6dfc247aa9c9e8df7dc5ce/modules/processing.py#L559

Probably worth trying something like that out first, as it would be much simpler to pass in a dummy image/mask in txt2img mode if the model needs it than to introduce a new split inpainting-model concept for all models in the code.
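
Conceptually (a sketch of the idea only, not the code at the linked line), the dummy conditioning for txt2img is just a mask that covers everything plus an empty masked-image latent:

import torch

batch, latent_h, latent_w = 1, 64, 64

# "Fully masked" dummy conditioning: inpaint everywhere, preserve nothing.
dummy_mask = torch.ones(batch, 1, latent_h, latent_w)             # 1 = inpaint this position
dummy_masked_latent = torch.zeros(batch, 4, latent_h, latent_w)   # stand-in for the latent of a blank image

c_concat = torch.cat([dummy_mask, dummy_masked_latent], dim=1)    # the 5 extra UNet channels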

lstein commented 2 years ago

Interesting. I wouldn't have thought that would work. I'll give it a try now.

Any-Winter-4079 commented 2 years ago

Are you setting the prompt in inpaint_st.py? e.g. prompt = 'taylor swift'

lstein commented 2 years ago

Yes!

lstein commented 2 years ago

When you added personalization_config to v1-inpainting-inference.yaml, did you just cut and paste the same section from v1-inference.yaml?

Any-Winter-4079 commented 2 years ago

Indeed. I also get your image.

Any-Winter-4079 commented 2 years ago

When you added personalization_config to v1-inpainting-inference.yaml, did you just cut and paste the same section from v1-inference.yaml?

Yep.

Any-Winter-4079 commented 2 years ago

Macaw bird does work, though.

lstein commented 2 years ago

Sorry. I don't understand. "macaw" doesn't work, but "macaw bird" does?

Any-Winter-4079 commented 2 years ago
Screenshot 2022-10-24 at 23 55 59

I don't know if it's a thing with single-word prompts (a bug) or whether we're missing some step from runwayml's implementation. But yeah, try with the prompt: macaw bird.

lstein commented 2 years ago

Bloody hell. Neither "macaw" nor "macaw bird" work, but "Macaw bird" does. Case sensitive?

Any-Winter-4079 commented 2 years ago

Actually, the lowercase version works for me. Maybe it's seed-specific. I also tried with eagle and again got nothing. But "an eagle sitting on shoulder" does:

image