huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

Using a Mask with the img2img Pipeline (perhaps as callback function?) #2073

Closed nicollegah closed 1 year ago

nicollegah commented 1 year ago

I have a looped img2img pipeline to produce videos, similar to Deforum. For that, I would love some objects in the scene to remain unchanged. The way to do this is to use a mask. With Hugging Face components, the only way I am aware of to achieve this is the inpainting models; however, that is a different thing. I want a certain part of the image to remain unchanged while the normal img2img pipeline acts on the rest.

There is the callback function in the img2img pipeline, and my guess is that one could use it to achieve what I am asking for, but I am unsure how.

The deforum script somehow does it (I think here: https://github.com/HelixNGC7293/DeforumStableDiffusionLocal/blob/17bf2f9b07167f8958065fa1b322ff83029d95b2/deforum-stable-diffusion/helpers/generate.py#L268 although there are several similar lines in the repo).

Thanks for your amazing work.
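
One possible sketch of the callback idea, heavily hedged: it relies on the older `callback`/`callback_steps` arguments receiving the live latents tensor so that in-place edits take effect, which is version-dependent behaviour rather than a documented contract. The mask, file names, and prompt below are placeholders.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("frame.png").convert("RGB").resize((512, 512))

# Latent-resolution mask: 1 = keep the original content, 0 = let img2img repaint.
mask = torch.zeros(1, 1, 64, 64, device="cuda", dtype=torch.float16)
mask[:, :, :32, :] = 1.0  # placeholder: keep the top half of the frame

# Encode the original frame once with the pipeline's VAE
# (0.18215 is the Stable Diffusion VAE scaling factor).
arr = np.asarray(init_image, dtype=np.float32) / 255.0
img_tensor = (torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0) * 2 - 1).to("cuda", torch.float16)
with torch.no_grad():
    init_latents = pipe.vae.encode(img_tensor).latent_dist.sample() * 0.18215

def keep_masked_region(step, timestep, latents):
    # Re-noise the original latents to the current timestep and paste them back
    # into the masked region -- the same trick the legacy inpaint pipeline uses.
    noise = torch.randn_like(init_latents)
    noised_init = pipe.scheduler.add_noise(init_latents, noise, timestep)
    latents[:] = mask * noised_init + (1 - mask) * latents

result = pipe(
    prompt="a painting of a forest",
    image=init_image,
    strength=0.5,
    callback=keep_masked_region,
    callback_steps=1,
).images[0]
result.save("out.png")
```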

patrickvonplaten commented 1 year ago

@nicollegah, actually you could also try to make use of the depth2image pipeline and pass a custom depth mask: https://github.com/huggingface/diffusers/blob/fc8afa3ab5eb840ab0da5aadb629bf671eef9a39/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py#L429 which is adapted to your use case. In this sense depth2image is a combination of img2img and inpainting.

Maybe this could help? :-)
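
For anyone who wants to try this, a minimal sketch of passing a custom `depth_map`. The file name and the random placeholder depth tensor are assumptions; check `prepare_depth_map` in your diffusers version for the exact expected shape, which here is assumed to be `(batch, height, width)`.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("frame.png").convert("RGB").resize((512, 512))

# Placeholder: a depth map you have computed yourself (e.g. with MiDaS) and then
# edited so the objects that should stay put are clearly separated from the rest.
depth_map = torch.rand(1, 384, 384, device="cuda", dtype=torch.float16)

image = pipe(
    prompt="a fantasy landscape, trending on artstation",
    image=init_image,
    depth_map=depth_map,
    strength=0.7,
).images[0]
image.save("out.png")
```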

brucethemoose commented 1 year ago

I am experimenting with something similar right now. If you are amenable to external tools, it's quite easy to generate a "motion mask" in VapourSynth so that unchanged parts of the scene remain static across frames. There is a simple motionmask plugin and a more involved way to do it with mvtools.

There are additional post-processing plugins like reduceflicker or temporalsoften that also help with coherence.

I am also working on a modification of the img2img pipeline that takes two frames instead of one (the current frame, and the previous, processed, motion-compensated frame) and merges them in latent space before feeding them to SD, to try to get some temporal coherence out of it.
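
In case it is useful to anyone, here is my own rough approximation of that two-frame idea (not the actual modification described above; file names, prompt, and the 0.3 blend weight are placeholders): encode both frames with the pipeline's VAE, interpolate in latent space, decode, and feed the blend into the regular img2img pipeline.

```python
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def to_latents(img: Image.Image) -> torch.Tensor:
    # PIL image -> scaled VAE latents, using the usual SD preprocessing.
    arr = np.asarray(img.convert("RGB").resize((512, 512)), dtype=np.float32) / 255.0
    tensor = (torch.from_numpy(arr).permute(2, 0, 1).unsqueeze(0) * 2 - 1).to("cuda", torch.float16)
    with torch.no_grad():
        return pipe.vae.encode(tensor).latent_dist.sample() * 0.18215

def to_image(latents: torch.Tensor) -> Image.Image:
    with torch.no_grad():
        decoded = pipe.vae.decode(latents / 0.18215).sample
    decoded = (decoded / 2 + 0.5).clamp(0, 1)
    arr = (decoded[0].permute(1, 2, 0).float().cpu().numpy() * 255).astype(np.uint8)
    return Image.fromarray(arr)

current = to_latents(Image.open("frame_0010.png"))
previous = to_latents(Image.open("frame_0009_processed.png"))  # motion-compensated previous output

# Lerp toward the previous processed frame for temporal coherence; the 0.3
# weight is arbitrary and would need tuning.
blended = torch.lerp(current, previous, 0.3)

result = pipe(
    prompt="an oil painting of a city street",
    image=to_image(blended),
    strength=0.4,
).images[0]
result.save("frame_0010_processed.png")
```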

brucethemoose commented 1 year ago

PS: it would be nice if the img2img pipeline took a "latents" input like txt2img does, to make generation more consistent between frames by feeding it the same latent noise.
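
Until something like that lands, one hedged workaround: re-seeding a `torch.Generator` with the same seed on every frame at least makes the noise that img2img adds deterministic across frames. The frame paths and prompt below are placeholders.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = [Image.open(f"frame_{i:04d}.png").convert("RGB") for i in range(24)]

outputs = []
for frame in frames:
    # Re-seed every frame so the noise added by img2img is identical each time,
    # a rough stand-in for feeding the same latents directly.
    generator = torch.Generator(device="cuda").manual_seed(1234)
    out = pipe(
        prompt="a watercolor painting",
        image=frame,
        strength=0.5,
        generator=generator,
    ).images[0]
    outputs.append(out)
```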

nicollegah commented 1 year ago

> I am experimenting with something similar right now. If you are amenable to external tools, it's quite easy to generate a "motion mask" in VapourSynth so that unchanged parts of the scene remain static across frames. There is a simple motionmask plugin and a more involved way to do it with mvtools.
>
> There are additional post-processing plugins like reduceflicker or temporalsoften that also help with coherence.
>
> I am also working on a modification of the img2img pipeline that takes two frames instead of one (the current frame, and the previous, processed, motion-compensated frame) and merges them in latent space before feeding them to SD, to try to get some temporal coherence out of it.

Thanks for your answer. All of this sounds very good. The last idea in particular sounds brilliant; I would be very keen to learn from you :-) I will look into mvtools, but I also wouldn't mind building the mask myself if there were just a way to apply one.

For coherence, Deforum adds noise at the individual steps, and the noise changes every iteration - but I don't really understand how or why this works.
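
For what it's worth, my hedged reading of that trick (I have not verified it against Deforum's source): a small amount of fresh Gaussian noise is added to each frame before it goes back into img2img, so the feedback loop does not progressively blur away detail. Roughly:

```python
import numpy as np

def add_frame_noise(frame: np.ndarray, amount: float, rng: np.random.Generator) -> np.ndarray:
    """Add a little fresh noise to a float32 frame in [0, 1] before img2img.

    `amount` is a small value (e.g. 0.02) that Deforum-style loops typically
    schedule per frame.
    """
    noisy = frame + amount * rng.standard_normal(frame.shape).astype(np.float32)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
frame = np.zeros((512, 512, 3), dtype=np.float32)  # placeholder frame
noisy_frame = add_frame_noise(frame, amount=0.02, rng=rng)
```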

nicollegah commented 1 year ago

> @nicollegah, actually you could also try to make use of the depth2image pipeline and pass a custom depth mask:
>
> https://github.com/huggingface/diffusers/blob/fc8afa3ab5eb840ab0da5aadb629bf671eef9a39/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py#L429
>
> which is adapted to your use case. In this sense depth2image is a combination of img2img and inpainting. Maybe this could help? :-)

Interesting idea. I will try it out.

brucethemoose commented 1 year ago

Deforum is txt2vid (or img2vid?), so its strategy is very different from vid2vid.

If txt2vid/img2vid is what you are trying to do, then I think I misinterpreted, and you are probably better off staying within PyTorch rather than jumping out to VapourSynth.

nicollegah commented 1 year ago

> @nicollegah, actually you could also try to make use of the depth2image pipeline and pass a custom depth mask:
>
> https://github.com/huggingface/diffusers/blob/fc8afa3ab5eb840ab0da5aadb629bf671eef9a39/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_depth2img.py#L429
>
> which is adapted to your use case. In this sense depth2image is a combination of img2img and inpainting. Maybe this could help? :-)

But this means I cannot use the normal v1.5 model and its awesome fine-tuned counterparts, right? I would really prefer to use one of the "standard" models.

And yes, I want to do text-to-video.

patrickvonplaten commented 1 year ago

> PS: it would be nice if the img2img pipeline took a "latents" input like txt2img does, to make generation more consistent between frames by feeding it the same latent noise.

I think this would be very nice indeed! Would love to review a PR for it :-) Will also assign myself in case I find time (probably not in the next 2 weeks though)

brucethemoose commented 1 year ago

> But this means I cannot use the normal v1.5 model and its awesome fine-tuned counterparts, right? I would really prefer to use one of the "standard" models.
>
> And yes, I want to do text-to-video.

@nicollegah

Yeah, that is my dilemma as well. The hundreds (probably thousands by now?) of community models are all standard no-mask models (their UNets take only the plain latent image, with no extra mask channels), so a single latent-space input image is all you get. InvokeAI and the Automatic1111 UI can seemingly use the community models for inpainting, but I'm not sure how they manage it.

One other thing you can work with (without switching to inpainting models) is the input prompt. You can take the previous frame as a "style input", like the image variation pipeline does, either using it exclusively or merging it with the text prompt, and maybe that will make the output conform more closely to the previous frame.

vvsotnikov commented 1 year ago

Isn't StableDiffusionInpaintPipelineLegacy basically a masked img2img? From what I see it uses the same approach as Deforum or Automatic's webui: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py#L661
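
A minimal sketch of using that legacy pipeline with a plain v1.5 checkpoint (it has since been deprecated, and argument names have shifted across diffusers versions, so treat this as an outline; file names and prompt are placeholders):

```python
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipelineLegacy

pipe = StableDiffusionInpaintPipelineLegacy.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("frame.png").convert("RGB").resize((512, 512))
# White pixels in the mask are repainted, black pixels are kept.
mask_image = Image.open("mask.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a painting of a forest",
    image=init_image,
    mask_image=mask_image,
    strength=0.6,
).images[0]
result.save("out.png")
```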

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

haofanwang commented 1 year ago

> Isn't StableDiffusionInpaintPipelineLegacy basically a masked img2img? From what I see it uses the same approach as Deforum or Automatic's webui: https://github.com/huggingface/diffusers/blob/main/src/diffusers/pipelines/stable_diffusion/pipeline_stable_diffusion_inpaint_legacy.py#L661

This is exactly what I'm looking for! Thanks bro.