huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

[SD-XL] Enhanced inference control #4003

Closed bghira closed 1 year ago

bghira commented 1 year ago

Feature Request: Enhanced Control of the Inference Process in SDXL Pipeline

We are seeking to enhance control over the denoising process within the SDXL pipeline. The objective is to better handle the generation and refinement of images through finer control of noise application and of the inference process.

Describe the solution you'd like

  1. Introduce a new parameter, max_inference_steps: This optional parameter controls the number of steps completed during inference before the outputs are returned, providing an early-exit option during denoising. For instance, if num_inference_steps=20 and max_inference_steps=10, the pipeline will output a latent that is half denoised.

  2. Introduce a new parameter, add_noise: This optional parameter, defaulting to True for backward compatibility, controls whether noise is added to the input in the SDXL Img2Img pipeline. When set to False, no further noise is added to the inputs. We need this so that the details from the base image are not overwritten by the refiner, which does not have strong composition in its data distribution.

  3. Introduce a new parameter, first_inference_step: This optional parameter, defaulting to None for backward compatibility, is intended for the SDXL Img2Img pipeline. When set, the pipeline will not use the strength parameter to reduce num_inference_steps; instead, the early timesteps will be skipped, keeping the base and refiner schedules aligned and reducing artifacts. A usage sketch combining all three parameters follows below.
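
The sketch below is hypothetical: none of max_inference_steps, add_noise, or first_inference_step exist in released diffusers, and the exact names may change in any eventual PR.

from diffusers import DiffusionPipeline, StableDiffusionXLImg2ImgPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# Schedule 20 steps but stop after 10, returning a half-denoised latent.
latents = base(
    prompt="An astronaut riding a green horse",
    output_type="latent",
    num_inference_steps=20,
    max_inference_steps=10,   # proposed: early exit from denoising
).images

refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

# Resume at step 10 without re-noising, so base-model detail is preserved
# and strength is not used to shrink the schedule.
images = refiner(
    prompt="An astronaut riding a green horse",
    image=latents,
    num_inference_steps=20,
    first_inference_step=10,  # proposed: skip the early timesteps
    add_noise=False,          # proposed: do not add noise to the input
).images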

Describe alternatives you've considered

Using the callback function from the base pipeline to retrieve the latents halfway through, and then passing them as latents to the Refiner model, is a hacky approach, but it is the only alternative currently available.
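
For reference, a minimal sketch of that callback workaround (assuming the pipeline's callback/callback_steps arguments); note that the base pipeline still spends compute finishing every scheduled step, which is exactly why it is a hack:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

captured = {}

def grab_latents(step, timestep, latents):
    # Stash the latents at the halfway point (step 10 of 20).
    if step == 10:
        captured["latents"] = latents.clone()

_ = pipe(
    prompt="An astronaut riding a green horse",
    num_inference_steps=20,
    callback=grab_latents,
    callback_steps=1,
)

# captured["latents"] can then be fed to the refiner img2img pipeline as its
# image input, but the full 20 base-model steps were still executed.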

Intended outcome

These enhancements will allow for more nuanced and precise control over the image generation and refining process within the SDXL pipeline, leading to more effective and desired results.

Kindly review and consider this proposal. I can begin this work, if it is accepted.

bghira commented 1 year ago

cc @patrickvonplaten @sayakpaul

sayakpaul commented 1 year ago

Seems like interesting additions! Thanks for being so detailed. Do you have some results for us to take a look at with these changes?

SytanSD commented 1 year ago

Hello, I am Sytan, the creator of said workflow. I have been working with Comfy himself to make an optimized SDXL workflow that is not only faster than the traditionally shared img2img workflow, but also higher quality. All of this is being run locally on a 3080 10GB GPU. I can attach some example images below.

[Attached comparison images, left to right in each: "Comfy Punk" (Punk), "Comfy Spring Forest" (Forest), "Comfy Statue" (Statue)]

I hope these results and time savings are satisfactory. I will be posting a live 10-image comparison vote to Reddit soon to confirm that people prefer these results in a blind test.

If any of you have questions about the workflow or other details, please refer back to @bghira

sayakpaul commented 1 year ago

Oh wow. The results are so amazing! 🤯

While reporting the timings, could you also shed some light on what you mean by "Img2Img" and "Base"?

SytanSD commented 1 year ago

I have just talked to Joe Penna from SAI, and he has given me the clear to share my information. I will be posting more info tomorrow when I have it all collected!

bghira commented 1 year ago

"img2img" refers to the standard diffusers example for the refiner, where all timesteps are completed by the SDXL base model before the refiner is applied.

"base" is just SDXL without the refiner.
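
In other words, roughly (a sketch of the two baselines being timed; the strength value is only an example):

from diffusers import DiffusionPipeline, StableDiffusionXLImg2ImgPipeline
import torch

base = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-0.9",
    torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
).to("cuda")

prompt = "An astronaut riding a green horse"

# "base": the base model fully denoises on its own.
image = base(prompt=prompt, num_inference_steps=20).images[0]

# "img2img": the refiner then re-denoises the finished image at some strength.
refined = refiner(prompt=prompt, image=image, strength=0.3).images[0]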

patrickvonplaten commented 1 year ago

Interesting thread! Let's wait for @SytanSD's answer then and see how we can most elegantly add the improved workflow.

bghira commented 1 year ago

@SytanSD and I have been working together on this enhancement and the Diffusers work will be done by myself, with the background and majority of the research having been done by him. My explanation above is from him :)

sayakpaul commented 1 year ago

Superb! Would be down to review the PRs!

JelloWizard commented 1 year ago

> I have just talked to Joe Penna from SAI, and he has given me the clear to share my information. I will be posting more info tomorrow when I have it all collected!

are you ScythSergal from Reddit?

bghira commented 1 year ago

I've spent time today implementing this in my partial-diffusion branch.

Here is the difference between a 0.5 strength result before and after the changes (with the new inference parameters):

[Attached image: before/after comparison]

bghira commented 1 year ago

For an existing venv:

pip uninstall diffusers
pip install -U git+https://github.com/bghira/diffusers@partial-diffusion

For a new venv:

pip install -U git+https://github.com/bghira/diffusers@partial-diffusion
pip install transformers accelerate safetensors
pip install "numpy>=1.17" "PyWavelets>=1.1.1" "opencv-python>=4.1.0.25"
pip install --no-deps invisible-watermark

This Python script demos the changes, with easily tweaked values:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

prompt = "An astronaut riding a green horse"

# We're going to schedule 20 steps, and complete 10 of them.
image = pipe(prompt=prompt, output_type="latent",
              num_inference_steps=20, final_inference_step=10).images

# If you have low VRAM, free the base pipeline before loading the refiner.
# del pipe
# import gc
# gc.collect()
# torch.cuda.empty_cache()
# torch.clear_autocast_cache()

# Put through the refiner now.
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16")
pipe.to("cuda") # OR, pipe.enable_sequential_cpu_offload() OR, pipe.enable_model_cpu_offload()

# if using torch < 2.0
# pipe.enable_xformers_memory_efficient_attention()

# If add_noise is False, we will no longer add noise during Img2Img.
# This is useful since the base details will be preserved. However,
# switching add_noise to True can yield truly interesting results.
images = pipe(prompt=prompt, image=image, add_noise=False,
               num_inference_steps=20, begin_inference_step=10).images

sayakpaul commented 1 year ago

API design looks clean to me! Let's maybe try to open a PR?

bghira commented 1 year ago

I might need assistance with the documentation fixes. I am also looking forward to feedback from anyone else who tests this, as they might discover an issue I have not.

Namely, I'm not sure whether I need to subtract 1 from the final timestep. Off-by-one errors are the bane of my existence. All I ask is careful code review, as I'm new to the subject material.
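
To illustrate where the off-by-one worry lives, here is a toy sketch (not the actual diff) that assumes the branch simply slices the scheduler's timesteps array at final_inference_step:

from diffusers import EulerDiscreteScheduler

sched = EulerDiscreteScheduler.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9", subfolder="scheduler"
)
sched.set_timesteps(20)

final_inference_step = 10
base_steps = sched.timesteps[:final_inference_step]     # indices 0..9, 10 steps
refiner_steps = sched.timesteps[final_inference_step:]  # indices 10..19
assert len(base_steps) + len(refiner_steps) == len(sched.timesteps)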

bghira commented 1 year ago

Wanted to note that this cuts the generation time on my system (4090, A100) by about 14 seconds: from 22-25 seconds down to about 8-11 seconds for a batch size of 4 at 1152x768.

Birch-san commented 1 year ago

If you'd like to do the same via diffusers + k-diffusion: I made such an ensemble pipeline a couple of days ago (code, release announcement).

I agree that this ensemble-of-expert-denoisers approach should be more efficient than "fully-denoise via base + partial re-denoise via img2img".

gunshi commented 12 months ago

Hi @bghira, I was searching the issues page for an answer to whether we can currently run inference in any pipeline starting from an intermediate timestep (say I add noise to an original image corresponding to timestep 200, and then I want to get the SD model's output on this noised image by telling it to assume it is currently starting from timestep 800).

As far as I can tell this is currently not possible in pipeline_stable_diffusion.py, since it calls scheduler.set_timesteps(), which constructs the timesteps array by spacing out the 0-1000 range (for SD) into num_inference_steps intervals. Your proposal above seems related to this feature, am I correct? If yes, could you point me to the relevant block of code where I can isolate the changes required to bootstrap inference from a particular timestep like I'm looking to do?
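
A rough sketch of the behavior described here, using DDPMScheduler purely for illustration; the manual noising and timestep slicing are one possible way to bootstrap from an intermediate timestep with the current API, not code from this proposal:

import torch
from diffusers import DDPMScheduler

sched = DDPMScheduler(num_train_timesteps=1000)
sched.set_timesteps(50)
# set_timesteps() spaces the full 0-1000 range into 50 descending values,
# e.g. tensor([980, 960, 940, ...]).

# Noise a clean latent as if at t=800, then keep only the timesteps at or
# below 800 for the denoising loop.
latents = torch.randn(1, 4, 64, 64)
noise = torch.randn_like(latents)
noisy = sched.add_noise(latents, noise, torch.tensor([800]))
remaining = sched.timesteps[sched.timesteps <= 800]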