huggingface / diffusers

🤗 Diffusers: State-of-the-art diffusion models for image and audio generation in PyTorch and FLAX.
https://huggingface.co/docs/diffusers
Apache License 2.0

does diffusers have the equivalent to hires fix from A1111? #3429

Closed alexblattner closed 11 months ago

alexblattner commented 1 year ago

hi, does diffusers have this: https://github.com/Kahsolt/stable-diffusion-webui-hires-fix-progressive

I realized that there's a lot of other features that I was not aware of when seeing this comment about how to get high quality results: https://www.reddit.com/r/StableDiffusion/comments/13gr5rg/comment/jk2g3vd/?utm_source=share&utm_medium=web2x&context=3

what would be the equivalent in diffusers?

patrickvonplaten commented 1 year ago

Hey @alexblattner,

I think you can build this already with diffusers, something like:

import torch
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

text2img = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
img2img = StableDiffusionImg2ImgPipeline(**text2img.components)

Now we generate an image and output the latent:

image = text2img("a prompt", num_inference_steps=20, output_type="pt")   # or output_type="latent"

upscaled_image = upscale(image)

upscaled_image = img2img("a prompt", num_inference_steps=20, strength=0.5, output_type="pt")

Could you try this and play around with it a bit?

Also cc @yiyixuxu, we should make sure that the Img2Img pipeline can accept latents as inputs.

alexblattner commented 1 year ago

@patrickvonplaten won't using multiple pipes be really heavy though (a single pipe is a couple of GBs)? Also, where is the upscale function coming from? My pipeline doesn't use traditional prompts; it instead uses an array of prompts to generate content in multiple different regions. Would I need a custom img2img then? Also, your img2img pipeline doesn't seem to receive an image.

It would be my pleasure to confirm things for you guys.

patrickvonplaten commented 1 year ago

The upscale function is just a simple PIL resize function: https://www.geeksforgeeks.org/python-pil-image-resize-method/

The img2img pipeline indeed takes an image as an input (sorry this was wrong above).

upscaled_image = img2img("a prompt", image=upscaled_image, num_inference_steps=20, strength=0.5, output_type="pt")

Also the pipelines share the same components, so memory is shared
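For anyone following along, here is a minimal end-to-end sketch of that flow with the corrected img2img call (a plain PIL resize stands in for the upscale step; any upscaler could be swapped in):

import torch
from PIL import Image
from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

text2img = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
img2img = StableDiffusionImg2ImgPipeline(**text2img.components)

prompt = "a prompt"

# first pass at the base resolution
image = text2img(prompt, num_inference_steps=20).images[0]

# stand-in upscale step: a plain PIL resize (swap in any upscaler here)
upscaled = image.resize((image.width * 2, image.height * 2), Image.LANCZOS)

# second pass refines the upscaled image
final = img2img(prompt, image=upscaled, num_inference_steps=20, strength=0.5).images[0]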

alexblattner commented 1 year ago

@patrickvonplaten thanks for the reply. I tried running the equivalent of this: img2img = StableDiffusionImg2ImgPipeline(**text2img.components) and it didn't work:

/mecomics-api/multiDiffusion.py:157 in __init__

    154   ):
    155     super().__init__()
    156
  ❱ 157     if hasattr(scheduler.config, "steps_offset") and scheduler.config.steps_offset !
    158       deprecation_message = (
    159         f"The configuration file of this scheduler: {scheduler} is outdated. `st
    160         f" should be set to 1 instead of {scheduler.config.steps_offset}. Please

AttributeError: 'tuple' object has no attribute 'config'

The upscale function is just a simple PIL resize function: https://www.geeksforgeeks.org/python-pil-image-resize-method/

Is that enough? There are many custom upscalers, such as foolhardy Remacri, that apparently do a great job. What is the logic behind passing a bigger image to img2img for the final picture? Will it achieve the same level of quality as an upscaler?

alexisrolland commented 1 year ago

I was interested in reproducing "Highres Fix" from A1111 as well, so I tried your suggestion @patrickvonplaten. Here are the results of my tests.

With output_type='latent'

# Generate image first
latent_images = pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    width=width,
    height=height,
    num_inference_steps=steps,
    guidance_scale=guidance,
    num_images_per_prompt=1,
    generator=generators,
    output_type='latent'
)

# Upscale intermediary result
latent_images.images[0] = latent_images.images[0].resize((width*4, height*4))

# Regenerate result with image2image pipeline
result = img2img_pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    image=latent_images.images[0],
    strength=0.5,
    num_inference_steps=steps,
    guidance_scale=guidance,
    num_images_per_prompt=1,
    generator=generators
)

# Downscale result back to expected size
result.images[0] = result.images[0].resize((width, height))

This returned the following error message:

RuntimeError("requested resize to (2048, 3072) ((2048, 3072) elements in total), but the given tensor has a size of 4x96x64 (24576 elements). autograd's resize can only change the shape of a given tensor, while preserving the number of elements. ")

I have also tried to reshape the tensor with torchvision

import torchvision.transforms as T

[...]
transform = T.Resize(size=(width*4, height*4))
upscaled_image = transform(latent_images.images[0])

This returned the following error message:

RuntimeError('Given groups=1, weight of size [128, 3, 3, 3], expected input[1, 4, 2048, 3072] to have 3 channels, but got 4 channels instead')
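For what it's worth, the reason both attempts fail is that with output_type='latent' the pipeline returns a [batch, 4, height/8, width/8] tensor rather than an RGB image, so neither PIL's resize nor the img2img VAE encoder (which expects 3 channels) can consume it. A rough sketch of resizing the latent and decoding it back to an image, reusing pipeline and latent_images from the snippet above (a sketch only; whether img2img accepts the result directly depends on the diffusers version):

import torch
import torch.nn.functional as F

# latents have shape [4, height // 8, width // 8]; add a batch dimension
latents = latent_images.images[0].unsqueeze(0)

# a channel-agnostic resize works on latents (unlike PIL / torchvision RGB transforms)
upscaled_latents = F.interpolate(latents, scale_factor=2, mode="nearest")

# decode back to image space before handing the result to img2img, which
# expects a 3-channel image (scaling_factor is ~0.18215 for SD 1.x VAEs)
with torch.no_grad():
    decoded = pipeline.vae.decode(
        upscaled_latents / pipeline.vae.config.scaling_factor
    ).sample  # [1, 3, height * 2, width * 2], values roughly in [-1, 1]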

With default output_type

I have tried with the default output type. Same code as above, but I removed output_type='latent' and used latent_images.images[0].resize((width*4, height*4)) to upscale the intermediary result. The pipelines run fine, and while the face of the character I generated definitely got better, the whole image got a lot blurrier. Any trick to avoid the blurry effect?

Example without the double inference: [image]

Example with the double inference: [image]

alexblattner commented 1 year ago

@alexisrolland I think the best approach would be to use the upscale pipeline

alexisrolland commented 1 year ago

@alexisrolland I think the best approach would be to use the upscale pipeline

I don't think using an ML-based upscaling model would make a significant difference in this process. The problem is not with resizing the image but rather with the img2img pipeline, which somehow gives blurred results.

alexisrolland commented 1 year ago

I have tried without output_type='latent' and by upscaling the image x2 instead of x4 to avoid the deformities. It still works but the output remains blurry...

# Generate image first
latent_images = pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    width=width,
    height=height,
    num_inference_steps=steps,
    guidance_scale=guidance,
    num_images_per_prompt=1,
    generator=generators
)

# Upscale intermediary result
latent_images.images[0] = latent_images.images[0].resize((width*2, height*2))

# Regenerate result with image2image pipeline
result = img2img_pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    image=latent_images.images[0],
    strength=0.5,
    num_inference_steps=steps,
    guidance_scale=guidance,
    num_images_per_prompt=1,
    generator=generators
)

# Downscale result back to expected size
result.images[0] = result.images[0].resize((width, height))

[image]

mimhle commented 1 year ago

I have tried this and had great success with the Real-ESRGAN upscaler.

Base image: [image]

After upscaling: [image]

Img2img: [image]

alexblattner commented 1 year ago

@mimhle how do you use that upscaler? Do you know how to use foolhardy Remacri by any chance?

alexblattner commented 1 year ago

@mimhle can you show your code by any chance?

mimhle commented 1 year ago

This is a snippet of my code:

# generate base image
pre_fix_imgs = tuple(txt2img(
  prompt,
  negative_prompt,
  image_num,
  width,
  height,
  scheduler,
  num_inference_steps,
  guidance_scale,
  noise_strength,
  initial_seed,
  clip_skip,
))

# enable tiling to reduce RAM usage
pipe.enable_vae_tiling()

# increase the size
width = int(width * hires_scale)
height = int(height * hires_scale)

# get the seeds and images
pre_fix_imgs, seeds = [i[0] for i in pre_fix_imgs], [i[1] for i in pre_fix_imgs]

# upscale the images
pre_fix_imgs = upscale(
  pre_fix_imgs,
  model_name = "RealESRGAN_x4plus",
  scale_factor = hires_scale,
  half_precision = False,
  tile = 700,
)

# run img2img
result = []
for i, img in enumerate(pre_fix_imgs):
  result.extend(txt2img(
    prompt,
    negative_prompt,
    1,
    width,
    height,
    scheduler,
    num_inference_steps,
    guidance_scale,
    noise_strength,
    seeds[i],
    clip_skip,
    img,
  ))

where the txt2img and upscale functions are just wrappers for the diffusers pipeline and the Real-ESRGAN upscaler respectively

and as for foolhardy Remacri, I have no experience using that upscaler

alexblattner commented 1 year ago

thanks @mimhle would you mind sharing the upscale function too?

mimhle commented 1 year ago

Part of my code was taken from xinntao's Real-ESRGAN GitHub repo

import contextlib
import os
from io import StringIO

import numpy
import PIL
import torch
from PIL import Image
from tqdm import tqdm

from realesrgan import RealESRGANer
from basicsr.archs.rrdbnet_arch import RRDBNet

def factorize(num: int, max_value: int) -> list[float]:
  result = []
  while num > max_value:
    result.append(max_value)
    num /= max_value
  result.append(round(num, 4))
  return result

def upscale(
    imgs: list[PIL.Image.Image],
    model_name: str = "RealESRGAN_x4plus",
    scale_factor: float = 4,
    half_precision: bool = False,
    tile: int = 0,
    tile_pad: int = 10,
    pre_pad: int = 0,
) -> list[PIL.Image.Image]:

  # check model
  if model_name == "RealESRGAN_x4plus":
    upscale_model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)
    netscale = 4
    file_url = "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.0/RealESRGAN_x4plus.pth"
  elif model_name == "RealESRNet_x4plus":
    upscale_model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=4)
    netscale = 4
    file_url = "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.1.1/RealESRNet_x4plus.pth"
  elif model_name == "RealESRGAN_x4plus_anime_6B":
    upscale_model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=6, num_grow_ch=32, scale=4)
    netscale = 4
    file_url = "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.2.4/RealESRGAN_x4plus_anime_6B.pth"
  elif model_name == "RealESRGAN_x2plus":
    upscale_model = RRDBNet(num_in_ch=3, num_out_ch=3, num_feat=64, num_block=23, num_grow_ch=32, scale=2)
    netscale = 2
    file_url = "https://github.com/xinntao/Real-ESRGAN/releases/download/v0.2.1/RealESRGAN_x2plus.pth"
  else:
    raise NotImplementedError("Model name not supported")

  # download model
  model_path = download_file(file_url, path="./upscaler-model", progress=False, interrupt_check=False)

  # declare the upscaler
  upsampler = RealESRGANer(
    scale=netscale,
    model_path=os.path.join("./upscaler-model", model_path),
    dni_weight=None,
    model=upscale_model,
    tile=tile,
    tile_pad=tile_pad,
    pre_pad=pre_pad,
    half=half_precision,
    gpu_id=None
  )

  # upscale
  torch.cuda.empty_cache()
  upscaled_imgs = []
  with tqdm(total=len(imgs)) as pb:
    for i, img in enumerate(imgs):
      img = numpy.array(img)
      outscale_list = factorize(scale_factor, netscale)
      with contextlib.redirect_stdout(StringIO()):
        for outscale in outscale_list:
          curr_img = upsampler.enhance(img, outscale=outscale)[0]
          img = curr_img
        upscaled_imgs.append(Image.fromarray(img))

      pb.update(1)
  torch.cuda.empty_cache()

  return upscaled_imgs

for the download function you can write it yourself or check out mine: https://gist.github.com/mimhle/fbad13f046a72ee7a683d28218650ef0
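In case it helps, calling that helper might look something like this (a sketch; base_images is assumed to be the list of PIL images from the first txt2img pass):

# upscale the base images x2 with the x4 network, tiling to keep VRAM bounded
hires_inputs = upscale(
    base_images,
    model_name="RealESRGAN_x4plus",
    scale_factor=2,
    half_precision=False,
    tile=512,
)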

alexblattner commented 1 year ago

thanks a lot @mimhle! Could you explain what these parameters do: scale_factor: float = 4, half_precision: bool = False, tile: int = 0, tile_pad: int = 10, pre_pad: int = 0?

I assume scale factor is by how much we want to increase the image size which in the default case would multiply the height and width by 4, right?

mimhle commented 1 year ago

yes
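To spell that out, the factorize helper above just splits the requested scale into passes that the chosen network can handle, so the x4 default is a single pass while larger factors are chained, e.g.:

factorize(4, 4)  # -> [4]        one x4 pass
factorize(2, 4)  # -> [2]        one pass with outscale=2
factorize(8, 4)  # -> [4, 2.0]   a x4 pass followed by a x2 pass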

alexblattner commented 1 year ago

thanks @mimhle, that was very useful. I'll check tomorrow to see how it works with Remacri

yujack333 commented 1 year ago

(quoting mimhle's snippet above)

Would you mind sharing the txt2img function too?

mimhle commented 1 year ago

my code is currently a mess but here you go:


import random

import PIL
import torch

# note: crop_to_nearest_multiple and is_all_black are the author's own helpers (not shown)
def txt2img(
    prompt: str = "",
    negative_prompt: str = "",
    image_num: int = 1,
    width: int = 512,
    height: int = 512,
    scheduler: str = "DDIMScheduler",
    num_inference_steps: int = 50,
    guidance_scale: float = 7.5,
    noise_strength: float = 0.6,
    initial_seed: int = -1,
    clip_skip: int = 1,
    image: PIL.Image.Image | None = None,
) -> list[tuple[PIL.Image.Image, str]]:

  # declare variables
  global pipe
  all_images = []

  # check image for img2img or inpainting
  if isinstance(image, dict):
    image, mask = list(image.values())
    image = image.convert('RGB')
    mask = mask.convert('RGB')
    if is_all_black(mask):
      mask = None
  else:
    mask = None

  # check for controlnet
  is_controlnet = getattr(pipe, "controlnet", False)

  # import scheduler
  if "Karras" in scheduler:
    scheduler = scheduler.replace("Karras", '')
    scheduler_karras_sigmas = True
  else:
    scheduler_karras_sigmas = False
  exec(f"from diffusers import {scheduler}")
  if scheduler_karras_sigmas:
    exec(f"pipe.scheduler = {scheduler}.from_config(pipe.scheduler.config, use_karras_sigmas=True)")
  else:
    exec(f"pipe.scheduler = {scheduler}.from_config(pipe.scheduler.config)")

  # generate seeds
  if initial_seed < 0:
    random.seed()
    initial_seed = random.randint(0, 18446744073709551615)
    random.seed(initial_seed)
  else:
    random.seed(initial_seed)
  seeds = [initial_seed]
  seeds.extend([random.randint(0, 18446744073709551615) for _ in range(image_num - 1)])
  generator = [torch.Generator(device=pipe.device.type).manual_seed(seed) for seed in seeds]

  # inference
  torch.cuda.empty_cache()
  for i in range(image_num):
    # resize image
    if is_controlnet or image:
      image = crop_to_nearest_multiple(image, 8)
      width, height = image.size
    # check controlnet
    if is_controlnet:
      images = pipe(
        prompt=prompt, 
        negative_prompt=negative_prompt, 
        width=width, 
        height=height, 
        num_inference_steps=num_inference_steps, 
        guidance_scale=guidance_scale, 
        generator=generator[i], 
        num_images_per_prompt=1,
        clip_skip = clip_skip,
        image=image,
      ).images
    else:
      images = pipe(
        prompt=prompt, 
        negative_prompt=negative_prompt, 
        width=width, 
        height=height, 
        num_inference_steps=num_inference_steps, 
        guidance_scale=guidance_scale, 
        generator=generator[i], 
        num_images_per_prompt=1,
        clip_skip = clip_skip,
        max_embeddings_multiples=5,
        image=image,
        mask_image=mask,
        strength=noise_strength,
      ).images

    all_images.extend(images)

  torch.cuda.empty_cache()

  # return the generated images paired with their seeds (as strings)
  return list(zip(all_images, map(str, seeds)))

(the pipeline I use is a modified lpw pipeline that supports txt2img, img2img and inpainting in one, so if you use compel or another pipeline you may want to modify my code to check for the inference type)

yujack333 commented 1 year ago

(quoting mimhle's txt2img function above)

Thanks a lot, it's really helpful. Following your code, I was able to enlarge the image and add a lot of detail. But I can't upscale my image by 2x or 4x at the img2img step because I run out of CUDA memory, even when I use pipe.enable_vae_tiling(). I used this pipeline: https://github.com/huggingface/diffusers/blob/main/examples/community/lpw_stable_diffusion.py#L412 which includes txt2img and img2img, and I found I can only control the upscale size through the input image size (not through the width and height arguments) at the img2img step. Please help me.

mimhle commented 1 year ago

@yujack333 try enabling attention slicing with pipe.enable_attention_slicing() or xformers memory-efficient attention with pipe.enable_xformers_memory_efficient_attention() to see if it helps.
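For reference, these (and a couple of other diffusers memory-saving switches) can be combined; they trade some speed for VRAM (availability depends on your diffusers version and installed extras):

pipe.enable_attention_slicing()                    # slice the attention computation
pipe.enable_vae_tiling()                           # decode/encode the VAE in tiles
pipe.enable_xformers_memory_efficient_attention()  # requires xformers
pipe.enable_model_cpu_offload()                    # requires accelerate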

alexisrolland commented 1 year ago

Hi @mimhle

I find I get varying results on the second diffusion pass with img2img, depending on the values I provide for noise strength, steps, scheduler... Which values do you use? Do you reuse the same ones as for txt2img? Would you have any recommendations please? I still get some blurry output for some reason...

I tried:

It improved the results a bit, but there's still some blur.

mimhle commented 1 year ago

@alexisrolland this is my setting (everything is the same for both steps):

All I can think of is that the problem here is the scheduler, since some give better results with background clarity. (Also, maybe a higher noise strength: 0.6-0.8?) Can you share your settings so I can try them with my code?

alexisrolland commented 1 year ago

Thanks @mimhle. Here are the settings I'm using:

Then for the second img2img pass, I found I get better results with this:

mimhle commented 1 year ago

@alexisrolland Same problem here. I think this has something to do with the model itself.

zwj536 commented 1 year ago

(quoting mimhle's Real-ESRGAN results above)

Hi @mimhle, is the output of your code equivalent to Hires. fix from A1111?

mimhle commented 1 year ago

Sorry, @zwj536, I currently don't have time to test this against A1111's web UI (but I think there will be some differences between the two).

alexblattner commented 1 year ago

@mimhle just so you know, this exists: https://github.com/ai-forever/Real-ESRGAN. It is significantly easier to use and does the same thing.
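For anyone curious, the usage from that repo's README is roughly the following (untested here, so treat the exact API as an assumption):

import torch
from PIL import Image
from RealESRGAN import RealESRGAN  # the ai-forever/Real-ESRGAN package linked above

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = RealESRGAN(device, scale=4)
model.load_weights("weights/RealESRGAN_x4.pth", download=True)

image = Image.open("base.png").convert("RGB")
sr_image = model.predict(image)  # returns an upscaled PIL image
sr_image.save("base_x4.png")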

marcelogdeandrade commented 1 year ago

I think the solution to the initial question has changed a bit. Using an ESRGAN upscaler is quite different from the hires fix.

(quoting patrickvonplaten's suggestion above)

@patrickvonplaten I've tried going in this direction, but how should I upscale the latent output? It is a tensor, not a PIL image.

patrickvonplaten commented 1 year ago

Think the upscale pipeline should accept latents

yiyixuxu commented 1 year ago

@patrickvonplaten will refactor

marcelogdeandrade commented 1 year ago

Think the upscale pipeline should accept latents

So the suggestion is to apply the StableDiffusionLatentUpscalePipeline between the txt2img and img2img? Isn't that pipeline a lot more complex than just resizing an image?
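For context, chaining the latent upscaler between the two passes looks roughly like this in the diffusers docs (a sketch based on the stabilityai/sd-x2-latent-upscaler example; it does add a second model on top of the base pipeline):

import torch
from diffusers import StableDiffusionPipeline, StableDiffusionLatentUpscalePipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained(
    "stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16
).to("cuda")

prompt = "a prompt"
generator = torch.Generator("cuda").manual_seed(0)

# first pass: keep the result in latent space
low_res_latents = pipe(prompt, generator=generator, output_type="latent").images

# second pass: the latent upscaler doubles the latent resolution
image = upscaler(
    prompt=prompt,
    image=low_res_latents,
    num_inference_steps=20,
    guidance_scale=0,
    generator=generator,
).images[0]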

jelling commented 1 year ago

I haven't tested it but this project says they match the Automatic1111 High-Res fix using diffusers img2img: https://github.com/keisuke-okb/S2D2/tree/main

@alexblattner to confirm, you're saying that Real-ESRGAN is easier to use than any of the other suggested options? It looks super easy; I'm just confused as to why anyone would use something else, but perhaps I'm missing something.

alexblattner commented 1 year ago

@jelling I'll be honest with you, I don't understand why people would use hires fix over a plain upscale; it looks the same to me. I kind of gave up because of that. I hope someone gives an answer that goes beyond regular upscaling.

jelling commented 1 year ago

@alexblattner appreciate the candor.

Having spent more time with Automatic and dug into the codebase, here is what I think is happening for anyone joining the thread:

  • OP asked for upscaling ala Automatic
  • Automatic's "high res fix" was originally just a simple latent scaling method
  • HF is adding support for upscaling images via latents because it's not a big change
  • Automatic has since added support for upscaling using R-ESRGAN and other methods
  • R-ESRGAN and other methods are typically entirely additional models

If the above is correct, anyone wanting more advanced upscaling methods than the latent method should just run the output from diffusers through their chosen upscaling model (ex. R-ESRGAN).

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

wilfrediscoming commented 1 year ago

(quoting jelling's summary above)

How do you run an upscaling model (e.g. R-ESRGAN) with diffusers?

patrickvonplaten commented 1 year ago

R-ESRGAN is not in diffusers since it's a GAN model

n00mkrad commented 1 year ago

So, is there a method to generate an image, upscale the latents, and feed it into img2img, like the original highres fix?

Like OP, I do NOT want a separate upscaling model.

wilfrediscoming commented 1 year ago

So, is there a method to generate an image, upscale the latents, and feed it into img2img, like the original highres fix?

Like OP, I do NOT want a separate upscaling model.

If you upscale the latents, you will need to use the latent upscaler pipeline (StableDiffusionLatentUpscalePipeline).

n00mkrad commented 1 year ago

But that is locked to 2x scaling and requires a separate model. It's a different technique.

sarmientoj24 commented 1 year ago

@patrickvonplaten where is the updated documentation for the A1111 Hires. fix equivalent in diffusers?

sergeykorablin commented 1 year ago

After some experiments I found a good pipeline for myself.

Image created by the text2img model: [image]

Text2img scaled x2 by ESRGAN (@mimhle's code): [image]

ESRGAN image repainted without scaling by img2img with the same prompt (text2img -> x2 ESRGAN -> img2img): [image]

I'm happy with the final result.

patrickvonplaten commented 1 year ago

Can you just use the R-ESRGAN model individually afterward?

alexblattner commented 1 year ago

@jelling I'm aware of the upscale factor, denoising strength, and hires steps. Do you happen to know how to replicate those parameters?

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

alexblattner commented 7 months ago

The way hires fix works is that you create the image at a low resolution, upscale it with an ESRGAN upscaler (shrinking the result if necessary), then run img2img on it.
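As a rough illustration (an assumed mapping, not an official equivalence), the A1111 hires-fix parameters line up with the second img2img pass roughly like this, where img2img and the ESRGAN-upscaled image come from the snippets earlier in the thread:

# "Upscale by"         -> how much the image is enlarged before the second pass
# "Denoising strength" -> strength
# "Hires steps"        -> roughly int(num_inference_steps * strength), since
#                         img2img only runs the last strength-fraction of the schedule
refined = img2img(
    prompt,
    image=upscaled_image,    # ESRGAN-upscaled (and optionally shrunk) image
    strength=0.5,
    num_inference_steps=40,  # with strength=0.5 this is ~20 actual denoising steps
).images[0]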

plenoi commented 7 months ago

@alexblattner in img2img, how can you fix the CUDA memory issue?

plenoi commented 7 months ago

@mimhle I have a problem when I do img2img: when the width & height change, it produces a different image. What is the pipe that you use?

# Load main model
pipe = diffusers.StableDiffusionPipeline.from_single_file(
    model, torch_dtype=torch_dtype, safety_checker=None
).to("cuda")

mimhle commented 7 months ago

@plenoi I use the lpw custom pipeline, but I don't think that's the issue here. When you do img2img, the width and height of the output image should be the same as the input image's. What do you mean by the size changing?