lllyasviel / stable-diffusion-webui-forge

GNU Affero General Public License v3.0

Implement Align Your Steps scheduler #726

Closed blob42 closed 4 months ago

blob42 commented 4 months ago

Signed-off-by: blob42 contact@blob42.xyz

Description

Implement the Align Your Steps scheduler.

Credit to this PR, which I used as a base.

Screenshots/videos:

SDXL

xyz_grid-0012-3839149266

xyz_grid-0011-3839149266

SD1.5

xyz_grid-0019-1792503894

Checklist:

dan4ik94 commented 4 months ago

@huchenlei would you merge this please, when you have time.

Panchovix commented 4 months ago

Is there a way to add Restart sampler with this scheduler?

dan4ik94 commented 4 months ago

Is there a way to add Restart sampler with this scheduler?

Appending to the samplers list in forge_alter_samplers.py should do the trick:

sd_samplers_common.SamplerData('Restart AYS', build_constructor(sampler_name='restart', scheduler_name='ays'), ['restart_ays'], {}),


Panchovix commented 4 months ago

@dan4ik94

Is there a way to add Restart sampler with this scheduler?

Appending to the samplers list in forge_alter_samplers.py should do the trick:

sd_samplers_common.SamplerData('Restart AYS', build_constructor(sampler_name='restart', scheduler_name='ays'), ['restart_ays'], {}),


Restart AYS doesn't seem to work sadly.

AttributeError: module 'ldm_patched.k_diffusion.sampling' has no attribute 'sample_restart'
Koitenshin commented 4 months ago

AYS can look much better than what you have here, but someone else needs to do the programming. I tried on my own installation of A1111 but couldn't figure out how.

We need to take the 14.615 sigma and automatically stretch it out over the total number of steps the user selects. It's extremely rudimentary down-sloping, but I used the following sigmas for 32 steps (complex prompts require more steps). My image turned out much better using 32 sigmas over 32 steps than 11 sigmas over 32 steps.

sigmas = [14.615, 14.158, 13.702, 13.245, 12.788, 12.331, 11.875, 11.418, 10.961, 10.505, 10.048, 9.591, 9.134, 8.678, 8.221, 7.764, 7.308, 6.851, 6.394, 5.937, 5.481, 5.024, 4.567, 4.110, 3.654, 3.197, 2.740, 2.284, 1.827, 1.370, 0.913, 0.457, 0]

blob42 commented 4 months ago

@Koitenshin I will do some experiments and an improved implementation when I have some free time.

blob42 commented 4 months ago

@Koitenshin I am not well versed in the math behind sampling schedulers; I merely ported the existing A1111 implementation here. It looks like Comfy also uses the same algorithm.

Could you explain the heuristic used to stretch the sigmas over the steps?

dan4ik94 commented 4 months ago

AYS can look much better than what you have here, but someone else needs to do the programming. I tried on my own installation of A1111 but couldn't figure out how.

We need to take the 14.615 sigma and automatically stretch it out over the total number of steps the user selects. It's extremely rudimentary down-sloping, but I used the following sigmas for 32 steps (complex prompts require more steps). My image turned out much better using 32 sigmas over 32 steps than 11 sigmas over 32 steps.

sigmas = [14.615, 14.158, 13.702, 13.245, 12.788, 12.331, 11.875, 11.418, 10.961, 10.505, 10.048, 9.591, 9.134, 8.678, 8.221, 7.764, 7.308, 6.851, 6.394, 5.937, 5.481, 5.024, 4.567, 4.110, 3.654, 3.197, 2.740, 2.284, 1.827, 1.370, 0.913, 0.457, 0]

Can you show some image examples with 11 and 32 sigmas?

Koitenshin commented 4 months ago

@Koitenshin I am not well versed in the math behind sampling schedulers; I merely ported the existing A1111 implementation here. It looks like Comfy also uses the same algorithm.

Could you explain the heuristic used to stretch the sigmas over the steps?

@blob42 No heuristic at all. I simply took 14.615 and divided it into 32 parts. Each value was rounded up or down to "precision 3" (3 decimal places), although I wish I could force it to use "precision 10" for even cleaner results.
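The division described above can be sketched in a few lines (a minimal sketch; the step size and 3-decimal rounding are taken from the description, not from any actual code in this PR):

```python
# Sketch of the linear down-slope described above: sigma_max = 14.615 split
# into 32 equal decrements, each value rounded to 3 decimal places
# ("precision 3"), with a final 0.0 appended -- 33 values for 32 steps.
sigma_max = 14.615
n_steps = 32
step = sigma_max / n_steps
sigmas = [round(sigma_max - i * step, 3) for i in range(n_steps)] + [0.0]
print(sigmas)  # starts 14.615, 14.158, ... and ends 0.457, 0.0
```

This reproduces the 32-step list posted earlier in the thread.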

@dan4ik94 I can try, although some people will argue with the results. 11 Sigmas over 32 Steps looks nice, but it lacks coherence compared to 32 Sigmas over 32 Steps.

First prompt is from here: https://prompthero.com/prompt/cf5ed5a0881
Second prompt is from here: https://prompthero.com/prompt/1107ce59578
Third prompt is from here: https://prompthero.com/prompt/cef4653ee67

Here is a link to the 4 grids for side by side comparison. I used multiple samplers (DPM++ 2S a, DPM2, Euler, & Heun) in different images so you can see better results. 11 Sigmas only performs really well under Heun with complex prompts.

https://imgur.com/a/NQLCD4M

As you can see, Sigmas should be stretched over the amount of steps you use for better prompt coherence.

As for the testing, you will not be able to replicate my results. I'm using a lot of custom forked & edited code that I haven't uploaded to a repo yet, along with 2k generation using a 64k resized seed. I'd use a higher resized seed but 8GB of VRAM on my 3060TI gets maxed out by 64k. I'm also using an SD v1.5 based model for these results, SDXL will have to wait until my setup plays nice with it.

You can test the sigmas yourselves. In ldm_patched\k_diffusion\sampling.py, add:

def get_sigmas_ays_11(n, sigma_min, sigma_max, is_sdxl=False, device='cpu'):
    # https://research.nvidia.com/labs/toronto-ai/AlignYourSteps/howto.html
    # Note: relies on the `torch` and `numpy as np` imports already present in sampling.py.
    def loglinear_interp(t_steps, num_steps):
        """
        Performs log-linear interpolation of a given array of decreasing numbers.
        """
        xs = torch.linspace(0, 1, len(t_steps))
        ys = torch.log(torch.tensor(t_steps[::-1]))

        new_xs = torch.linspace(0, 1, num_steps)
        new_ys = np.interp(new_xs, xs, ys)

        interped_ys = torch.exp(torch.tensor(new_ys)).numpy()[::-1].copy()
        return interped_ys

    if is_sdxl:
        # DEFAULT SDXL SIGMAS #
        sigmas = [14.615, 6.315, 3.771, 2.181, 1.342, 0.862, 0.555, 0.380, 0.234, 0.113, 0.029]
    else:
        # DEFAULT SD 1.5 SIGMAS #
        sigmas = [14.615, 6.475, 3.861, 2.697, 1.886, 1.396, 0.963, 0.652, 0.399, 0.152, 0.029]

    if n != len(sigmas):
        sigmas = np.append(loglinear_interp(sigmas, n), [0.0])
    else:
        sigmas.append(0.0)

    return torch.FloatTensor(sigmas).to(device)

def get_sigmas_ays_32(n, sigma_min, sigma_max, is_sdxl=False, device='cpu'):
    # https://research.nvidia.com/labs/toronto-ai/AlignYourSteps/howto.html
    def loglinear_interp(t_steps, num_steps):
        """
        Performs log-linear interpolation of a given array of decreasing numbers.
        """
        xs = torch.linspace(0, 1, len(t_steps))
        ys = torch.log(torch.tensor(t_steps[::-1]))

        new_xs = torch.linspace(0, 1, num_steps)
        new_ys = np.interp(new_xs, xs, ys)

        interped_ys = torch.exp(torch.tensor(new_ys)).numpy()[::-1].copy()
        return interped_ys

    if is_sdxl:
        # EXTREME PRECISION TEST #
        sigmas = [14.61500000000000000, 11.14916180000000000, 8.505221270000000000, 6.488271510000000000, 5.437074020000000000, 4.603986190000000000, 3.898547040000000000, 3.274074570000000000, 2.743965270000000000, 2.299686590000000000, 1.954485140000000000, 1.671087150000000000, 1.428781520000000000, 1.231810090000000000, 1.067896490000000000, 0.925794430000000000, 0.802908860000000000, 0.696601210000000000, 0.604369030000000000, 0.528525520000000000, 0.467733440000000000, 0.413933790000000000, 0.362581860000000000, 0.310085170000000000, 0.265189250000000000, 0.223264610000000000, 0.176538770000000000, 0.139591920000000000, 0.105873810000000000, 0.055193690000000000, 0.028773340000000000, 0.015000000000000000]
    else:
        # EXTREME PRECISION TEST #
        sigmas = [14.61500000000000000, 11.23951352000000000, 8.643630810000000000, 6.647294240000000000, 5.572508620000000000, 4.716485460000000000, 3.991960650000000000, 3.519560900000000000, 3.134904660000000000, 2.792287880000000000, 2.487736280000000000, 2.216638650000000000, 1.975083510000000000, 1.779317200000000000, 1.614753350000000000, 1.465409530000000000, 1.314849000000000000, 1.166424970000000000, 1.034755470000000000, 0.915737440000000000, 0.807481690000000000, 0.712023610000000000, 0.621739000000000000, 0.530652020000000000, 0.452909600000000000, 0.374914550000000000, 0.274618190000000000, 0.201152900000000000, 0.141058730000000000, 0.066828810000000000, 0.031661210000000000, 0.015000000000000000]

    if n != len(sigmas):
        sigmas = np.append(loglinear_interp(sigmas, n), [0.0])
    else:
        sigmas.append(0.0)

    return torch.FloatTensor(sigmas).to(device)
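For reference, the log-linear interpolation helper above can be exercised on its own. Here is an equivalent pure-NumPy sketch (torch dropped for brevity; this is an assumption-labeled rewrite, not code from the PR) that stretches the 11 default SD 1.5 sigmas to an arbitrary step count:

```python
import numpy as np

def loglinear_interp(t_steps, num_steps):
    """Log-linear interpolation of a decreasing sigma schedule
    (NumPy-only equivalent of the helper in the snippet above)."""
    xs = np.linspace(0, 1, len(t_steps))
    # Interpolate in log-sigma space; reverse so the values are increasing.
    ys = np.log(np.asarray(t_steps[::-1], dtype=np.float64))
    new_xs = np.linspace(0, 1, num_steps)
    new_ys = np.interp(new_xs, xs, ys)
    # Undo the log and restore the decreasing order.
    return np.exp(new_ys)[::-1]

# Default SD 1.5 AYS sigmas (11 values) stretched to 20 steps.
sd15 = [14.615, 6.475, 3.861, 2.697, 1.886, 1.396, 0.963, 0.652, 0.399, 0.152, 0.029]
stretched = loglinear_interp(sd15, 20)
print(len(stretched), round(float(stretched[0]), 3), round(float(stretched[-1]), 3))
```

The endpoints (14.615 and 0.029) are preserved, and the intermediate values follow the original curve in log space rather than a straight line.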

In ldm_patched\modules\samplers.py around lines 666-680 add the following:

    elif scheduler_name == "ays_11":
        sigmas = k_diffusion_sampling.get_sigmas_ays_11(n=steps, sigma_min=float(model.model_sampling.sigma_min), sigma_max=float(model.model_sampling.sigma_max), is_sdxl=is_sdxl)
    elif scheduler_name == "ays_32":
        sigmas = k_diffusion_sampling.get_sigmas_ays_32(n=steps, sigma_min=float(model.model_sampling.sigma_min), sigma_max=float(model.model_sampling.sigma_max), is_sdxl=is_sdxl)

And in modules_forge\forge_alter_samplers.py add the following:

    sd_samplers_common.SamplerData('Euler AYS 11', build_constructor(sampler_name='euler', scheduler_name='ays_11'), ['euler_ays_11'], {}),
    sd_samplers_common.SamplerData('Euler A AYS 11', build_constructor(sampler_name='euler_ancestral', scheduler_name='ays_11'), ['euler_ancestral_ays_11'], {}),
    sd_samplers_common.SamplerData('DPM++ 2M AYS 11', build_constructor(sampler_name='dpmpp_2m', scheduler_name='ays_11'), ['dpmpp_2m_ays_11'], {}),
    sd_samplers_common.SamplerData('DPM++ 2M SDE AYS 11', build_constructor(sampler_name='dpmpp_2m_sde', scheduler_name='ays_11'), ['dpmpp_2m_sde_ays_11'], {}),
    sd_samplers_common.SamplerData('Euler AYS 32', build_constructor(sampler_name='euler', scheduler_name='ays_32'), ['euler_ays_32'], {}),
    sd_samplers_common.SamplerData('Euler A AYS 32', build_constructor(sampler_name='euler_ancestral', scheduler_name='ays_32'), ['euler_ancestral_ays_32'], {}),
    sd_samplers_common.SamplerData('DPM++ 2M AYS 32', build_constructor(sampler_name='dpmpp_2m', scheduler_name='ays_32'), ['dpmpp_2m_ays_32'], {}),
    sd_samplers_common.SamplerData('DPM++ 2M SDE AYS 32', build_constructor(sampler_name='dpmpp_2m_sde', scheduler_name='ays_32'), ['dpmpp_2m_sde_ays_32'], {}),
blob42 commented 4 months ago

I will do some experiments in a separate branch; it should be fairly easy to update the code to dynamically split the sigmas.

As for the testing, you will not be able to replicate my results. I'm using a lot of custom forked & edited code that I haven't uploaded to a repo yet, along with 2k generation using a 64k resized seed.

You got me curious about this: what do you mean by 2k generation using a 64k seed resize?

Koitenshin commented 4 months ago

I will do some experiments in a separate branch; it should be fairly easy to update the code to dynamically split the sigmas.

As for the testing, you will not be able to replicate my results. I'm using a lot of custom forked & edited code that I haven't uploaded to a repo yet, along with 2k generation using a 64k resized seed.

You got me curious about this: what do you mean by 2k generation using a 64k seed resize?

Every single small image on those xyz grids is a 1080x1920 portrait, you can blow them all up on the Imgur album to view them in their entirety.

There's a checkbox labeled 'Extra' next to the Seed on the A1111 interface. On this subpanel you have two options: Resize seed from width & Resize seed from height. I have the options set to 34560 x 61440 (64K, as in 4K Ultra HD, 8K, 16K, etc). The higher you can set this number, the more detail you can draw out of your image.

I'm talking pure generation, no complicated workflows, inpainting, loras, negative prompts, upscaling, control nets, or whatever else hasn't been thought of yet.

Even when A1111 first came out I was generating 1080p (without hires fix) because I modified the source code and fixed the tensor math. I can't remember what I modified but the issue lies in processing.py. These days I just use a modified version of Kohya's Hi Res Fix by wcde because it gives me even more options for scaling than I had doing the aforementioned modification.

EDIT: SDXL testing happened, thanks only to Forge and NeverOOM. 32 sigmas are still better than 11 stretched over 32 steps. Forge really needs to split the samplers and schedulers like A1111, ASAP; it would make testing this stuff so much easier. New sigmas are edited into the code post above for both SD 1.5 & SDXL testing.