vitoplantamura / OnnxStream

Lightweight inference library for ONNX files, written in C++. It can run SDXL on a Raspberry Pi Zero 2, but also Mistral 7B on desktops and servers.

SDXL turbo/SDXL importing #44

Closed · AeroX2 closed this 6 months ago

AeroX2 commented 7 months ago

Hello,

I've recently been trying to import SDXL Turbo into OnnxStream. As far as I understand, the model architecture is the same; only the trained weights are different.

I've tried a few different approaches to import just the normal SDXL model, to confirm everything works (optimum export, torch.onnx.export, different Python versions, modifying onnx2txt, etc.), but each attempt fails with a different error and I've had no luck so far.

Would you be able to share a script, or describe the method you used, to convert an SDXL model to ONNX? The Python version and the versions of the libraries you used would also help greatly.

Thanks

vitoplantamura commented 7 months ago

hi,

regarding the UNET model, you could try running this code in a Notebook, then running OnnxSimplifier on the result and then running my onnx2txt:

from diffusers import StableDiffusionXLPipeline
import torch
import torch.nn as nn

# Wrapper that flattens the UNet's added_cond_kwargs dict into plain tensor
# inputs, so the exported ONNX graph has simple named inputs.
class UNetModel(nn.Module):
    def __init__(self, unet):
        super(UNetModel, self).__init__()
        self.unet = unet
    def forward(self, sample, timestep, encoder_hidden_states, text_embeds, time_ids):
        out_sample = self.unet(return_dict=False,
            sample=sample, timestep=timestep, encoder_hidden_states=encoder_hidden_states,
            added_cond_kwargs={ "text_embeds": text_embeds, "time_ids": time_ids })
        return out_sample

with torch.no_grad():

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", use_safetensors=True
    )

    # The dummy inputs fix the exported shapes: 128x128 latents -> 1024x1024 images.
    dummy_input = ( torch.randn(1, 4, 128, 128), torch.randn(1),
        torch.randn(1, 77, 2048),
        torch.randn(1, 1280), torch.randn(1, 6))
    input_names = [ "sample", "timestep",
        "encoder_hidden_states",
        "text_embeds", "time_ids" ]
    output_names = [ "out_sample" ]

    torch.onnx.export(UNetModel(pipe.unet), dummy_input, "/home/vito/Downloads/sdxl_unet.onnx", verbose=False,
        input_names=input_names, output_names=output_names,
        opset_version=14, do_constant_folding=True)
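After the export, the simplification step could look something like this (an untested sketch using the onnxsim Python package; a model this large may instead require the onnxsim command-line tool or external-data handling):

import onnx
from onnxsim import simplify

# Simplify the exported graph before converting it with onnx2txt.
model = onnx.load("/home/vito/Downloads/sdxl_unet.onnx")
model_simplified, check = simplify(model)
assert check, "the simplified model failed the validation check"
onnx.save(model_simplified, "/home/vito/Downloads/sdxl_unet_simplified.onnx")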

Let me know,

Thanks, Vito

AeroX2 commented 7 months ago

Thanks, this helped me a lot. I was able to port over SDXL Turbo; you can view the Hugging Face model here: https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream

OnnxStream needs some small modifications to sd.cpp, which I've made here: https://github.com/AeroX2/OnnxStream

It still needs some more work, though. I've been trying one of the example prompts, "A cinematic shot of a baby racoon wearing an intricate italian priest robe.", and it seems to ignore the racoon part.

vitoplantamura commented 7 months ago

this is extremely cool!

I plan to try your fork ASAP.

Would you consider creating a PR?

Thanks, Vito

AeroX2 commented 7 months ago

Yep, that's the plan. I'll create a PR once I get some time to look into and fix the following issues.

I've figured out that the racoon issue was due to a misspelling: it should be "raccoon" (perhaps an issue with OnnxStream's tokenizer). I'm currently trying to figure out why OnnxStream's output differs from diffusers'; I suspect the VAE has some issue, but I'm still looking into it.
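One way to check the tokenizer theory is to compare how the reference CLIP tokenizer splits the two spellings (a sketch assuming the transformers package; the exact subword pieces depend on the vocabulary):

from transformers import CLIPTokenizer

# SDXL's first text encoder uses a CLIP BPE tokenizer; if "racoon" splits into
# unrelated subword pieces, the prompt can lose the subject entirely.
tok = CLIPTokenizer.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="tokenizer")
print(tok.tokenize("racoon"))
print(tok.tokenize("raccoon"))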

lustfeind commented 7 months ago

Sorry for asking, AeroX2... I compiled it without errors, but image generation fails. It says this file is missing:

_2F_unet_2F_down_5F_blocks_2E_1_2F_attentions_2E_0_2F_transformer_5F_blocks_2E_0_2F_ff_2F_net_2E_0_2F_Constant_5F_output_5F_0.bin

I can't find it here: https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream/blob/main/sdxl_unet_fp16/

AeroX2 commented 7 months ago

Hey @lustfeind

Looks like the Hugging Face web interface silently dropped all the other files. I've just added a new commit that should contain the rest of them; give it another go and let me know.

vitoplantamura commented 7 months ago

I can confirm that it works.

Apart from the "raccoon" problem, I did several tests with OnnxStream and with the HF Diffusers in a Colab.

The problem is that SDXL Turbo outputs 512x512 images, while OnnxStream generates 1024x1024 images (in its current configuration). @AeroX2: could this be the origin of the different outputs?

lustfeind commented 7 months ago

Thanks, it works now, but I think I did something wrong: a 3-step image at 1024x1024 (it would be great to be able to set the resolution via the command line, e.g. 768x512).

[image: image_34562423]

AeroX2 commented 7 months ago

Took a quick look at this today, and yeah, it seems like it isn't the VAE: I was able to decode a latent file produced by OnnxStream with the diffusers VAE, and it produced the same output.
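For anyone who wants to reproduce that check, it was roughly along these lines (a sketch; the latent dump filename, shape and dtype are assumptions, not OnnxStream's actual output format):

import numpy as np
import torch
from diffusers import AutoencoderKL

# Hypothetical latent dump: raw float32, NCHW layout.
latents = torch.from_numpy(
    np.fromfile("latents.bin", dtype=np.float32)).reshape(1, 4, 128, 128)

vae = AutoencoderKL.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", subfolder="vae")
with torch.no_grad():
    # diffusers divides the latents by the VAE scaling factor before decoding.
    image = vae.decode(latents / vae.config.scaling_factor).sample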

@vitoplantamura I think you might be onto something: the sample noise latent being passed to the UNET in diffusers is 64x64 instead of 128x128. I'm going to try exporting with a 64x64 latent input size tomorrow and see how I go. Also, considering how the images OnnxStream produces are "overlapping" on top of each other, I suspect this is the right track to go down.

[image: sdxlturbo-mine-1steps-seed-239616]

@lustfeind Are you using the new --turbo flag I introduced? The image looks like it's being corrupted by the negative prompt; SDXL Turbo doesn't support negative prompting.

vitoplantamura commented 7 months ago

I tried to generate some images with the HF Diffusers, setting width=1024 and height=1024, and the result is very similar to the images produced by OnnxStream (i.e. repetition of subjects or parts of the image). So I think we should have OnnxStream generate 512x512 images instead of 1024x1024 (maybe we can provide a flag just for SDXL Turbo to support both resolutions: 512x512 and 1024x1024).
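For reference, the effect can be reproduced with something like this (a sketch; the model id, step count and guidance value are assumptions, not the exact notebook code):

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained(
    "stabilityai/sdxl-turbo", torch_dtype=torch.float16).to("cuda")
# SDXL Turbo targets 512x512; forcing 1024x1024 reproduces the repeated subjects.
image = pipe("A cinematic shot of a baby raccoon wearing an intricate italian priest robe.",
    width=1024, height=1024, num_inference_steps=3, guidance_scale=0.0).images[0]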

@AeroX2: Yes, you should re-export the SDXL Turbo UNET by specifying torch.randn(1, 4, 64, 64) for the "sample" input. Then, in the OnnxStream code, we can use tiled decoding to decode the latents into the final image. The current tiled decoder supports latents with shape (1, 4, 32, 32), so 3x3 (i.e. 9) steps will be needed to generate the final image.
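Concretely, the only change needed in the export script earlier in this thread is the "sample" tensor; the other inputs stay the same:

# 64x64 latents correspond to 512x512 output images.
dummy_input = ( torch.randn(1, 4, 64, 64), torch.randn(1),
    torch.randn(1, 77, 2048),
    torch.randn(1, 1280), torch.randn(1, 6))

With 64x64 latents and 32x32 tiles, the decoder covers the latent in a 3x3 grid of (presumably overlapping) tiles, which is where the 9 steps come from.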

On this part, let me know if you need assistance.

Vito

lustfeind commented 7 months ago

Two flags for width and height would be great; there are several custom turbo models out there that generate well at higher resolutions. Also, a tutorial for n00bs on how to convert models for OnnxStream would be highly appreciated.

AeroX2 commented 7 months ago

Reducing the UNET input size did the trick; you can check out the results on the model card: https://huggingface.co/AeroX2/stable-diffusion-xl-turbo-1.0-onnxstream

As an additional benefit, because the UNET model is smaller, the inference time for 3 steps went from 67180ms to 29499ms!

I'm planning on making the PR soon; I just need to figure out the tiling, modify the code, and export the model correctly, but it should hopefully be soon...

> Two flags for width and height would be great; there are several custom turbo models out there that generate well at higher resolutions.

This would be nice. Although I'm not familiar with how that's done with Stable Diffusion, it might be something I'll take a look into, but I'll see how things play out.

> Also, a tutorial for n00bs on how to convert models for OnnxStream would be highly appreciated.

I might also take a look into this, although sometimes I needed to fiddle with the model.txt a little, and onnx was quite picky about the Python version it wanted, so I'm not sure how n00b-friendly we can make it. Either way, I'll probably write some stuff down, just so future people at least have a rough idea of how they might go about converting a custom model.

vitoplantamura commented 7 months ago

@AeroX2: Thanks for the PR! I plan to review it ASAP, including testing on the Raspberry Pi Zero 2 :-)

Regarding the custom width and height, the problem is that this version of OnnxStream does not support input tensors with variable dimensions. This means that you would need to export the UNET and VAE models for each combination of width and height. However, the next version of OnnxStream will support this functionality (i.e. dynamic shapes for inputs), so this could be implemented in the future.
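For reference, once dynamic shapes are supported, the export side would only need torch.onnx.export's dynamic_axes argument to mark the spatial dimensions as symbolic (a sketch reusing the names from the export script earlier in this thread; the output path is hypothetical):

# Hypothetical: export once with symbolic height/width instead of one file per resolution.
torch.onnx.export(UNetModel(pipe.unet), dummy_input, "sdxl_unet_dynamic.onnx",
    input_names=input_names, output_names=output_names,
    opset_version=14, do_constant_folding=True,
    dynamic_axes={ "sample": {2: "height", 3: "width"},
                   "out_sample": {2: "height", 3: "width"} })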

As for the tutorial, any contribution is welcome! However, there is already something here:

https://github.com/vitoplantamura/OnnxStream#how-to-convert-and-run-a-custom-stable-diffusion-15-model-with-onnxstream-by-gaelicthunder

although I believe that a certain familiarity with Python, ONNX, and the code being exported is still necessary.

Thanks, Vito

AeroX2 commented 6 months ago

Looks like everything is working, even on the Raspberry Pi! 🥳 So I'm going to close this issue out.

vitoplantamura commented 6 months ago

Sorry for the late reply: yes it works perfectly!

Thank you so much for your work and time!

Vito