leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License

Add sd3 support #298

Closed: leejet closed this 2 months ago

leejet commented 3 months ago

How to Use

  1. Pull the latest code and build.
  2. Download sd3_medium_incl_clips_t5xxlfp16.safetensors from https://huggingface.co/stabilityai/stable-diffusion-3-medium.
  3. Try prompts you like, for example:
    ./bin/sd -m ../models/sd3_medium_incl_clips_t5xxlfp16.safetensors -H 1024 -W 1024 -p 'a lovely cat holding a sign says \"Stable diffusion 3\"' --cfg-scale 4.5 --sampling-method euler -v

    sd3

Green-Sky commented 3 months ago

Heck yea. gonna test later when I remember.

btw, for some reason it shows 2mil new lines of code, but the diff certainly does not (image). How did you manage to glitch this?

edit: nvm, call me blind, it's vocab.hpp. Do we really need it as code in the repo? Also, the name "vocab" is probably too generic.

leejet commented 3 months ago

> Heck yea. gonna test later when I remember.
>
> btw, for some reason it shows 2mil new lines of code, but the diff certainly does not (image). How did you manage to glitch this?
>
> edit: nvm, call me blind, it's vocab.hpp. Do we really need it as code in the repo? Also, the name "vocab" is probably too generic.

vocab.hpp contains the vocabulary for the CLIP and T5 tokenizers. Almost no sd model files include this part, so I put it directly into the binary so that we don't need an extra argument to specify a vocabulary file.
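
For illustration, here is a minimal sketch of the idea of baking a tokenizer vocabulary into the binary; the names and layout are hypothetical and far smaller than the real generated vocab.hpp:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical embedded vocabulary (the real vocab.hpp holds the full CLIP and
// T5 tokenizer vocabularies, ~49k and ~32k entries respectively per the logs).
static const std::vector<std::string> kEmbeddedVocab = {
    "<|startoftext|>", "<|endoftext|>", "a</w>", "cat</w>", "sign</w>",
};

// Build a token -> id lookup from the embedded list, so no external
// vocabulary file has to be shipped or passed on the command line.
inline std::unordered_map<std::string, int> build_token_to_id() {
    std::unordered_map<std::string, int> table;
    for (std::size_t i = 0; i < kEmbeddedVocab.size(); ++i) {
        table.emplace(kEmbeddedVocab[i], static_cast<int>(i));
    }
    return table;
}
```

Generating such a header from the tokenizer files keeps the CLI free of an extra vocabulary-file argument, at the cost of a very large generated source file (hence the ~2M-line diff).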

Green-Sky commented 3 months ago

did a 512x512 run (so, defaults), which came out malformed: "teddy bear with SD3 on a sign"

output

Green-Sky commented 3 months ago

question, what does the second text conditioning do?

[DEBUG] stable-diffusion.cpp:477  - finished loaded file
[DEBUG] stable-diffusion.cpp:1261 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1014 - prompt after extract and remove lora: "teddy bear with SD3 on a sign"
[INFO ] stable-diffusion.cpp:560  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1019 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:687  - parse 'teddy bear with SD3 on a sign' to [['teddy bear with SD3 on a sign', 1], ]
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] t5.hpp:397  - token length: 77
[DEBUG] ggml_extend.hpp:932  - clip compute buffer size: 1.40 MB(RAM)
[DEBUG] ggml_extend.hpp:932  - clip compute buffer size: 2.33 MB(RAM)
[DEBUG] ggml_extend.hpp:932  - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930  - computing condition graph completed, taking 3815 ms
[DEBUG] conditioner.hpp:687  - parse '' to [['', 1], ]
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] clip.hpp:311  - token length: 77
[DEBUG] t5.hpp:397  - token length: 77
[DEBUG] ggml_extend.hpp:932  - clip compute buffer size: 1.40 MB(RAM)
[DEBUG] ggml_extend.hpp:932  - clip compute buffer size: 2.33 MB(RAM)
[DEBUG] ggml_extend.hpp:932  - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930  - computing condition graph completed, taking 3797 ms
[INFO ] stable-diffusion.cpp:1143 - get_learned_condition completed, taking 7614 ms
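
(For context on why the condition graph runs twice above: with a cfg-scale above 1, both the prompt and the empty negative prompt presumably need to be encoded, since classifier-free guidance blends a conditional and an unconditional prediction at every sampling step. A rough sketch of that blend, not the exact code path in stable-diffusion.cpp:)

```cpp
#include <cstddef>
#include <vector>

// Classifier-free guidance: push the conditional prediction away from the
// unconditional (empty-prompt) prediction by cfg_scale (4.5 in these runs).
std::vector<float> cfg_blend(const std::vector<float>& cond_pred,
                             const std::vector<float>& uncond_pred,
                             float cfg_scale) {
    std::vector<float> out(cond_pred.size());
    for (std::size_t i = 0; i < cond_pred.size(); ++i) {
        out[i] = uncond_pred[i] + cfg_scale * (cond_pred[i] - uncond_pred[i]);
    }
    return out;
}
```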
Green-Sky commented 3 months ago

ok, did a 1024x1024 run with the euler sampler:

output

$ result/bin/sd -m ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors -p "teddy bear with SD3 on a sign" -t 12 -W 1024 -H 1024 --cfg-scale 4.5 -v --sampling-method euler

Details

```
Option:
    n_threads: 12
    mode: txt2img
    model_path: ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors
    wtype: unspecified
    vae_path:
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio: 20.00
    normzalize input image : false
    output_path: output.png
    init_img:
    control_image:
    clip on cpu: false
    controlnet cpu: false
    vae decoder on cpu: false
    strength(control): 0.90
    prompt: teddy bear with SD3 on a sign
    negative_prompt:
    min_cfg: 1.00
    cfg_scale: 4.50
    clip_skip: -1
    width: 1024
    height: 1024
    sample_method: euler
    schedule: default
    sample_steps: 20
    strength(img2img): 0.75
    rng: cuda
    seed: 42
    batch_count: 1
    vae_tiling: false
    upscale_repeats: 1
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:157 - Using CPU backend
[INFO ] stable-diffusion.cpp:167 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] model.cpp:737 - load ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors using safetensors format
[DEBUG] model.cpp:803 - init from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] stable-diffusion.cpp:190 - Stable Diffusion 3 2B
[INFO ] stable-diffusion.cpp:196 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:197 - ggml tensor size = 432 bytes
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 1329.29 MB(RAM) (517 tensors)
[DEBUG] ggml_extend.hpp:980 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:980 - mmdit params backend buffer size = 4086.83 MB(RAM) (491 tensors)
[DEBUG] ggml_extend.hpp:980 - vae params backend buffer size = 94.57 MB(RAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:319 - loading weights
[DEBUG] model.cpp:1389 - loading tensors from ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors
[INFO ] model.cpp:1535 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[INFO ] stable-diffusion.cpp:403 - total params memory size = 14829.53MB (VRAM 0.00MB, RAM 14829.53MB): clip 10648.13MB(RAM), unet 4086.83MB(RAM), vae 94.57MB(RAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:422 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors' completed, taking 17.16s
[INFO ] stable-diffusion.cpp:436 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:477 - finished loaded file
[DEBUG] stable-diffusion.cpp:1261 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1014 - prompt after extract and remove lora: "teddy bear with SD3 on a sign"
[INFO ] stable-diffusion.cpp:560 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1019 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:687 - parse 'teddy bear with SD3 on a sign' to [['teddy bear with SD3 on a sign', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 3294 ms
[DEBUG] conditioner.hpp:687 - parse '' to [['', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 3302 ms
[INFO ] stable-diffusion.cpp:1143 - get_learned_condition completed, taking 6599 ms
[INFO ] stable-diffusion.cpp:1164 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1168 - generating image: 1/1 - seed 42
[DEBUG] ggml_extend.hpp:932 - mmdit compute buffer size: 1784.58 MB(RAM)
|==================================================| 20/20 - 89.26s/it
[INFO ] stable-diffusion.cpp:1199 - sampling completed, taking 1793.16s
[INFO ] stable-diffusion.cpp:1207 - generating 1 latent images completed, taking 1794.48s
[INFO ] stable-diffusion.cpp:1210 - decoding 1 latents
[DEBUG] ggml_extend.hpp:932 - vae compute buffer size: 6656.00 MB(RAM)
[DEBUG] stable-diffusion.cpp:884 - computing vae [mode: DECODE] graph completed, taking 62.70s
[INFO ] stable-diffusion.cpp:1220 - latent 1 decoded, taking 62.70s
[INFO ] stable-diffusion.cpp:1224 - decode_first_stage completed, taking 62.70s
[INFO ] stable-diffusion.cpp:1324 - txt2img completed in 1863.80s
save result image to 'output.png'
```

The 89 s/iteration on CPU is pretty heavy (20 steps is roughly half an hour of sampling, plus ~63 s for the VAE decode), but it works. Also, considering their size, the text encoders are fast too.

Green-Sky commented 3 months ago

Running with CUDA is very fast; however, the VAE once again does not fit into VRAM and crashes.

[DEBUG] stable-diffusion.cpp:147  - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
[INFO ] stable-diffusion.cpp:229  - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:232  - CLIP: Using CPU backend
[DEBUG] ggml_extend.hpp:980  - clip params backend buffer size =  235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:980  - clip params backend buffer size =  1329.29 MB(RAM) (517 tensors)
[DEBUG] ggml_extend.hpp:980  - t5 params backend buffer size =  9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:980  - mmdit params backend buffer size =  4086.83 MB(VRAM) (491 tensors)
[DEBUG] ggml_extend.hpp:980  - vae params backend buffer size =  94.57 MB(VRAM) (138 tensors)
Details

```
$ result/bin/sd -m ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors -p "teddy bear with SD3 on a sign" -t 12 -W 1024 -H 1024 --cfg-scale 4.5 -v --sampling-method euler
Option:
    n_threads: 12
    mode: txt2img
    model_path: ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors
    wtype: unspecified
    vae_path:
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio: 20.00
    normzalize input image : false
    output_path: output.png
    init_img:
    control_image:
    clip on cpu: false
    controlnet cpu: false
    vae decoder on cpu: false
    strength(control): 0.90
    prompt: teddy bear with SD3 on a sign
    negative_prompt:
    min_cfg: 1.00
    cfg_scale: 4.50
    clip_skip: -1
    width: 1024
    height: 1024
    sample_method: euler
    schedule: default
    sample_steps: 20
    strength(img2img): 0.75
    rng: cuda
    seed: 42
    batch_count: 1
    vae_tiling: false
    upscale_repeats: 1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:147 - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
[INFO ] stable-diffusion.cpp:167 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] model.cpp:737 - load ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors using safetensors format
[DEBUG] model.cpp:803 - init from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] stable-diffusion.cpp:190 - Stable Diffusion 3 2B
[INFO ] stable-diffusion.cpp:196 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:197 - ggml tensor size = 432 bytes
[INFO ] stable-diffusion.cpp:229 - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:232 - CLIP: Using CPU backend
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 1329.29 MB(RAM) (517 tensors)
[DEBUG] ggml_extend.hpp:980 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:980 - mmdit params backend buffer size = 4086.83 MB(VRAM) (491 tensors)
[DEBUG] ggml_extend.hpp:980 - vae params backend buffer size = 94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:319 - loading weights
[DEBUG] model.cpp:1389 - loading tensors from ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors
[INFO ] model.cpp:1535 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[INFO ] stable-diffusion.cpp:403 - total params memory size = 14829.53MB (VRAM 4181.40MB, RAM 10648.13MB): clip 10648.13MB(RAM), unet 4086.83MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:422 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors' completed, taking 10.88s
[INFO ] stable-diffusion.cpp:436 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:477 - finished loaded file
[DEBUG] stable-diffusion.cpp:1261 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1014 - prompt after extract and remove lora: "teddy bear with SD3 on a sign"
[INFO ] stable-diffusion.cpp:560 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1019 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:687 - parse 'teddy bear with SD3 on a sign' to [['teddy bear with SD3 on a sign', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 5070 ms
[DEBUG] conditioner.hpp:687 - parse '' to [['', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 5045 ms
[INFO ] stable-diffusion.cpp:1143 - get_learned_condition completed, taking 10119 ms
[INFO ] stable-diffusion.cpp:1164 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1168 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 1786.11 MiB
[DEBUG] ggml_extend.hpp:932 - mmdit compute buffer size: 1786.11 MB(VRAM)
|==================================================| 20/20 - 4.09s/it
[INFO ] stable-diffusion.cpp:1199 - sampling completed, taking 82.10s
[INFO ] stable-diffusion.cpp:1207 - generating 1 latent images completed, taking 83.02s
[INFO ] stable-diffusion.cpp:1210 - decoding 1 latents
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 6656.00 MiB
[DEBUG] ggml_extend.hpp:932 - vae compute buffer size: 6656.00 MB(VRAM)
CUDA error: out of memory
  current device: 0, in function alloc at /build/k1y3zc6zgzlm78m3lslh1m453fn3gsmg-source/ggml/src/ggml-cuda.cu:357
  cuMemCreate(&handle, reserve_size, &prop, 0)
GGML_ASSERT: /build/k1y3zc6zgzlm78m3lslh1m453fn3gsmg-source/ggml/src/ggml-cuda.cu:100: !"CUDA error"
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
```


However, hardcoding vae_on_cpu to true makes it work with 4.10 s/it.

[INFO ] stable-diffusion.cpp:255  - VAE Autoencoder: Using CPU backend
[DEBUG] ggml_extend.hpp:932  - vae compute buffer size: 6656.00 MB(RAM)
[DEBUG] stable-diffusion.cpp:884  - computing vae [mode: DECODE] graph completed, taking 85.62s
[INFO ] stable-diffusion.cpp:1220 - latent 1 decoded, taking 85.62s
[INFO ] stable-diffusion.cpp:1224 - decode_first_stage completed, taking 85.62s
[INFO ] stable-diffusion.cpp:1324 - txt2img completed in 179.18s

output
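
A hypothetical sketch of the kind of fallback being described: pick the CPU backend for the VAE when it is known not to fit in VRAM. Only the ggml backend calls are real; the surrounding structure and names are made up:

```cpp
#include "ggml-backend.h"

// Choose where the VAE decode runs: fall back to CPU when requested or when
// no GPU backend is available, otherwise keep it on the GPU. On CPU the
// ~6.5 GB compute buffer lands in RAM instead of VRAM.
static ggml_backend_t pick_vae_backend(bool vae_on_cpu, ggml_backend_t gpu_backend) {
    if (vae_on_cpu || gpu_backend == nullptr) {
        return ggml_backend_cpu_init();
    }
    return gpu_backend;
}
```

Exposing this as an option rather than hardcoding it (the option dump already prints a "vae decoder on cpu" field) would presumably be the longer-term fix; VAE tiling is another knob in the dump that could shrink the 6656 MB decode buffer.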

Green-Sky commented 3 months ago

taesd3 does not work:

[INFO ] tae.hpp:204  - loading taesd from '../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors', decode_only = true
[DEBUG] ggml_extend.hpp:980  - taesd params backend buffer size =   4.67 MB(VRAM) (134 tensors)
[INFO ] model.cpp:737  - load ../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors using safetensors format
[DEBUG] model.cpp:803  - init from '../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors'
[DEBUG] model.cpp:1389 - loading tensors from ../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors
[ERROR] model.cpp:1544 - tensor 'decoder.layers.0.weight' has wrong shape in model file: got [3, 3, 16, 64], expected [3, 3, 4, 64]
[WARN ] model.cpp:1451 - process tensor failed: 'decoder.layers.0.weight'
[ERROR] model.cpp:1560 - load tensors from file failed
[ERROR] tae.hpp:222  - load tae tensors from model loader failed
new_sd_ctx_t failed

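The shape mismatch above (got [3, 3, 16, 64], expected [3, 3, 4, 64]) suggests the current TAESD graph is still built for a 4-channel latent space, while taesd3 expects SD3's 16-channel latents. A hypothetical way to parameterize that (names made up, not the actual tae.hpp code):

```cpp
// Hypothetical: derive the TAESD decoder's input channel count from the model
// family instead of hard-coding 4.
enum class LatentSpace { SD1_SDXL, SD3 };

constexpr int latent_channels(LatentSpace ls) {
    // Assumption: SD3 uses a 16-channel latent space, earlier models use 4,
    // matching the conv weight shapes in the error above.
    return ls == LatentSpace::SD3 ? 16 : 4;
}
```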
edit: also, LoRAs don't work yet (tried flash-diffusion).

FSSRepo commented 3 months ago

@leejet hello! In diffusion_model.hpp there is a typo in the DiffuisionModel class name.

leejet commented 2 months ago

> @leejet hello! In diffusion_model.hpp there is a typo in the DiffuisionModel class name.

Fixed!

leejet commented 2 months ago

Support for taesd3 and LoRA will be added later in a separate PR. This PR's functionality seems complete, so I'll merge it now.