Closed leejet closed 2 months ago
Heck yea, gonna test later when I remember.
btw, for some reason it shows 2 million new lines of code, but the diff certainly does not. How did you manage to glitch this?
edit: nvm, call me blind, it's `vocab.hpp`. Do we really need it as code in the repo? Also, the name "vocab" is probably too generic.
`vocab.hpp` contains the vocabulary for the CLIP and T5 tokenizers. Almost all SD models don't ship this part, so I put it directly into the binary so that we don't need an extra argument to specify a vocabulary file.
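For context, the embed-in-the-binary approach looks roughly like this (a minimal sketch; the table name, helper, and entries are illustrative, not the actual contents of `vocab.hpp`):

```cpp
// Sketch: a generated header bakes the tokenizer vocabulary into the
// binary, so no vocab file has to be passed on the command line.
// kClipVocab and token_id() are illustrative only.
#include <map>
#include <string>

static const std::map<std::string, int> kClipVocab = {
    // In the real header this table holds all 49408 CLIP tokens
    // (the "</w>" suffix is CLIP's end-of-word BPE marker).
    {"teddy</w>", 1},
    {"bear</w>", 2},
};

// Hypothetical lookup helper: token string -> id, -1 if unknown.
inline int token_id(const std::string& tok) {
    auto it = kClipVocab.find(tok);
    return it == kClipVocab.end() ? -1 : it->second;
}
```

The trade-off is a large generated header (hence the huge line count in the diff) in exchange for one fewer CLI argument.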
Did a 512x512 run (so, defaults), which came out malformed:
teddy bear with SD3 on a sign
Question: what does the second text conditioning do?
```
[DEBUG] stable-diffusion.cpp:477 - finished loaded file
[DEBUG] stable-diffusion.cpp:1261 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1014 - prompt after extract and remove lora: "teddy bear with SD3 on a sign"
[INFO ] stable-diffusion.cpp:560 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1019 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:687 - parse 'teddy bear with SD3 on a sign' to [['teddy bear with SD3 on a sign', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 3815 ms
[DEBUG] conditioner.hpp:687 - parse '' to [['', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 3797 ms
[INFO ] stable-diffusion.cpp:1143 - get_learned_condition completed, taking 7614 ms
```
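Regarding the question above: the second condition graph (the `parse ''` pass) is most likely the unconditional/negative-prompt embedding needed for classifier-free guidance; the sampler then blends the two predictions each step. A sketch of the standard CFG blend (not the actual stable-diffusion.cpp code):

```cpp
// Classifier-free guidance: combine the prediction conditioned on the
// prompt with the one conditioned on the empty string. cfg_scale
// corresponds to the --cfg-scale CLI value (4.5 in the runs here).
#include <cstddef>
#include <vector>

std::vector<float> cfg_blend(const std::vector<float>& cond,
                             const std::vector<float>& uncond,
                             float cfg_scale) {
    std::vector<float> out(cond.size());
    for (std::size_t i = 0; i < cond.size(); ++i) {
        // Move away from the unconditional prediction, toward the prompt.
        out[i] = uncond[i] + cfg_scale * (cond[i] - uncond[i]);
    }
    return out;
}
```

With `cfg_scale = 1` this reduces to the conditional prediction alone, which is why the empty-prompt pass only matters when the scale is above 1.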
OK, did a 1024x1024 run with the Euler sampler:
```
$ result/bin/sd -m ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors -p "teddy bear with SD3 on a sign" -t 12 -W 1024 -H 1024 --cfg-scale 4.5 -v --sampling-method euler
```
The 89 s/iteration on CPU is pretty heavy, but it works. The text encoders are fast too, considering their size.
Running with CUDA is very fast; however, the VAE once again doesn't fit into VRAM and crashes.
```
[DEBUG] stable-diffusion.cpp:147 - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
[INFO ] stable-diffusion.cpp:229 - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:232 - CLIP: Using CPU backend
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 1329.29 MB(RAM) (517 tensors)
[DEBUG] ggml_extend.hpp:980 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:980 - mmdit params backend buffer size = 4086.83 MB(VRAM) (491 tensors)
[DEBUG] ggml_extend.hpp:980 - vae params backend buffer size = 94.57 MB(VRAM) (138 tensors)
```
```
$ result/bin/sd -m ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors -p "teddy bear with SD3 on a sign" -t 12 -W 1024 -H 1024 --cfg-scale 4.5 -v --sampling-method euler
Option:
    n_threads:         12
    mode:              txt2img
    model_path:        ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors
    wtype:             unspecified
    vae_path:
    taesd_path:
    esrgan_path:
    controlnet_path:
    embeddings_path:
    stacked_id_embeddings_path:
    input_id_images_path:
    style ratio:       20.00
    normzalize input image :  false
    output_path:       output.png
    init_img:
    control_image:
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            teddy bear with SD3 on a sign
    negative_prompt:
    min_cfg:           1.00
    cfg_scale:         4.50
    clip_skip:         -1
    width:             1024
    height:            1024
    sample_method:     euler
    schedule:          default
    sample_steps:      20
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:147 - Using CUDA backend
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2070, compute capability 7.5, VMM: yes
[INFO ] stable-diffusion.cpp:167 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] model.cpp:737 - load ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors using safetensors format
[DEBUG] model.cpp:803 - init from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors'
[INFO ] stable-diffusion.cpp:190 - Stable Diffusion 3 2B
[INFO ] stable-diffusion.cpp:196 - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:197 - ggml tensor size = 432 bytes
[INFO ] stable-diffusion.cpp:229 - set clip_on_cpu to true
[INFO ] stable-diffusion.cpp:232 - CLIP: Using CPU backend
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] clip.hpp:171 - vocab size: 49408
[DEBUG] clip.hpp:182 - trigger word img already in vocab
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 235.06 MB(RAM) (196 tensors)
[DEBUG] ggml_extend.hpp:980 - clip params backend buffer size = 1329.29 MB(RAM) (517 tensors)
[DEBUG] ggml_extend.hpp:980 - t5 params backend buffer size = 9083.77 MB(RAM) (219 tensors)
[DEBUG] ggml_extend.hpp:980 - mmdit params backend buffer size = 4086.83 MB(VRAM) (491 tensors)
[DEBUG] ggml_extend.hpp:980 - vae params backend buffer size = 94.57 MB(VRAM) (138 tensors)
[DEBUG] stable-diffusion.cpp:319 - loading weights
[DEBUG] model.cpp:1389 - loading tensors from ../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors
[INFO ] model.cpp:1535 - unknown tensor 'text_encoders.t5xxl.transformer.encoder.embed_tokens.weight | f16 | 2 [4096, 32128, 1, 1, 1]' in model file
[INFO ] stable-diffusion.cpp:403 - total params memory size = 14829.53MB (VRAM 4181.40MB, RAM 10648.13MB): clip 10648.13MB(RAM), unet 4086.83MB(VRAM), vae 94.57MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(RAM)
[INFO ] stable-diffusion.cpp:422 - loading model from '../stable-diffusion-webui/models/Stable-diffusion/sd3_medium_incl_clips_t5xxlfp16.safetensors' completed, taking 10.88s
[INFO ] stable-diffusion.cpp:436 - running in FLOW mode
[DEBUG] stable-diffusion.cpp:477 - finished loaded file
[DEBUG] stable-diffusion.cpp:1261 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1014 - prompt after extract and remove lora: "teddy bear with SD3 on a sign"
[INFO ] stable-diffusion.cpp:560 - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1019 - apply_loras completed, taking 0.00s
[DEBUG] conditioner.hpp:687 - parse 'teddy bear with SD3 on a sign' to [['teddy bear with SD3 on a sign', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 5070 ms
[DEBUG] conditioner.hpp:687 - parse '' to [['', 1], ]
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] clip.hpp:311 - token length: 77
[DEBUG] t5.hpp:397 - token length: 77
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 1.40 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 1.40 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 2.33 MiB
[DEBUG] ggml_extend.hpp:932 - clip compute buffer size: 2.33 MB(RAM)
ggml_gallocr_reserve_n: reallocating CPU buffer from size 0.00 MiB to 11.94 MiB
[DEBUG] ggml_extend.hpp:932 - t5 compute buffer size: 11.94 MB(RAM)
[DEBUG] conditioner.hpp:930 - computing condition graph completed, taking 5045 ms
[INFO ] stable-diffusion.cpp:1143 - get_learned_condition completed, taking 10119 ms
[INFO ] stable-diffusion.cpp:1164 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1168 - generating image: 1/1 - seed 42
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 1786.11 MiB
[DEBUG] ggml_extend.hpp:932 - mmdit compute buffer size: 1786.11 MB(VRAM)
  |==================================================| 20/20 - 4.09s/it
[INFO ] stable-diffusion.cpp:1199 - sampling completed, taking 82.10s
[INFO ] stable-diffusion.cpp:1207 - generating 1 latent images completed, taking 83.02s
[INFO ] stable-diffusion.cpp:1210 - decoding 1 latents
ggml_gallocr_reserve_n: reallocating CUDA0 buffer from size 0.00 MiB to 6656.00 MiB
[DEBUG] ggml_extend.hpp:932 - vae compute buffer size: 6656.00 MB(VRAM)
CUDA error: out of memory
  current device: 0, in function alloc at /build/k1y3zc6zgzlm78m3lslh1m453fn3gsmg-source/ggml/src/ggml-cuda.cu:357
  cuMemCreate(&handle, reserve_size, &prop, 0)
GGML_ASSERT: /build/k1y3zc6zgzlm78m3lslh1m453fn3gsmg-source/ggml/src/ggml-cuda.cu:100: !"CUDA error"
ptrace: Operation not permitted.
No stack.
The program is not being run.
Aborted (core dumped)
```
However, hardcoding `vae_on_cpu` to `true` makes it work, at 4.10 s/it.
```
[INFO ] stable-diffusion.cpp:255 - VAE Autoencoder: Using CPU backend
[DEBUG] ggml_extend.hpp:932 - vae compute buffer size: 6656.00 MB(RAM)
[DEBUG] stable-diffusion.cpp:884 - computing vae [mode: DECODE] graph completed, taking 85.62s
[INFO ] stable-diffusion.cpp:1220 - latent 1 decoded, taking 85.62s
[INFO ] stable-diffusion.cpp:1224 - decode_first_stage completed, taking 85.62s
[INFO ] stable-diffusion.cpp:1324 - txt2img completed in 179.18s
```
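The 179.18 s total is almost entirely the three big stages reported in the logs (conditioning, sampling, VAE decode); a quick sanity check on the numbers:

```cpp
// Sanity check: the stage timings reported in the logs should add up
// to roughly the reported txt2img total.
double stage_sum(double cond_s, double sampling_s, double decode_s) {
    return cond_s + sampling_s + decode_s;
}
// From the logs: conditioning 10.119 s, latent generation 83.02 s,
// VAE decode 85.62 s, reported total 179.18 s. The ~0.4 s gap is
// presumably setup and image-saving overhead.
```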
taesd3 does not work:
```
[INFO ] tae.hpp:204 - loading taesd from '../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors', decode_only = true
[DEBUG] ggml_extend.hpp:980 - taesd params backend buffer size = 4.67 MB(VRAM) (134 tensors)
[INFO ] model.cpp:737 - load ../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors using safetensors format
[DEBUG] model.cpp:803 - init from '../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors'
[DEBUG] model.cpp:1389 - loading tensors from ../stable-diffusion-webui/models/VAE-approx/taesd3_diffusion_pytorch_model.safetensors
[ERROR] model.cpp:1544 - tensor 'decoder.layers.0.weight' has wrong shape in model file: got [3, 3, 16, 64], expected [3, 3, 4, 64]
[WARN ] model.cpp:1451 - process tensor failed: 'decoder.layers.0.weight'
[ERROR] model.cpp:1560 - load tensors from file failed
[ERROR] tae.hpp:222 - load tae tensors from model loader failed
new_sd_ctx_t failed
```
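The shape error is consistent with SD3's latent space: SD3 latents have 16 channels, while the loader here still expects the 4-channel latents of SD1.x/SDXL, so taesd3's first decoder conv weight (`[3, 3, 16, 64]`) fails the check. An illustrative version of that check (hypothetical code; only the shape numbers come from the log):

```cpp
// Illustrative shape check mirroring the "wrong shape in model file"
// error above; not the actual model.cpp code.
#include <array>

using Shape4 = std::array<int, 4>;  // {kernel_w, kernel_h, in_ch, out_ch}

inline bool shape_matches(const Shape4& expected, const Shape4& got) {
    return expected == got;
}

// What the loader expects for a 4-channel (SD1/SDXL) latent decoder:
constexpr Shape4 kExpected = {3, 3, 4, 64};
// What taesd3 actually ships, because SD3 latents have 16 channels:
constexpr Shape4 kGot = {3, 3, 16, 64};
```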
edit: also LoRAs don't work yet (tried flash-diffusion).
@leejet hello! In diffusion_model.hpp there is a typo in the class name: `DiffuisionModel`.
Fixed!
Support for taesd3 and LoRA will be added later in a separate PR. This PR seems to have all its functionality complete, so I'll merge it now.
How to Use