First impressions info dump

Hey, finally stable diffusion for ggml :smile:

Did a test run

$ ./sd -t 8 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal"
[INFO]  stable-diffusion.cpp:2189 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[INFO]  stable-diffusion.cpp:2214 - ftype: q8_0
[INFO]  stable-diffusion.cpp:2259 - params ctx size =  1618.72 MB
[INFO]  stable-diffusion.cpp:2399 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.46s
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2822 - get_learned_condition completed, taking 0.16s
[INFO]  stable-diffusion.cpp:2830 - start sampling
[INFO]  stable-diffusion.cpp:2674 - step 1 sampling completed, taking 18.34s
[INFO]  stable-diffusion.cpp:2674 - step 2 sampling completed, taking 18.24s
[INFO]  stable-diffusion.cpp:2674 - step 3 sampling completed, taking 18.65s
[INFO]  stable-diffusion.cpp:2674 - step 4 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 5 sampling completed, taking 18.31s
[INFO]  stable-diffusion.cpp:2674 - step 6 sampling completed, taking 18.18s
[INFO]  stable-diffusion.cpp:2674 - step 7 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 8 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 9 sampling completed, taking 18.21s
[INFO]  stable-diffusion.cpp:2674 - step 10 sampling completed, taking 18.28s
[INFO]  stable-diffusion.cpp:2674 - step 11 sampling completed, taking 18.19s
[INFO]  stable-diffusion.cpp:2674 - step 12 sampling completed, taking 18.00s
[INFO]  stable-diffusion.cpp:2674 - step 13 sampling completed, taking 18.03s
[INFO]  stable-diffusion.cpp:2674 - step 14 sampling completed, taking 18.54s
[INFO]  stable-diffusion.cpp:2674 - step 15 sampling completed, taking 18.32s
[INFO]  stable-diffusion.cpp:2674 - step 16 sampling completed, taking 18.41s
[INFO]  stable-diffusion.cpp:2674 - step 17 sampling completed, taking 18.29s
[INFO]  stable-diffusion.cpp:2674 - step 18 sampling completed, taking 18.51s
[INFO]  stable-diffusion.cpp:2674 - step 19 sampling completed, taking 18.62s
[INFO]  stable-diffusion.cpp:2674 - step 20 sampling completed, taking 18.11s
[INFO]  stable-diffusion.cpp:2686 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO]  stable-diffusion.cpp:2835 - sampling completed, taking 366.14s
[INFO]  stable-diffusion.cpp:2766 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO]  stable-diffusion.cpp:2842 - decode_first_stage completed, taking 57.66s
[INFO]  stable-diffusion.cpp:2843 - txt2img completed in 423.96s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'

output

Pain point: the extra Python libs needed for conversion. I got a pip install error because I already have an incompatible version of something installed; convert.py worked anyway, though. :)

Timings: I used the q8_0 quantization and ran with different thread counts; I have a 12-core (24-thread) CPU. Each value is the time of one sampling step.

| threads | q8_0 | q4_0 | f16 |
| --- | --- | --- | --- |
| -t 1 | 75.31s | 75.20s | 82.92s |
| -t 2 | 42.44s | | |
| -t 4 | 28.65s | 29.23s | 30.00s |
| -t 6 | 21.68s | | |
| -t 8 | 18.34s | 18.89s | 19.05s |
| -t 10 | 16.38s | 16.78s | 17.61s |
| -t 12 | 16.26s | 16.98s | 18.11s |
| -t 14 | 17.93s | | |
| -t 16 | 16.80s | | |
| -t 18 | 16.70s | | |
| -t 20 | 16.20s | | |
| -t 22 | 16.96s | | |
| -t 24 | 18.93s | | |
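
To make the scaling in these timings easier to read, here is a small sketch that turns the q8_0 column into speedup and parallel efficiency relative to the single-thread run. The struct and function names are made up for illustration; only the numbers come from the table.

```cpp
#include <vector>

// One row of the q8_0 column from the timings table.
struct Timing {
    int threads;
    double seconds;  // time for one sampling step
};

// Speedup relative to the single-thread run: t(1) / t(n).
double speedup(double single_thread_seconds, double seconds) {
    return single_thread_seconds / seconds;
}

// Parallel efficiency: speedup divided by thread count (1.0 = perfect scaling).
double efficiency(double single_thread_seconds, const Timing& t) {
    return speedup(single_thread_seconds, t.seconds) / t.threads;
}

// q8_0 timings copied from the table, ordered by thread count.
std::vector<Timing> q8_0_timings() {
    return {{1, 75.31},  {2, 42.44},  {4, 28.65},  {6, 21.68},  {8, 18.34},
            {10, 16.38}, {12, 16.26}, {14, 17.93}, {16, 16.80}, {18, 16.70},
            {20, 16.20}, {22, 16.96}, {24, 18.93}};
}
```

Calling `speedup(75.31, 16.26)` gives roughly 4.6x for 12 threads, and the times barely move once the thread count goes past the 12 physical cores.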

Additional questions:

1. do you support, or plan to support, token weighting? (e.g. (cinematic:1.3))
2. are you looking into supporting the cuda/opencl backends from ggml?
3. are you looking into k-quants (like llama.cpp) and some form of quality measurement for quantizations? (k-quants use different quantization types for different parts of the model)
4. it would be nice if the tool printed the "system line" (see https://github.com/ggerganov/llama.cpp/blob/f64d44a9b9581cd58f7ec40f4fa1c3ca5ca18e1e/llama.cpp#L4267 )
5. did not see it mentioned: does it support sd 2.x, or do you plan to add support for it?
6. my little benchmark suggests the bottleneck is not the model file but the dynamic data. What number type do you use for it? llama.cpp has shown little to no degradation in quality when using f16 instead of f32 for the kv-cache.

edit: added f16 timings

Thanks for the feedback.

1. Yes, I'm preparing to support a tokenizer in the style of stable-diffusion-webui, which includes token weighting.
2. I'm working on adding GPU support and am currently focusing on getting ggml_conv_2d to run on the GPU, because ggml_conv_2d only supports the CPU for now.
3. Great idea, I'll add this to the TODO list.
4. You can add the -v or --verbose parameter, which will show the system info.
5. Currently, only SD 1.x is supported. Support for SD 2.x will be added in the future.
6. Yes, a relatively large amount of memory is being used to store dynamic data (which is actually an optimized outcome). GGML currently uses f32 to store temporary calculation results. Changing it to f16 would reduce dynamic memory usage by half. I'm currently contemplating how to modify GGML to achieve this.
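
As a rough illustration of the "by half" estimate: an f32 element takes 4 bytes and an f16 element takes 2, so the same number of intermediate values needs half the space. The sketch below is plain arithmetic, not ggml code; the helper names are hypothetical, and it assumes every dynamic buffer could be switched to f16, which may not hold in practice.

```cpp
#include <cstddef>
#include <cstdint>

// Element sizes: f32 is a 4-byte float, f16 is stored in 16 bits.
constexpr std::size_t kF32Size = sizeof(float);
constexpr std::size_t kF16Size = sizeof(std::uint16_t);

// Bytes needed for a buffer of `count` elements of `element_size` bytes each.
std::size_t buffer_bytes(std::size_t count, std::size_t element_size) {
    return count * element_size;
}

// Hypothetical helper: estimated size in MB of a buffer that currently
// holds f32 values if the same values were stored as f16 instead.
double f16_megabytes(double f32_megabytes) {
    return f32_megabytes * kF16Size / kF32Size;
}
```

Applied to the 554.21MB of dynamic memory reported for the diffusion graph, `f16_megabytes(554.21)` comes out at about 277MB under that assumption.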

> Yes, I'm preparing to support a tokenizer in the style of stable-diffusion-webui, which includes token weighting.

very nice

> I'm working on adding GPU support and am currently focusing on getting ggml_conv_2d to run on the GPU, because ggml_conv_2d only supports the CPU for now.

i see

> You can add the -v or --verbose parameter, which will show the system info.

oh, i overlooked that one

run with verbose

```
$ ./sd -t 10 -m ../models/v1-5-pruned-emaonly-ggml-model-f16.bin -p "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal" -v
Option:
    n_threads: 10
    model_path: ../models/v1-5-pruned-emaonly-ggml-model-f16.bin
    output_path: output.png
    prompt: alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal
    negative_prompt:
    cfg_scale: 7.00
    width: 512
    height: 512
    sample_method: eular a
    sample_steps: 20
    seed: 42
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO] stable-diffusion.cpp:2189 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-f16.bin'
[DEBUG] stable-diffusion.cpp:2197 - verifying magic
[DEBUG] stable-diffusion.cpp:2208 - loading hparams
[INFO] stable-diffusion.cpp:2214 - ftype: f16
[DEBUG] stable-diffusion.cpp:2220 - loading vocab
[DEBUG] stable-diffusion.cpp:2258 - ggml tensor size = 240 bytes
[INFO] stable-diffusion.cpp:2259 - params ctx size = 1970.08 MB
[DEBUG] stable-diffusion.cpp:2276 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2291 - loading weights
[DEBUG] stable-diffusion.cpp:2396 - model size = 1969.67MB
[INFO] stable-diffusion.cpp:2399 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-f16.bin' completed, taking 0.59s
[DEBUG] stable-diffusion.cpp:333 - split prompt "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal" to tokens ["alps", ",", "distant", "<|endoftext|>", ",", "small", "church", ",", "(", "cinematic", ":", "1", ".", "3", "),", "intricate", "details", ",", "(", "<|endoftext|>", ":", "1", ".", "2", "),", "nikon", "<|endoftext|>", ",", "masterpiece", ",", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2434 - condition context need 1.62MB static memory, with work_size needing 0.45MB
[DEBUG] stable-diffusion.cpp:2459 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2467 - computing condition graph completed, taking 0.11s
[INFO] stable-diffusion.cpp:2477 - condition graph use 4.56MB of memory: static 1.62MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2481 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:333 - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:2434 - condition context need 1.62MB static memory, with work_size needing 0.45MB
[DEBUG] stable-diffusion.cpp:2459 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2467 - computing condition graph completed, taking 0.12s
[INFO] stable-diffusion.cpp:2477 - condition graph use 4.56MB of memory: static 1.62MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2481 - 236544 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2822 - get_learned_condition completed, taking 0.24s
[INFO] stable-diffusion.cpp:2830 - start sampling
[DEBUG] stable-diffusion.cpp:2529 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO] stable-diffusion.cpp:2674 - step 1 sampling completed, taking 17.08s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 2 sampling completed, taking 17.31s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 3 sampling completed, taking 17.06s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 4 sampling completed, taking 17.19s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 5 sampling completed, taking 17.15s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 6 sampling completed, taking 17.12s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 7 sampling completed, taking 16.87s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 8 sampling completed, taking 17.01s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 9 sampling completed, taking 17.11s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 10 sampling completed, taking 17.39s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 11 sampling completed, taking 17.10s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 12 sampling completed, taking 16.85s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 13 sampling completed, taking 16.94s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 14 sampling completed, taking 17.00s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 15 sampling completed, taking 17.14s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 16 sampling completed, taking 17.03s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 17 sampling completed, taking 17.75s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 18 sampling completed, taking 17.98s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 19 sampling completed, taking 17.40s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2674 - step 20 sampling completed, taking 17.10s
[DEBUG] stable-diffusion.cpp:2675 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2679 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2686 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2690 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2835 - sampling completed, taking 343.58s
[DEBUG] stable-diffusion.cpp:2731 - vae context need 1153.12MB static memory, with work_size needing 1152.00MB
[DEBUG] stable-diffusion.cpp:2757 - computing vae graph completed, taking 54.01s
[INFO] stable-diffusion.cpp:2766 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[DEBUG] stable-diffusion.cpp:2770 - 3145728 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2842 - decode_first_stage completed, taking 54.05s
[INFO] stable-diffusion.cpp:2843 - txt2img completed in 397.86s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1969.94MB
save result image to 'output.png'
```

[DEBUG] stable-diffusion.cpp:333  - split prompt "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal" to tokens ["alps</w>", ",</w>", "distant</w>", "<|endoftext|>", ",</w>", "small</w>", "church</w>", ",</w>", "(</w>", "cinematic</w>", ":</w>", "1</w>", ".</w>", "3</w>", "),</w>", "intricate</w>", "details</w>", ",</w>", "(</w>", "<|endoftext|>", ":</w>", "1</w>", ".</w>", "2</w>", "),</w>", "nikon</w>", "<|endoftext|>", ",</w>", "masterpiece</w>", ",</w>", "<|endoftext|>", ]

the tokenizer really looks like it needs some work (several words end up as <|endoftext|>); really surprised the image came out that good.

> Currently, only SD 1.x is supported. Support for SD 2.x will be added in the future.

good to hear.

> Yes, a relatively large amount of memory is being used to store dynamic data (which is actually an optimized outcome). GGML currently uses f32 to store temporary calculation results. Changing it to f16 would reduce dynamic memory usage by half. I'm currently contemplating how to modify GGML to achieve this.

cant wait :smile:

Adding my first impressions here as well. I had some compile errors on my system:

stable-diffusion.cpp/stable-diffusion.cpp: In function ‘void copy_ggml_tensor(ggml_tensor*, const ggml_tensor*)’:
stable-diffusion.cpp/stable-diffusion.cpp:171:5: error: ‘memcpy’ was not declared in this scope
  171 |     memcpy(((char*)dst->data), ((char*)src->data), ggml_nbytes(dst));
      |     ^~~~~~
stable-diffusion.cpp/stable-diffusion.cpp:16:1: note: ‘memcpy’ is defined in header ‘<cstring>’; did you forget to ‘#include <cstring>’?
   15 | #include "stable-diffusion.h"
  +++ |+#include <cstring>
   16 |
stable-diffusion.cpp/stable-diffusion.cpp: In member function ‘std::vector<int> CLIPTokenizer::encode(std::string)’:
stable-diffusion.cpp/stable-diffusion.cpp:318:54: error: ‘istream_iterator’ is not a member of ‘std’
  318 |                 std::vector<std::string> tokens{std::istream_iterator<std::string>{iss},
      |                                                      ^~~~~~~~~~~~~~~~
stable-diffusion.cpp/stable-diffusion.cpp:16:1: note: ‘std::istream_iterator’ is defined in header ‘<iterator>’; did you forget to ‘#include <iterator>’?

After adding these includes (<cstring> and <iterator>) to stable-diffusion.cpp it worked great.
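
For reference, a minimal sketch of the two constructs from the error messages with their headers included explicitly. The function names `copy_bytes` and `split_whitespace` are made up for illustration and are not the project's code; the point is only that `std::memcpy` needs `<cstring>` and `std::istream_iterator` needs `<iterator>`, rather than relying on other headers pulling them in.

```cpp
#include <cstring>   // std::memcpy
#include <iterator>  // std::istream_iterator
#include <sstream>
#include <string>
#include <vector>

// Copy n raw bytes from src to dst (the buffers must not overlap).
void copy_bytes(void* dst, const void* src, std::size_t n) {
    std::memcpy(dst, src, n);
}

// Split a string on whitespace using istream_iterator, the construct
// that the second compiler error points at.
std::vector<std::string> split_whitespace(const std::string& text) {
    std::istringstream iss(text);
    return std::vector<std::string>(std::istream_iterator<std::string>{iss},
                                    std::istream_iterator<std::string>{});
}
```

Whether a missing include compiles anyway depends on what the standard library's other headers happen to include, which can differ between compilers and versions.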

Even with q4_0 the results are pretty good! I got this image with the example prompt: output

Thanks for the feedback. Following your advice, I've addressed this issue in the latest commit. This compilation error might have occurred due to differences in compiler implementations. I tested with MSVC and GCC and didn't encounter this problem. May I ask which compiler you are using?

I am using gcc (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0, which should be the current version of GCC in Ubuntu-latest.

I'm using gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04). Haha, environmental issues can indeed be quite frustrating.

ah yes, a fellow ubuntu20.04 user stuck on lts :rofl:

Cool stuff!

Here is a sample run on M2 Ultra:

$ ▶ ./sd -m ../models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat" -t 12
[INFO]  stable-diffusion.cpp:2191 - loading model from '../models/sd-v1-4-ggml-model-f16.bin'
[INFO]  stable-diffusion.cpp:2216 - ftype: f16
[INFO]  stable-diffusion.cpp:2261 - params ctx size =  1970.08 MB
[INFO]  stable-diffusion.cpp:2401 - loading model from '../models/sd-v1-4-ggml-model-f16.bin' completed, taking 0.72s
[INFO]  stable-diffusion.cpp:2482 - condition graph use 13.11MB of memory: static 10.17MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2482 - condition graph use 13.11MB of memory: static 10.17MB, dynamic = 2.93MB
[INFO]  stable-diffusion.cpp:2824 - get_learned_condition completed, taking 0.12s
[INFO]  stable-diffusion.cpp:2832 - start sampling
[INFO]  stable-diffusion.cpp:2676 - step 1 sampling completed, taking 5.42s
[INFO]  stable-diffusion.cpp:2676 - step 2 sampling completed, taking 5.35s
[INFO]  stable-diffusion.cpp:2676 - step 3 sampling completed, taking 5.34s
[INFO]  stable-diffusion.cpp:2676 - step 4 sampling completed, taking 5.35s
[INFO]  stable-diffusion.cpp:2676 - step 5 sampling completed, taking 5.30s
[INFO]  stable-diffusion.cpp:2676 - step 6 sampling completed, taking 5.34s
[INFO]  stable-diffusion.cpp:2676 - step 7 sampling completed, taking 5.36s
[INFO]  stable-diffusion.cpp:2676 - step 8 sampling completed, taking 5.47s
[INFO]  stable-diffusion.cpp:2676 - step 9 sampling completed, taking 5.34s
[INFO]  stable-diffusion.cpp:2676 - step 10 sampling completed, taking 5.37s
[INFO]  stable-diffusion.cpp:2676 - step 11 sampling completed, taking 5.33s
[INFO]  stable-diffusion.cpp:2676 - step 12 sampling completed, taking 5.34s
[INFO]  stable-diffusion.cpp:2676 - step 13 sampling completed, taking 5.33s
[INFO]  stable-diffusion.cpp:2676 - step 14 sampling completed, taking 5.34s
[INFO]  stable-diffusion.cpp:2676 - step 15 sampling completed, taking 5.34s
[INFO]  stable-diffusion.cpp:2676 - step 16 sampling completed, taking 5.33s
[INFO]  stable-diffusion.cpp:2676 - step 17 sampling completed, taking 5.39s
[INFO]  stable-diffusion.cpp:2676 - step 18 sampling completed, taking 5.36s
[INFO]  stable-diffusion.cpp:2676 - step 19 sampling completed, taking 5.34s
[INFO]  stable-diffusion.cpp:2676 - step 20 sampling completed, taking 5.38s
[INFO]  stable-diffusion.cpp:2691 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO]  stable-diffusion.cpp:2837 - sampling completed, taking 107.12s
[INFO]  stable-diffusion.cpp:2771 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO]  stable-diffusion.cpp:2844 - decode_first_stage completed, taking 17.86s
[INFO]  stable-diffusion.cpp:2850 - txt2img completed in 125.10s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1969.94MB
save result image to 'output.png'

output

another seed

![output](https://github.com/leejet/stable-diffusion.cpp/assets/1991296/aa3e66eb-880c-4a6e-b72d-db239682889c)

What is currently the main performance bottleneck? Here is a rough breakdown of the time per op using GGML_PERF:

perf_total_per_op_us[             ADD] =   169.408 ms
perf_total_per_op_us[             MUL] =   154.503 ms
perf_total_per_op_us[          REPEAT] =   308.208 ms
perf_total_per_op_us[          CONCAT] =     8.171 ms
perf_total_per_op_us[            GELU] =     4.251 ms
perf_total_per_op_us[            SILU] =     3.978 ms
perf_total_per_op_us[            NORM] =    41.288 ms
perf_total_per_op_us[      GROUP_NORM] =    24.921 ms
perf_total_per_op_us[         MUL_MAT] =  1258.711 ms
perf_total_per_op_us[           SCALE] =    47.123 ms
perf_total_per_op_us[            CONT] =   130.151 ms
perf_total_per_op_us[         RESHAPE] =     0.970 ms
perf_total_per_op_us[            VIEW] =     0.108 ms
perf_total_per_op_us[         PERMUTE] =     0.235 ms
perf_total_per_op_us[        SOFT_MAX] =   135.226 ms
perf_total_per_op_us[         CONV_2D] =  2795.054 ms
perf_total_per_op_us[         UPSCALE] =     4.307 ms

Looks like CONV_2D needs some work.
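
To put a number on that, here is a small sketch that sums the per-op totals and computes each op's share. The function names are made up for illustration; the values are copied from the GGML_PERF breakdown, and the shares are relative to the summed op time only, not wall-clock time.

```cpp
#include <string>
#include <utility>
#include <vector>

using OpTime = std::pair<std::string, double>;  // op name, total time in ms

// Per-op totals copied from the GGML_PERF output.
std::vector<OpTime> perf_totals_ms() {
    return {{"ADD", 169.408},      {"MUL", 154.503},     {"REPEAT", 308.208},
            {"CONCAT", 8.171},     {"GELU", 4.251},      {"SILU", 3.978},
            {"NORM", 41.288},      {"GROUP_NORM", 24.921},
            {"MUL_MAT", 1258.711}, {"SCALE", 47.123},    {"CONT", 130.151},
            {"RESHAPE", 0.970},    {"VIEW", 0.108},      {"PERMUTE", 0.235},
            {"SOFT_MAX", 135.226}, {"CONV_2D", 2795.054}, {"UPSCALE", 4.307}};
}

// Sum of all per-op totals in ms.
double total_ms(const std::vector<OpTime>& ops) {
    double sum = 0.0;
    for (const auto& op : ops) sum += op.second;
    return sum;
}

// Fraction of the summed op time spent in the named op (0.0 if not found).
double share(const std::vector<OpTime>& ops, const std::string& name) {
    const double total = total_ms(ops);
    for (const auto& op : ops) {
        if (op.first == name) return op.second / total;
    }
    return 0.0;
}
```

With these numbers, `share(perf_totals_ms(), "CONV_2D")` is a bit over half of the summed op time and MUL_MAT is about a quarter.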

Would be nice to upstream the new ggml operators at some point. Not sure about the "dynamic mode" though
The "concat" might be possible to achieve via view + cpy

Thank you for the feedback, and thank you for creating something as amazing as ggml.

> Would be nice to upstream the new ggml operators at some point. Not sure about the "dynamic mode" though

OK, I will clean up the code for the new operators and upstream them later. I'm also considering whether to upstream the "dynamic mode".

> The "concat" might be possible to achieve via view + cpy

I've tried that before, but it seems that combining view + cpy cannot fulfill the concatenation requirement along dim=1.

Any plans for sdxl?

I'm willing to implement SDXL once I've improved the support for SD 1.x and added support for SD 2.x.

Took a stab at a larger resolution 768x768

$ ./sd -t 12 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v -H 768 -W 768

Details

```
Option:
    n_threads: 12
    mode: txt2img
    model_path: ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin
    output_path: output.png
    init_img:
    prompt: photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale: 7.00
    width: 768
    height: 768
    sample_method: eular a
    sample_steps: 20
    strength: 0.75
    seed: 42
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO] stable-diffusion.cpp:2500 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO] stable-diffusion.cpp:2525 - ftype: q8_0
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO] stable-diffusion.cpp:2570 - params ctx size = 1618.72 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size = 1618.31MB
[INFO] stable-diffusion.cpp:2715 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.42s
[DEBUG] stable-diffusion.cpp:353 - split prompt "photo of a lovely cat, high quality" to tokens ["photo", "of", "a", "lovely", "cat", ",", "high", "quality", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.06s
[INFO] stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353 - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry", ",", "ugly", ",", "<|endoftext|>", "compression", ",", "artifacts", ",", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.07s
[INFO] stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:3243 - get_learned_condition completed, taking 0.13s
[INFO] stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2846 - diffusion context need 153.98MB static memory, with work_size needing 151.88MB
[INFO] stable-diffusion.cpp:2989 - step 1 sampling completed, taking 40.92s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 2 sampling completed, taking 40.72s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 3 sampling completed, taking 40.61s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 4 sampling completed, taking 41.12s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 5 sampling completed, taking 42.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 6 sampling completed, taking 44.84s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 7 sampling completed, taking 40.99s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 8 sampling completed, taking 40.95s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 9 sampling completed, taking 40.95s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 10 sampling completed, taking 40.93s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 11 sampling completed, taking 41.00s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 12 sampling completed, taking 40.87s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 13 sampling completed, taking 41.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 14 sampling completed, taking 40.99s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 15 sampling completed, taking 40.93s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 16 sampling completed, taking 40.87s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 17 sampling completed, taking 40.98s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 18 sampling completed, taking 40.96s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB
[DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet
[INFO] 
stable-diffusion.cpp:2989 - step 19 sampling completed, taking 40.78s [DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB [DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet [INFO] stable-diffusion.cpp:2989 - step 20 sampling completed, taking 40.79s [DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB [DEBUG] stable-diffusion.cpp:2994 - 147456 bytes of dynamic memory has not been released yet [INFO] stable-diffusion.cpp:3001 - diffusion graph use 2832.02MB of memory: static 153.98MB, dynamic = 2678.04MB [DEBUG] stable-diffusion.cpp:3005 - 147456 bytes of dynamic memory has not been released yet [INFO] stable-diffusion.cpp:3258 - sampling completed, taking 824.76s [DEBUG] stable-diffusion.cpp:3153 - vae context need 2593.12MB static memory, with work_size needing 2592.00MB [DEBUG] stable-diffusion.cpp:3179 - computing vae graph completed, taking 104.12s [INFO] stable-diffusion.cpp:3188 - vae graph use 5271.16MB of memory: static 2593.12MB, dynamic = 2678.04MB [DEBUG] stable-diffusion.cpp:3192 - 7077888 bytes of dynamic memory has not been released yet [INFO] stable-diffusion.cpp:3265 - decode_first_stage completed, taking 104.21s [INFO] stable-diffusion.cpp:3266 - txt2img completed in 929.10s, with a runtime memory usage of 5271.16MB and parameter memory usage of 1618.58MB save result image to 'output.png' ```

output

unsurprisingly it takes way (way) longer:

[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 40.92s

Wow, this is so cool. Easy to convert existing models, quantization.. very nice.

https://github.com/bes-dev/stable_diffusion.openvino <- this is way faster though, probably due to it using OpenVINO.

My little benchmark suggests the bottleneck is not the model file but the dynamic data. What number type do you use for it? llama.cpp has shown little to no quality degradation when using f16 instead of f32 for the kv-cache.
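To make the f16-versus-f32 question a bit more concrete, here is a small sketch that uses only Python's standard library. It round-trips a few values through IEEE half precision and reports the storage size and the worst relative rounding error. The sample values and the helper name `to_f16` are made up for illustration, and none of this says anything about how stable-diffusion.cpp or ggml actually store their intermediate tensors.

```python
import struct

def to_f16(x: float) -> float:
    """Round x to IEEE 754 half precision and back (illustrative helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

# Hypothetical activation values; not taken from any real run.
values = [0.1, 0.5, 1.0, 3.14159, 10.0, 100.0]

# Storage cost of the same number of elements in each type.
bytes_f32 = struct.calcsize("<f") * len(values)
bytes_f16 = struct.calcsize("<e") * len(values)

# Worst-case relative error introduced by the f16 round trip.
max_rel_err = max(abs(to_f16(v) - v) / abs(v) for v in values)

print(f"f32: {bytes_f32} bytes, f16: {bytes_f16} bytes")
print(f"max relative rounding error in f16: {max_rel_err:.2e}")
```

Half precision halves the storage, and for values in its normal range the rounding error stays below about 2^-11 per element. Whether that is acceptable for diffusion activations would still have to be checked on real images.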

I've implemented a memory optimization, and now when using txt2img with fp16 precision to generate a 512x512 image, it only requires 2.3GB.

Oh, yeah. Now I'm working hard to make it run faster.

Is this already on master? I reran my diffusion run from above and got similar timings and memory usage (though the memory reporting seems to have changed).

Details

```
$ ./sd -t 12 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v -H 768 -W 768
Option:
    n_threads: 12
    mode: txt2img
    model_path: ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin
    output_path: output.png
    init_img:
    prompt: photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale: 7.00
    width: 768
    height: 768
    sample_method: eular a
    sample_steps: 20
    strength: 0.75
    seed: 42
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2525 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[DEBUG] stable-diffusion.cpp:2533 - verifying magic
[DEBUG] stable-diffusion.cpp:2544 - loading hparams
[INFO]  stable-diffusion.cpp:2550 - ftype: q8_0
[DEBUG] stable-diffusion.cpp:2556 - loading vocab
[DEBUG] stable-diffusion.cpp:2584 - ggml tensor size = 272 bytes
[DEBUG] stable-diffusion.cpp:2589 - clip params ctx size = 126.32 MB
[DEBUG] stable-diffusion.cpp:2608 - unet params ctx size = 1399.91 MB
[DEBUG] stable-diffusion.cpp:2629 - vae params ctx size = 95.51 MB
[DEBUG] stable-diffusion.cpp:2650 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2666 - loading weights
[DEBUG] stable-diffusion.cpp:2770 - model size = 1618.31MB
[INFO]  stable-diffusion.cpp:2775 - total params size = 1618.61MB (clip 125.09MB, unet 1399.01MB, vae 94.51MB)
[INFO]  stable-diffusion.cpp:2781 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 1.03s
[DEBUG] stable-diffusion.cpp:353 - split prompt "photo of a lovely cat, high quality" to tokens ["photo", "of", "a", "lovely", "cat", ",", "high", "quality", ]
[DEBUG] stable-diffusion.cpp:2816 - condition context need 1.43MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2841 - building condition graph completed: 633 nodes, 210 leafs
[DEBUG] stable-diffusion.cpp:2849 - computing condition graph completed, taking 0.06s
[INFO]  stable-diffusion.cpp:2866 - condition graph use 129.45MB of memory: params 125.09MB, runtime 4.36MB (static 1.43MB, dynamic 2.93MB)
[DEBUG] stable-diffusion.cpp:2875 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353 - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry", ",", "ugly", ",", "<|endoftext|>", "compression", ",", "artifacts", ",", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2816 - condition context need 1.43MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2841 - building condition graph completed: 633 nodes, 210 leafs
[DEBUG] stable-diffusion.cpp:2849 - computing condition graph completed, taking 0.06s
[INFO]  stable-diffusion.cpp:2866 - condition graph use 129.45MB of memory: params 125.09MB, runtime 4.36MB (static 1.43MB, dynamic 2.93MB)
[DEBUG] stable-diffusion.cpp:2875 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3359 - get_learned_condition completed, taking 0.13s
[INFO]  stable-diffusion.cpp:3375 - start sampling
[DEBUG] stable-diffusion.cpp:2924 - diffusion context need 154.01MB static memory, with work_size needing 151.88MB
[INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 40.27s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 41.21s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 42.13s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 4 sampling completed, taking 40.74s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 5 sampling completed, taking 42.93s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 6 sampling completed, taking 42.41s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 7 sampling completed, taking 40.61s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 8 sampling completed, taking 42.78s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 9 sampling completed, taking 42.33s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 10 sampling completed, taking 46.70s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 11 sampling completed, taking 42.42s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 12 sampling completed, taking 44.02s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 13 sampling completed, taking 44.62s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 14 sampling completed, taking 42.32s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 15 sampling completed, taking 40.55s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 16 sampling completed, taking 40.74s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 17 sampling completed, taking 40.20s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 18 sampling completed, taking 43.68s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 19 sampling completed, taking 40.92s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 20 sampling completed, taking 40.75s
[DEBUG] stable-diffusion.cpp:3068 - diffusion graph use 2832.05MB runtime memory: static 154.01MB, dynamic 2678.04MB
[DEBUG] stable-diffusion.cpp:3072 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3087 - diffusion graph use 4231.06MB of memory: params 1399.01MB, runtime 2832.05MB (static 154.01MB, dynamic 2678.04MB)
[DEBUG] stable-diffusion.cpp:3095 - 147456 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3380 - sampling completed, taking 842.36s
[DEBUG] stable-diffusion.cpp:3254 - vae context need 2593.12MB static memory, with work_size needing 2592.00MB
[DEBUG] stable-diffusion.cpp:3280 - computing vae graph completed, taking 108.53s
[INFO]  stable-diffusion.cpp:3296 - vae graph use 5365.67MB of memory: params 94.51MB, runtime 5271.16MB (static 2593.12MB, dynamic 2678.04MB)
[DEBUG] stable-diffusion.cpp:3304 - 7077888 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3393 - decode_first_stage completed, taking 108.67s
[INFO]  stable-diffusion.cpp:3401 - txt2img completed in 951.16s, use 5365.67MB of memory: peak params memory 1618.61MB, peak runtime memory 5271.16MB
save result image to 'output.png'
```

$ ./sd -t 12 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v -H 768 -W 768

Since you are generating 768x768 images, the runtime memory grows accordingly. There is still room for optimization.
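A rough sketch of why resolution matters so much, assuming the usual Stable Diffusion v1 layout where the VAE downsamples each image side by 8. Buffers that scale with the latent area grow by 2.25x going from 512 to 768, while anything shaped like a self-attention score matrix scales with the square of that. The memory figures are copied from the verbose logs posted in this thread; the 512x512 and 768x768 runs come from different machines and model types, so treat the comparison as indicative only.

```python
VAE_DOWNSAMPLE = 8  # assumption: SD v1 latents are 1/8 of the image side

def latent_positions(width: int, height: int) -> int:
    """Number of spatial positions in the latent for a given image size."""
    return (width // VAE_DOWNSAMPLE) * (height // VAE_DOWNSAMPLE)

area_ratio = latent_positions(768, 768) / latent_positions(512, 512)
attention_ratio = area_ratio ** 2  # score matrices are positions x positions

# Figures in MB, keyed by image side, taken from the logs in this thread.
diffusion_work_size = {512: 67.50, 768: 151.88}
diffusion_dynamic = {512: 554.21, 768: 2678.04}

work_ratio = diffusion_work_size[768] / diffusion_work_size[512]
dynamic_ratio = diffusion_dynamic[768] / diffusion_dynamic[512]

print(f"latent area ratio:         {area_ratio:.2f}x")
print(f"attention (area^2) ratio:  {attention_ratio:.2f}x")
print(f"logged work_size ratio:    {work_ratio:.2f}x")
print(f"logged dynamic mem ratio:  {dynamic_ratio:.2f}x")
```

The logged `work_size` tracks the latent area almost exactly, and the dynamic memory lands between linear and quadratic growth, which fits a mix of convolution and attention buffers.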

@leejet I don't think that is how that label is supposed to be used :smile:

You're right, I made a mistake. I accidentally clicked on it while browsing, it wasn't my intention.

Found this repo thanks to HN (Hacker News). Had 0 issues when trying it out for the first time yesterday.

Just to share, I've added two outputs of v1-5, in f16 and q4_1. They come from my MBP 16" (2021, M1 Pro, 16GB/512GB).

f16

v1-5-pruned-emaonly-ggml-model-f16.bin

> ./sd -m v1-5-pruned-emaonly-ggml-model-f16.bin -p "a lovely cat" 
[INFO]  stable-diffusion.cpp:2525 - loading model from 'v1-5-pruned-emaonly-ggml-model-f16.bin'
[INFO]  stable-diffusion.cpp:2550 - ftype: f16
[INFO]  stable-diffusion.cpp:2779 - total params size = 1969.97MB (clip 235.01MB, unet 1640.45MB, vae 94.51MB)
[INFO]  stable-diffusion.cpp:2781 - loading model from 'v1-5-pruned-emaonly-ggml-model-f16.bin' completed, taking 1.85s
[INFO]  stable-diffusion.cpp:2873 - condition graph use 248.13MB of memory: params 235.01MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:2873 - condition graph use 248.13MB of memory: params 235.01MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:3359 - get_learned_condition completed, taking 0.19s
[INFO]  stable-diffusion.cpp:3375 - start sampling
[INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.74s
[INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.11s
[INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.33s
[INFO]  stable-diffusion.cpp:3067 - step 4 sampling completed, taking 9.37s
[INFO]  stable-diffusion.cpp:3067 - step 5 sampling completed, taking 9.52s
[INFO]  stable-diffusion.cpp:3067 - step 6 sampling completed, taking 8.95s
[INFO]  stable-diffusion.cpp:3067 - step 7 sampling completed, taking 9.90s
[INFO]  stable-diffusion.cpp:3067 - step 8 sampling completed, taking 9.54s
[INFO]  stable-diffusion.cpp:3067 - step 9 sampling completed, taking 8.95s
[INFO]  stable-diffusion.cpp:3067 - step 10 sampling completed, taking 9.21s
[INFO]  stable-diffusion.cpp:3067 - step 11 sampling completed, taking 9.00s
[INFO]  stable-diffusion.cpp:3067 - step 12 sampling completed, taking 9.49s
[INFO]  stable-diffusion.cpp:3067 - step 13 sampling completed, taking 9.43s
[INFO]  stable-diffusion.cpp:3067 - step 14 sampling completed, taking 9.38s
[INFO]  stable-diffusion.cpp:3067 - step 15 sampling completed, taking 9.16s
[INFO]  stable-diffusion.cpp:3067 - step 16 sampling completed, taking 9.01s
[INFO]  stable-diffusion.cpp:3067 - step 17 sampling completed, taking 8.92s
[INFO]  stable-diffusion.cpp:3067 - step 18 sampling completed, taking 9.44s
[INFO]  stable-diffusion.cpp:3067 - step 19 sampling completed, taking 9.68s
[INFO]  stable-diffusion.cpp:3067 - step 20 sampling completed, taking 9.53s
[INFO]  stable-diffusion.cpp:3094 - diffusion graph use 2264.22MB of memory: params 1640.45MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO]  stable-diffusion.cpp:3380 - sampling completed, taking 186.68s
[INFO]  stable-diffusion.cpp:3303 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO]  stable-diffusion.cpp:3393 - decode_first_stage completed, taking 28.31s
[INFO]  stable-diffusion.cpp:3407 - txt2img completed in 215.18s, use 2358.73MB of memory: peak params memory 1969.97MB, peak runtime memory 2177.12MB
save result image to 'output.png'
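For anyone comparing runs like the one above, a small sketch of how the per-step timings could be pulled out of such a log. The regular expression is written against the log format shown in this thread; the helper name `step_times` is hypothetical, and only the first three step lines are included to keep the sample short.

```python
import re

# Matches lines such as "... - step 3 sampling completed, taking 9.33s"
STEP_RE = re.compile(r"step (\d+) sampling completed, taking ([\d.]+)s")

def step_times(log_text: str) -> list:
    """Return the per-step sampling times (seconds) found in a log."""
    return [float(seconds) for _, seconds in STEP_RE.findall(log_text)]

# First three step lines of the f16 run above.
sample_log = """\
[INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.74s
[INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.11s
[INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.33s
"""

times = step_times(sample_log)
mean_step = sum(times) / len(times)

print(f"{len(times)} steps, mean {mean_step:.2f}s, total {sum(times):.2f}s")
```

Feeding it the full log should give a total close to the `sampling completed` figure the program reports itself.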

Expand to see re-run in verbose

```
> ./sd -m v1-5-pruned-emaonly-ggml-model-f16.bin -p "a lovely cat" -v
Option:
    n_threads: 8
    mode: txt2img
    model_path: v1-5-pruned-emaonly-ggml-model-f16.bin
    output_path: output.png
    init_img:
    prompt: a lovely cat
    negative_prompt:
    cfg_scale: 7.00
    width: 512
    height: 512
    sample_method: eular a
    sample_steps: 20
    strength: 0.75
    seed: 42
System Info:
    BLAS = 1
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 1
    ARM_FMA = 1
    F16C = 0
    FP16_VA = 1
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2525 - loading model from 'v1-5-pruned-emaonly-ggml-model-f16.bin'
[DEBUG] stable-diffusion.cpp:2533 - verifying magic
[DEBUG] stable-diffusion.cpp:2544 - loading hparams
[INFO]  stable-diffusion.cpp:2550 - ftype: f16
[DEBUG] stable-diffusion.cpp:2556 - loading vocab
[DEBUG] stable-diffusion.cpp:2584 - ggml tensor size = 272 bytes
[DEBUG] stable-diffusion.cpp:2589 - clip params ctx size = 236.23 MB
[DEBUG] stable-diffusion.cpp:2608 - unet params ctx size = 1641.36 MB
[DEBUG] stable-diffusion.cpp:2629 - vae params ctx size = 95.51 MB
[DEBUG] stable-diffusion.cpp:2650 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2666 - loading weights
[DEBUG] stable-diffusion.cpp:2770 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:2779 - total params size = 1969.97MB (clip 235.01MB, unet 1640.45MB, vae 94.51MB)
[INFO]  stable-diffusion.cpp:2781 - loading model from 'v1-5-pruned-emaonly-ggml-model-f16.bin' completed, taking 1.84s
[DEBUG] stable-diffusion.cpp:353 - split prompt "a lovely cat" to tokens ["a", "lovely", "cat", ]
[DEBUG] stable-diffusion.cpp:2818 - condition context need 10.19MB static memory, with work_size needing 9.00MB
[DEBUG] stable-diffusion.cpp:2842 - building condition graph completed: 633 nodes, 210 leafs
[DEBUG] stable-diffusion.cpp:2849 - computing condition graph completed, taking 0.09s
[INFO]  stable-diffusion.cpp:2873 - condition graph use 248.13MB of memory: params 235.01MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[DEBUG] stable-diffusion.cpp:2875 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353 - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:2818 - condition context need 10.19MB static memory, with work_size needing 9.00MB
[DEBUG] stable-diffusion.cpp:2842 - building condition graph completed: 633 nodes, 210 leafs
[DEBUG] stable-diffusion.cpp:2849 - computing condition graph completed, taking 0.13s
[INFO]  stable-diffusion.cpp:2873 - condition graph use 248.13MB of memory: params 235.01MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[DEBUG] stable-diffusion.cpp:2875 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3359 - get_learned_condition completed, taking 0.21s
[INFO]  stable-diffusion.cpp:3375 - start sampling
[DEBUG] stable-diffusion.cpp:2926 - diffusion context need 69.56MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.70s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.55s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.80s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 4 sampling completed, taking 9.65s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 5 sampling completed, taking 9.20s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 6 sampling completed, taking 9.55s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 7 sampling completed, taking 9.52s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 8 sampling completed, taking 9.12s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 9 sampling completed, taking 10.00s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 10 sampling completed, taking 9.40s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 11 sampling completed, taking 9.65s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 12 sampling completed, taking 9.91s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 13 sampling completed, taking 10.75s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 14 sampling completed, taking 9.75s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 15 sampling completed, taking 9.51s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 16 sampling completed, taking 10.05s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 17 sampling completed, taking 9.73s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 18 sampling completed, taking 10.15s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 19 sampling completed, taking 9.80s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 20 sampling completed, taking 9.50s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3094 - diffusion graph use 2264.22MB of memory: params 1640.45MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[DEBUG] stable-diffusion.cpp:3095 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3380 - sampling completed, taking 194.27s
[DEBUG] stable-diffusion.cpp:3256 - vae context need 1153.12MB static memory, with work_size needing 1152.00MB
[DEBUG] stable-diffusion.cpp:3280 - computing vae graph completed, taking 27.42s
[INFO]  stable-diffusion.cpp:3303 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[DEBUG] stable-diffusion.cpp:3304 - 3145728 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3393 - decode_first_stage completed, taking 27.47s
[INFO]  stable-diffusion.cpp:3407 - txt2img completed in 221.96s, use 2358.73MB of memory: peak params memory 1969.97MB, peak runtime memory 2177.12MB
save result image to 'output.png'
```

q4_1

v1-5-pruned-emaonly-ggml-model-q4_1.bin

> ./sd -m v1-5-pruned-emaonly-ggml-model-q4_1.bin -p "a lovely cat"
[INFO]  stable-diffusion.cpp:2525 - loading model from 'v1-5-pruned-emaonly-ggml-model-q4_1.bin'
[INFO]  stable-diffusion.cpp:2550 - ftype: q4_1
[INFO]  stable-diffusion.cpp:2779 - total params size = 1454.64MB (clip 73.80MB, unet 1286.34MB, vae 94.51MB)
[INFO]  stable-diffusion.cpp:2781 - loading model from 'v1-5-pruned-emaonly-ggml-model-q4_1.bin' completed, taking 1.38s
[INFO]  stable-diffusion.cpp:2873 - condition graph use 86.92MB of memory: params 73.80MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:2873 - condition graph use 86.92MB of memory: params 73.80MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO]  stable-diffusion.cpp:3359 - get_learned_condition completed, taking 0.23s
[INFO]  stable-diffusion.cpp:3375 - start sampling
[INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.72s
[INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.11s
[INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.12s
[INFO]  stable-diffusion.cpp:3067 - step 4 sampling completed, taking 10.69s
[INFO]  stable-diffusion.cpp:3067 - step 5 sampling completed, taking 9.75s
[INFO]  stable-diffusion.cpp:3067 - step 6 sampling completed, taking 9.51s
[INFO]  stable-diffusion.cpp:3067 - step 7 sampling completed, taking 9.36s
[INFO]  stable-diffusion.cpp:3067 - step 8 sampling completed, taking 9.35s
[INFO]  stable-diffusion.cpp:3067 - step 9 sampling completed, taking 9.66s
[INFO]  stable-diffusion.cpp:3067 - step 10 sampling completed, taking 9.52s
[INFO]  stable-diffusion.cpp:3067 - step 11 sampling completed, taking 9.36s
[INFO]  stable-diffusion.cpp:3067 - step 12 sampling completed, taking 9.26s
[INFO]  stable-diffusion.cpp:3067 - step 13 sampling completed, taking 9.56s
[INFO]  stable-diffusion.cpp:3067 - step 14 sampling completed, taking 9.56s
[INFO]  stable-diffusion.cpp:3067 - step 15 sampling completed, taking 9.38s
[INFO]  stable-diffusion.cpp:3067 - step 16 sampling completed, taking 9.39s
[INFO]  stable-diffusion.cpp:3067 - step 17 sampling completed, taking 10.35s
[INFO]  stable-diffusion.cpp:3067 - step 18 sampling completed, taking 9.48s
[INFO]  stable-diffusion.cpp:3067 - step 19 sampling completed, taking 9.48s
[INFO]  stable-diffusion.cpp:3067 - step 20 sampling completed, taking 9.46s
[INFO]  stable-diffusion.cpp:3094 - diffusion graph use 1910.11MB of memory: params 1286.34MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO]  stable-diffusion.cpp:3380 - sampling completed, taking 191.08s
[INFO]  stable-diffusion.cpp:3303 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO]  stable-diffusion.cpp:3393 - decode_first_stage completed, taking 27.91s
[INFO]  stable-diffusion.cpp:3407 - txt2img completed in 219.22s, use 2271.63MB of memory: peak params memory 1454.64MB, peak runtime memory 2177.12MB
save result image to 'output.png'

Re-run in verbose mode (-v):

```
> ./sd -m v1-5-pruned-emaonly-ggml-model-q4_1.bin -p "a lovely cat" -v
Option:
    n_threads: 8
    mode: txt2img
    model_path: v1-5-pruned-emaonly-ggml-model-q4_1.bin
    output_path: output.png
    init_img:
    prompt: a lovely cat
    negative_prompt:
    cfg_scale: 7.00
    width: 512
    height: 512
    sample_method: eular a
    sample_steps: 20
    strength: 0.75
    seed: 42
System Info:
    BLAS = 1
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 1
    ARM_FMA = 1
    F16C = 0
    FP16_VA = 1
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2525 - loading model from 'v1-5-pruned-emaonly-ggml-model-q4_1.bin'
[DEBUG] stable-diffusion.cpp:2533 - verifying magic
[DEBUG] stable-diffusion.cpp:2544 - loading hparams
[INFO]  stable-diffusion.cpp:2550 - ftype: q4_1
[DEBUG] stable-diffusion.cpp:2556 - loading vocab
[DEBUG] stable-diffusion.cpp:2584 - ggml tensor size = 272 bytes
[DEBUG] stable-diffusion.cpp:2589 - clip params ctx size = 75.02 MB
[DEBUG] stable-diffusion.cpp:2608 - unet params ctx size = 1287.24 MB
[DEBUG] stable-diffusion.cpp:2629 - vae params ctx size = 95.51 MB
[DEBUG] stable-diffusion.cpp:2650 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2666 - loading weights
[DEBUG] stable-diffusion.cpp:2770 - model size = 1454.34MB
[INFO]  stable-diffusion.cpp:2779 - total params size = 1454.64MB (clip 73.80MB, unet 1286.34MB, vae 94.51MB)
[INFO]  stable-diffusion.cpp:2781 - loading model from 'v1-5-pruned-emaonly-ggml-model-q4_1.bin' completed, taking 0.87s
[DEBUG] stable-diffusion.cpp:353 - split prompt "a lovely cat" to tokens ["a", "lovely", "cat", ]
[DEBUG] stable-diffusion.cpp:2818 - condition context need 10.19MB static memory, with work_size needing 9.00MB
[DEBUG] stable-diffusion.cpp:2842 - building condition graph completed: 633 nodes, 210 leafs
[DEBUG] stable-diffusion.cpp:2849 - computing condition graph completed, taking 0.11s
[INFO]  stable-diffusion.cpp:2873 - condition graph use 86.92MB of memory: params 73.80MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[DEBUG] stable-diffusion.cpp:2875 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353 - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:2818 - condition context need 10.19MB static memory, with work_size needing 9.00MB
[DEBUG] stable-diffusion.cpp:2842 - building condition graph completed: 633 nodes, 210 leafs
[DEBUG] stable-diffusion.cpp:2849 - computing condition graph completed, taking 0.09s
[INFO]  stable-diffusion.cpp:2873 - condition graph use 86.92MB of memory: params 73.80MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[DEBUG] stable-diffusion.cpp:2875 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3359 - get_learned_condition completed, taking 0.19s
[INFO]  stable-diffusion.cpp:3375 - start sampling
[DEBUG] stable-diffusion.cpp:2926 - diffusion context need 69.56MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 10.71s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.55s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.71s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 4 sampling completed, taking 9.73s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 5 sampling completed, taking 9.40s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 6 sampling completed, taking 9.15s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 7 sampling completed, taking 9.34s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 8 sampling completed, taking 9.51s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 9 sampling completed, taking 9.54s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 10 sampling completed, taking 9.44s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 11 sampling completed, taking 9.78s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 12 sampling completed, taking 9.44s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 13 sampling completed, taking 9.38s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 14 sampling completed, taking 9.49s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 15 sampling completed, taking 9.29s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 16 sampling completed, taking 9.40s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 17 sampling completed, taking 9.44s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 18 sampling completed, taking 9.72s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 19 sampling completed, taking 10.18s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3067 - step 20 sampling completed, taking 9.95s
[DEBUG] stable-diffusion.cpp:3071 - diffusion graph use 623.77MB runtime memory: static 69.56MB, dynamic 554.21MB
[DEBUG] stable-diffusion.cpp:3072 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3094 - diffusion graph use 1910.11MB of memory: params 1286.34MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[DEBUG] stable-diffusion.cpp:3095 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3380 - sampling completed, taking 192.17s
[DEBUG] stable-diffusion.cpp:3256 - vae context need 1153.12MB static memory, with work_size needing 1152.00MB
[DEBUG] stable-diffusion.cpp:3280 - computing vae graph completed, taking 28.61s
[INFO]  stable-diffusion.cpp:3303 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[DEBUG] stable-diffusion.cpp:3304 - 3145728 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3393 - decode_first_stage completed, taking 28.64s
[INFO]  stable-diffusion.cpp:3407 - txt2img completed in 221.00s, use 2271.63MB of memory: peak params memory 1454.64MB, peak runtime memory 2177.12MB
save result image to 'output.png'
```
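
For anyone who wants to compare runs without eyeballing the logs, here is a minimal sketch (not part of stable-diffusion.cpp) that pulls the per-step sampling times out of log text like the above. The `step_times` helper and its regex are my own assumptions based on the log format shown in this thread; the sample input is the first three step lines of the non-verbose q4_1 run.

```python
import re

# Matches lines such as:
# [INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.72s
# (pattern is an assumption derived from the logs in this thread)
STEP_RE = re.compile(r"step (\d+) sampling completed, taking ([\d.]+)s")


def step_times(log_text):
    """Return the per-step sampling times (in seconds) found in log_text."""
    return [float(m.group(2)) for m in STEP_RE.finditer(log_text)]


# First three step lines of the q4_1 run, used as sample input.
sample = """\
[INFO]  stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.72s
[INFO]  stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.11s
[INFO]  stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.12s
"""

times = step_times(sample)
total = sum(times)
mean = total / len(times)
print(f"{len(times)} steps, total {total:.2f}s, mean {mean:.2f}s/step")
```

Because the regex scans the whole string, it should also cope with the flattened verbose output. Feeding it a full 20-step log gives a total that can be checked against the figure on the `sampling completed` line.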

Any chance we could get OpenVINO support? Would help a lot!

leejet / stable-diffusion.cpp
