Open Green-Sky opened 1 year ago
Hey, finally stable diffusion for ggml 😄
Did a test run
$ ./sd -t 8 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal"
[INFO] stable-diffusion.cpp:2189 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[INFO] stable-diffusion.cpp:2214 - ftype: q8_0
[INFO] stable-diffusion.cpp:2259 - params ctx size = 1618.72 MB
[INFO] stable-diffusion.cpp:2399 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.46s
[INFO] stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO] stable-diffusion.cpp:2477 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[INFO] stable-diffusion.cpp:2822 - get_learned_condition completed, taking 0.16s
[INFO] stable-diffusion.cpp:2830 - start sampling
[INFO] stable-diffusion.cpp:2674 - step 1 sampling completed, taking 18.34s
[INFO] stable-diffusion.cpp:2674 - step 2 sampling completed, taking 18.24s
[INFO] stable-diffusion.cpp:2674 - step 3 sampling completed, taking 18.65s
[INFO] stable-diffusion.cpp:2674 - step 4 sampling completed, taking 18.41s
[INFO] stable-diffusion.cpp:2674 - step 5 sampling completed, taking 18.31s
[INFO] stable-diffusion.cpp:2674 - step 6 sampling completed, taking 18.18s
[INFO] stable-diffusion.cpp:2674 - step 7 sampling completed, taking 18.21s
[INFO] stable-diffusion.cpp:2674 - step 8 sampling completed, taking 18.29s
[INFO] stable-diffusion.cpp:2674 - step 9 sampling completed, taking 18.21s
[INFO] stable-diffusion.cpp:2674 - step 10 sampling completed, taking 18.28s
[INFO] stable-diffusion.cpp:2674 - step 11 sampling completed, taking 18.19s
[INFO] stable-diffusion.cpp:2674 - step 12 sampling completed, taking 18.00s
[INFO] stable-diffusion.cpp:2674 - step 13 sampling completed, taking 18.03s
[INFO] stable-diffusion.cpp:2674 - step 14 sampling completed, taking 18.54s
[INFO] stable-diffusion.cpp:2674 - step 15 sampling completed, taking 18.32s
[INFO] stable-diffusion.cpp:2674 - step 16 sampling completed, taking 18.41s
[INFO] stable-diffusion.cpp:2674 - step 17 sampling completed, taking 18.29s
[INFO] stable-diffusion.cpp:2674 - step 18 sampling completed, taking 18.51s
[INFO] stable-diffusion.cpp:2674 - step 19 sampling completed, taking 18.62s
[INFO] stable-diffusion.cpp:2674 - step 20 sampling completed, taking 18.11s
[INFO] stable-diffusion.cpp:2686 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO] stable-diffusion.cpp:2835 - sampling completed, taking 366.14s
[INFO] stable-diffusion.cpp:2766 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO] stable-diffusion.cpp:2842 - decode_first_stage completed, taking 57.66s
[INFO] stable-diffusion.cpp:2843 - txt2img completed in 423.96s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'
Pain point: the extra python libs for conversion. Got a pip install error bc i have an incompatible version of something installed already, convert.py worked anyway though. :)
Timings: I used the q8_0 quantization and ran with different thread counts. I have a 12-core (24-thread) CPU. I took the timing of a single sampling step.
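| quant | q8_0 | q4_0 | f16 |
|---|---|---|---|
| -t 1 | 75.31s | 75.20s | 82.92s |
| -t 2 | 42.44s | | |
| -t 4 | 28.65s | 29.23s | 30.00s |
| -t 6 | 21.68s | | |
| -t 8 | 18.34s | 18.89s | 19.05s |
| -t 10 | 16.38s | 16.78s | 17.61s |
| -t 12 | 16.26s | 16.98s | 18.11s |
| -t 14 | 17.93s | | |
| -t 16 | 16.80s | | |
| -t 18 | 16.70s | | |
| -t 20 | 16.20s | | |
| -t 22 | 16.96s | | |
| -t 24 | 18.93s | | |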
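To read the timings as scaling numbers rather than raw seconds, here is a minimal sketch that turns two step timings into a speedup and a parallel efficiency. The function names are made up for illustration and are not part of stable-diffusion.cpp; the inputs are meant to be the per-step seconds reported above.

```cpp
#include <cassert>

// Speedup of an n-thread run relative to the single-thread run.
// Both arguments are seconds per sampling step.
double speedup(double t1_seconds, double tn_seconds) {
    assert(t1_seconds > 0.0 && tn_seconds > 0.0);
    return t1_seconds / tn_seconds;
}

// Parallel efficiency: 1.0 would mean perfect linear scaling.
double efficiency(double t1_seconds, double tn_seconds, int n_threads) {
    assert(n_threads > 0);
    return speedup(t1_seconds, tn_seconds) / n_threads;
}
```

With the q8_0 numbers, `speedup(75.31, 16.26)` is about 4.6 at -t 12, so the efficiency is roughly 0.39, and the timings stay more or less flat from -t 10 upward.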
Additional questions:
- do you have/plan to support token weighting? (eg: (cinematic:1.3))
- are you looking into supporting cuda/opencl backends from ggml?
- are you looking into k-quants (like llama.cpp) and some form of quality measurement of quantizations? (since k-quants use different quant for different parts of the model)
- it would be nice if the tool printed the "system line" (see https://github.com/ggerganov/llama.cpp/blob/f64d44a9b9581cd58f7ec40f4fa1c3ca5ca18e1e/llama.cpp#L4267 )
- did not see it mentioned, does it support sd 2.x / do you plan to add support for that ?
- my little benchmark suggests the bottleneck is not the model file, but the dynamic data. What number type do you use for it? llama.cpp has shown little to no degradation in quality when using f16 instead of f32 for the kv-cache.
edit: added f16 timings
Thanks for the feedback.
- Yes, I'm preparing to support a tokenizer in the style of stable-diffusion-webui, which includes token weighting.
very nice
- I'm working on adding GPU support and am currently focusing on getting ggml_conv_2d to work on the GPU, because ggml_conv_2d only supports the CPU right now.
i see
- You can add the -v or --verbose parameter, which will allow you to see the system info.
oh, i overlooked that one
[DEBUG] stable-diffusion.cpp:333 - split prompt "alps, distant alms, small church, (cinematic:1.3), intricate details, (ArtStation:1.2), nikon dlsr, masterpiece, hyperreal" to tokens ["alps</w>", ",</w>", "distant</w>", "<|endoftext|>", ",</w>", "small</w>", "church</w>", ",</w>", "(</w>", "cinematic</w>", ":</w>", "1</w>", ".</w>", "3</w>", "),</w>", "intricate</w>", "details</w>", ",</w>", "(</w>", "<|endoftext|>", ":</w>", "1</w>", ".</w>", "2</w>", "),</w>", "nikon</w>", "<|endoftext|>", ",</w>", "masterpiece</w>", ",</w>", "<|endoftext|>", ]
the tokenizer really looks like it needs some work, really surprised the image came out that good.
- Currently, only SD 1.x is supported. Support for SD 2.x will be added in the future.
good to hear.
- Yes, a relatively large amount of memory is being used to store dynamic data (which is actually an optimized outcome). GGML currently utilizes f32 to store temporary calculation results. Changing it to f16 would reduce dynamic memory usage by half. I'm currently contemplating how to modify GGML to achieve this goal.
cant wait :smile:
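The "reduce dynamic memory usage by half" point is just element-size arithmetic. A minimal sketch, assuming every temporary is a plain dense buffer; `buffer_bytes` and the two constants are hypothetical names, not ggml API, and `uint16_t` only stands in for the 2-byte storage size of an f16 value.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed to hold n_elements values of a given element size.
std::size_t buffer_bytes(std::size_t n_elements, std::size_t bytes_per_element) {
    assert(bytes_per_element > 0);
    return n_elements * bytes_per_element;
}

// f32 is a 4-byte float on common platforms; f16 occupies 2 bytes.
constexpr std::size_t kF32Bytes = sizeof(float);
constexpr std::size_t kF16Bytes = sizeof(std::uint16_t);
```

Applied to the 554.21MB of dynamic data reported for the diffusion graph, that would come to about 277MB, assuming all of it is f32 today and all of it could be switched.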
Adding my first impressions here as well. I had some compile errors in my system:
stable-diffusion.cpp/stable-diffusion.cpp: In function ‘void copy_ggml_tensor(ggml_tensor*, const ggml_tensor*)’:
stable-diffusion.cpp/stable-diffusion.cpp:171:5: error: ‘memcpy’ was not declared in this scope
171 | memcpy(((char*)dst->data), ((char*)src->data), ggml_nbytes(dst));
| ^~~~~~
stable-diffusion.cpp/stable-diffusion.cpp:16:1: note: ‘memcpy’ is defined in header ‘<cstring>’; did you forget to ‘#include <cstring>’?
15 | #include "stable-diffusion.h"
+++ |+#include <cstring>
16 |
stable-diffusion.cpp/stable-diffusion.cpp: In member function ‘std::vector<int> CLIPTokenizer::encode(std::string)’:
stable-diffusion.cpp/stable-diffusion.cpp:318:54: error: ‘istream_iterator’ is not a member of ‘std’
318 | std::vector<std::string> tokens{std::istream_iterator<std::string>{iss},
| ^~~~~~~~~~~~~~~~
stable-diffusion.cpp/stable-diffusion.cpp:16:1: note: ‘std::istream_iterator’ is defined in header ‘<iterator>’; did you forget to ‘#include <iterator>’?
After adding these includes (<cstring> and <iterator>) to stable-diffusion.cpp it worked great.
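For anyone hitting the same errors, here is a small standalone sketch of the two constructs GCC complained about, with the headers they need. The helper names `split_whitespace` and `copy_bytes` are made up for illustration; they only mirror the calls quoted in the error messages and are not the project's actual functions.

```cpp
#include <cassert>
#include <cstring>   // std::memcpy
#include <iterator>  // std::istream_iterator
#include <sstream>
#include <string>
#include <vector>

// Split a string on whitespace using istream_iterator, the same
// pattern as in the CLIPTokenizer::encode line from the error.
std::vector<std::string> split_whitespace(const std::string& text) {
    std::istringstream iss(text);
    return std::vector<std::string>{std::istream_iterator<std::string>{iss},
                                    std::istream_iterator<std::string>{}};
}

// Raw byte copy, the same call as in copy_ggml_tensor.
void copy_bytes(void* dst, const void* src, std::size_t n_bytes) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, n_bytes);
}
```

Some standard library versions happen to pull these headers in transitively through other includes, which is one plausible reason the code built on one compiler and not on another.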
Even with q4_0 the results are pretty good! I got this image with the example prompt:
Thanks for the feedback. Following your advice, I've addressed this issue in the latest commit. This compilation error might have occurred due to differences in compiler implementations. I tested with MSVC and GCC and didn't encounter this problem. May I ask which compiler you are using?
I am using gcc (Ubuntu 12.3.0-1ubuntu1~23.04) 12.3.0, which should be the current version of GCC in Ubuntu-latest.
I'm using gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04). Haha, environmental issues can indeed be quite frustrating.
ah yes, a fellow ubuntu20.04 user stuck on lts :rofl:
Cool stuff!
Here is a sample run on M2 Ultra:
$ ▶ ./sd -m ../models/sd-v1-4-ggml-model-f16.bin -p "a lovely cat" -t 12
[INFO] stable-diffusion.cpp:2191 - loading model from '../models/sd-v1-4-ggml-model-f16.bin'
[INFO] stable-diffusion.cpp:2216 - ftype: f16
[INFO] stable-diffusion.cpp:2261 - params ctx size = 1970.08 MB
[INFO] stable-diffusion.cpp:2401 - loading model from '../models/sd-v1-4-ggml-model-f16.bin' completed, taking 0.72s
[INFO] stable-diffusion.cpp:2482 - condition graph use 13.11MB of memory: static 10.17MB, dynamic = 2.93MB
[INFO] stable-diffusion.cpp:2482 - condition graph use 13.11MB of memory: static 10.17MB, dynamic = 2.93MB
[INFO] stable-diffusion.cpp:2824 - get_learned_condition completed, taking 0.12s
[INFO] stable-diffusion.cpp:2832 - start sampling
[INFO] stable-diffusion.cpp:2676 - step 1 sampling completed, taking 5.42s
[INFO] stable-diffusion.cpp:2676 - step 2 sampling completed, taking 5.35s
[INFO] stable-diffusion.cpp:2676 - step 3 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 4 sampling completed, taking 5.35s
[INFO] stable-diffusion.cpp:2676 - step 5 sampling completed, taking 5.30s
[INFO] stable-diffusion.cpp:2676 - step 6 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 7 sampling completed, taking 5.36s
[INFO] stable-diffusion.cpp:2676 - step 8 sampling completed, taking 5.47s
[INFO] stable-diffusion.cpp:2676 - step 9 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 10 sampling completed, taking 5.37s
[INFO] stable-diffusion.cpp:2676 - step 11 sampling completed, taking 5.33s
[INFO] stable-diffusion.cpp:2676 - step 12 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 13 sampling completed, taking 5.33s
[INFO] stable-diffusion.cpp:2676 - step 14 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 15 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 16 sampling completed, taking 5.33s
[INFO] stable-diffusion.cpp:2676 - step 17 sampling completed, taking 5.39s
[INFO] stable-diffusion.cpp:2676 - step 18 sampling completed, taking 5.36s
[INFO] stable-diffusion.cpp:2676 - step 19 sampling completed, taking 5.34s
[INFO] stable-diffusion.cpp:2676 - step 20 sampling completed, taking 5.38s
[INFO] stable-diffusion.cpp:2691 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[INFO] stable-diffusion.cpp:2837 - sampling completed, taking 107.12s
[INFO] stable-diffusion.cpp:2771 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[INFO] stable-diffusion.cpp:2844 - decode_first_stage completed, taking 17.86s
[INFO] stable-diffusion.cpp:2850 - txt2img completed in 125.10s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1969.94MB
save result image to 'output.png'
GGML_PERF:
perf_total_per_op_us[ ADD] = 169.408 ms
perf_total_per_op_us[ MUL] = 154.503 ms
perf_total_per_op_us[ REPEAT] = 308.208 ms
perf_total_per_op_us[ CONCAT] = 8.171 ms
perf_total_per_op_us[ GELU] = 4.251 ms
perf_total_per_op_us[ SILU] = 3.978 ms
perf_total_per_op_us[ NORM] = 41.288 ms
perf_total_per_op_us[ GROUP_NORM] = 24.921 ms
perf_total_per_op_us[ MUL_MAT] = 1258.711 ms
perf_total_per_op_us[ SCALE] = 47.123 ms
perf_total_per_op_us[ CONT] = 130.151 ms
perf_total_per_op_us[ RESHAPE] = 0.970 ms
perf_total_per_op_us[ VIEW] = 0.108 ms
perf_total_per_op_us[ PERMUTE] = 0.235 ms
perf_total_per_op_us[ SOFT_MAX] = 135.226 ms
perf_total_per_op_us[ CONV_2D] = 2795.054 ms
perf_total_per_op_us[ UPSCALE] = 4.307 ms
Looks like CONV_2D needs some work.
- Would be nice to upstream the new ggml operators at some point. Not sure about the "dynamic mode" though
- The "concat" might be possible to achieve via view + cpy
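To put a number on the CONV_2D observation, here is a minimal sketch that computes one op's share of the summed per-op time. `op_share` and `kPerfMs` are hypothetical names; the values are copied from the GGML_PERF listing in the same order, and the sum of per-op totals is only an approximation of wall-clock time.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Fraction of the summed per-op time spent in one op (all values in ms).
double op_share(double op_ms, const std::vector<double>& all_ops_ms) {
    const double total = std::accumulate(all_ops_ms.begin(), all_ops_ms.end(), 0.0);
    assert(total > 0.0);
    return op_ms / total;
}

// Per-op totals from the GGML_PERF output, ADD through UPSCALE.
const std::vector<double> kPerfMs = {
    169.408, 154.503, 308.208, 8.171,   4.251,   3.978,
    41.288,  24.921,  1258.711, 47.123, 130.151, 0.970,
    0.108,   0.235,   135.226, 2795.054, 4.307};
```

`op_share(2795.054, kPerfMs)` comes out at roughly 0.55 and `op_share(1258.711, kPerfMs)` at roughly 0.25, so CONV_2D and MUL_MAT together account for about 80% of the listed time.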
Thank you for the feedback. Thank you for creating such amazing ggml.
- Would be nice to upstream the new ggml operators at some point. Not sure about the "dynamic mode" though
OK, I will sort out the code of new operators and upstream later. I'm also considering whether to upstream the "dynamic mode".
- The "concat" might be possible to achieve via view + cpy
I've tried it before, but it seems that combining view + cpy cannot fulfill the concatenation requirement along dim=1.
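For readers unfamiliar with what concatenation along dim=1 does to the memory layout, here is a plain C++ sketch on flat buffers. It does not use ggml at all; `concat_dim1` is a hypothetical helper, and it assumes a ggml-style layout where dim 0 varies fastest and the tensors have shape [ne0, ne1, ne2].

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Concatenate a [ne0, a_ne1, ne2] and b [ne0, b_ne1, ne2] along dim 1.
// Layout assumption: dim 0 is the fastest-varying index.
std::vector<float> concat_dim1(const std::vector<float>& a, std::size_t a_ne1,
                               const std::vector<float>& b, std::size_t b_ne1,
                               std::size_t ne0, std::size_t ne2) {
    assert(a.size() == ne0 * a_ne1 * ne2);
    assert(b.size() == ne0 * b_ne1 * ne2);
    const std::size_t a_slice = ne0 * a_ne1;  // elements of a per dim-2 slice
    const std::size_t b_slice = ne0 * b_ne1;  // elements of b per dim-2 slice
    std::vector<float> out;
    out.reserve(a.size() + b.size());
    for (std::size_t i2 = 0; i2 < ne2; ++i2) {
        // Each output slice is a's rows followed by b's rows.
        out.insert(out.end(), a.begin() + i2 * a_slice, a.begin() + (i2 + 1) * a_slice);
        out.insert(out.end(), b.begin() + i2 * b_slice, b.begin() + (i2 + 1) * b_slice);
    }
    return out;
}
```

The point of the sketch is that each input ends up interleaved slice by slice in the output instead of landing in one contiguous block. Whether that is the exact reason view + cpy fell short in ggml is a guess; the thread does not say.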
Any plans for sdxl?
I'm willing to implement SDXL once I've improved the support for SD 1.x and added support for SD 2.x.
Took a stab at a larger resolution 768x768
$ ./sd -t 12 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v -H 768 -W 768
unsurprisingly it takes way (way) longer:
[INFO] stable-diffusion.cpp:2989 - step 1 sampling completed, taking 40.92s
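A rough way to sanity-check that step time is to compare it with the growth in pixel count. A minimal sketch; `area_ratio` is a made-up helper, and the comparison assumes the earlier q8_0 runs used 512x512, which the commands do not state explicitly.

```cpp
#include <cassert>

// Ratio of pixel counts between a new and an old image size.
double area_ratio(int w_new, int h_new, int w_old, int h_old) {
    assert(w_new > 0 && h_new > 0 && w_old > 0 && h_old > 0);
    return (static_cast<double>(w_new) * h_new) /
           (static_cast<double>(w_old) * h_old);
}
```

`area_ratio(768, 768, 512, 512)` is 2.25, while the step time went from 16.26s (q8_0 at -t 12) to 40.92s, a factor of about 2.5. So the cost grows somewhat faster than the pixel count, which would be consistent with some layers scaling worse than linearly, though nothing in this thread measures that directly.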
Wow, this is so cool. Easy to convert existing models, quantization.. very nice.
https://github.com/bes-dev/stable_diffusion.openvino <- this is way faster though, probably due to it using OpenVINO.
- my little benchmark suggests the bottleneck is not the model file, but the dynamic data. What number type do you use for it? llama.cpp has shown little to no degradation in quality when using f16 instead of f32 for the kv-cache.
I've implemented a memory optimization, and now when using txt2img with fp16 precision to generate a 512x512 image, it only requires 2.3GB.
Wow, this is so cool. Easy to convert existing models, quantization.. very nice.
https://github.com/bes-dev/stable_diffusion.openvino <- this is way faster though, probably due to it using OpenVINO.
Oh, yeah. Now I'm working hard to make it run faster.
I've implemented a memory optimization, and now when using txt2img with fp16 precision to generate a 512x512 image, it only requires 2.3GB.
is this already on master? bc i reran my diffusion above with similar timings and memory usage (? memory reporting changed)
$ ./sd -t 12 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v -H 768 -W 768
Since you are generating 768x768 images, this will cause the runtime memory to grow, and there is still room for optimization
@leejet i dont think that is how that label is supposed to be used :smile:
You're right, I made a mistake. I accidentally clicked on it while browsing, it wasn't my intention.
Found this repo thanks to HN (hackernews). Had 0 issues when trying out this for the first time yesterday.
Just to share along, I've added 2 outputs of v1-5 in f16 and q4_1. This is coming from my MBP 16" (2021/M1PRO/16GB/512GB).
v1-5-pruned-emaonly-ggml-model-f16.bin
> ./sd -m v1-5-pruned-emaonly-ggml-model-f16.bin -p "a lovely cat"
[INFO] stable-diffusion.cpp:2525 - loading model from 'v1-5-pruned-emaonly-ggml-model-f16.bin'
[INFO] stable-diffusion.cpp:2550 - ftype: f16
[INFO] stable-diffusion.cpp:2779 - total params size = 1969.97MB (clip 235.01MB, unet 1640.45MB, vae 94.51MB)
[INFO] stable-diffusion.cpp:2781 - loading model from 'v1-5-pruned-emaonly-ggml-model-f16.bin' completed, taking 1.85s
[INFO] stable-diffusion.cpp:2873 - condition graph use 248.13MB of memory: params 235.01MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO] stable-diffusion.cpp:2873 - condition graph use 248.13MB of memory: params 235.01MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO] stable-diffusion.cpp:3359 - get_learned_condition completed, taking 0.19s
[INFO] stable-diffusion.cpp:3375 - start sampling
[INFO] stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.74s
[INFO] stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.11s
[INFO] stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.33s
[INFO] stable-diffusion.cpp:3067 - step 4 sampling completed, taking 9.37s
[INFO] stable-diffusion.cpp:3067 - step 5 sampling completed, taking 9.52s
[INFO] stable-diffusion.cpp:3067 - step 6 sampling completed, taking 8.95s
[INFO] stable-diffusion.cpp:3067 - step 7 sampling completed, taking 9.90s
[INFO] stable-diffusion.cpp:3067 - step 8 sampling completed, taking 9.54s
[INFO] stable-diffusion.cpp:3067 - step 9 sampling completed, taking 8.95s
[INFO] stable-diffusion.cpp:3067 - step 10 sampling completed, taking 9.21s
[INFO] stable-diffusion.cpp:3067 - step 11 sampling completed, taking 9.00s
[INFO] stable-diffusion.cpp:3067 - step 12 sampling completed, taking 9.49s
[INFO] stable-diffusion.cpp:3067 - step 13 sampling completed, taking 9.43s
[INFO] stable-diffusion.cpp:3067 - step 14 sampling completed, taking 9.38s
[INFO] stable-diffusion.cpp:3067 - step 15 sampling completed, taking 9.16s
[INFO] stable-diffusion.cpp:3067 - step 16 sampling completed, taking 9.01s
[INFO] stable-diffusion.cpp:3067 - step 17 sampling completed, taking 8.92s
[INFO] stable-diffusion.cpp:3067 - step 18 sampling completed, taking 9.44s
[INFO] stable-diffusion.cpp:3067 - step 19 sampling completed, taking 9.68s
[INFO] stable-diffusion.cpp:3067 - step 20 sampling completed, taking 9.53s
[INFO] stable-diffusion.cpp:3094 - diffusion graph use 2264.22MB of memory: params 1640.45MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO] stable-diffusion.cpp:3380 - sampling completed, taking 186.68s
[INFO] stable-diffusion.cpp:3303 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO] stable-diffusion.cpp:3393 - decode_first_stage completed, taking 28.31s
[INFO] stable-diffusion.cpp:3407 - txt2img completed in 215.18s, use 2358.73MB of memory: peak params memory 1969.97MB, peak runtime memory 2177.12MB
save result image to 'output.png'
v1-5-pruned-emaonly-ggml-model-q4_1.bin
> ./sd -m v1-5-pruned-emaonly-ggml-model-q4_1.bin -p "a lovely cat"
[INFO] stable-diffusion.cpp:2525 - loading model from 'v1-5-pruned-emaonly-ggml-model-q4_1.bin'
[INFO] stable-diffusion.cpp:2550 - ftype: q4_1
[INFO] stable-diffusion.cpp:2779 - total params size = 1454.64MB (clip 73.80MB, unet 1286.34MB, vae 94.51MB)
[INFO] stable-diffusion.cpp:2781 - loading model from 'v1-5-pruned-emaonly-ggml-model-q4_1.bin' completed, taking 1.38s
[INFO] stable-diffusion.cpp:2873 - condition graph use 86.92MB of memory: params 73.80MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO] stable-diffusion.cpp:2873 - condition graph use 86.92MB of memory: params 73.80MB, runtime 13.12MB (static 10.19MB, dynamic 2.93MB)
[INFO] stable-diffusion.cpp:3359 - get_learned_condition completed, taking 0.23s
[INFO] stable-diffusion.cpp:3375 - start sampling
[INFO] stable-diffusion.cpp:3067 - step 1 sampling completed, taking 9.72s
[INFO] stable-diffusion.cpp:3067 - step 2 sampling completed, taking 9.11s
[INFO] stable-diffusion.cpp:3067 - step 3 sampling completed, taking 9.12s
[INFO] stable-diffusion.cpp:3067 - step 4 sampling completed, taking 10.69s
[INFO] stable-diffusion.cpp:3067 - step 5 sampling completed, taking 9.75s
[INFO] stable-diffusion.cpp:3067 - step 6 sampling completed, taking 9.51s
[INFO] stable-diffusion.cpp:3067 - step 7 sampling completed, taking 9.36s
[INFO] stable-diffusion.cpp:3067 - step 8 sampling completed, taking 9.35s
[INFO] stable-diffusion.cpp:3067 - step 9 sampling completed, taking 9.66s
[INFO] stable-diffusion.cpp:3067 - step 10 sampling completed, taking 9.52s
[INFO] stable-diffusion.cpp:3067 - step 11 sampling completed, taking 9.36s
[INFO] stable-diffusion.cpp:3067 - step 12 sampling completed, taking 9.26s
[INFO] stable-diffusion.cpp:3067 - step 13 sampling completed, taking 9.56s
[INFO] stable-diffusion.cpp:3067 - step 14 sampling completed, taking 9.56s
[INFO] stable-diffusion.cpp:3067 - step 15 sampling completed, taking 9.38s
[INFO] stable-diffusion.cpp:3067 - step 16 sampling completed, taking 9.39s
[INFO] stable-diffusion.cpp:3067 - step 17 sampling completed, taking 10.35s
[INFO] stable-diffusion.cpp:3067 - step 18 sampling completed, taking 9.48s
[INFO] stable-diffusion.cpp:3067 - step 19 sampling completed, taking 9.48s
[INFO] stable-diffusion.cpp:3067 - step 20 sampling completed, taking 9.46s
[INFO] stable-diffusion.cpp:3094 - diffusion graph use 1910.11MB of memory: params 1286.34MB, runtime 623.77MB (static 69.56MB, dynamic 554.21MB)
[INFO] stable-diffusion.cpp:3380 - sampling completed, taking 191.08s
[INFO] stable-diffusion.cpp:3303 - vae graph use 2271.63MB of memory: params 94.51MB, runtime 2177.12MB (static 1153.12MB, dynamic 1024.00MB)
[INFO] stable-diffusion.cpp:3393 - decode_first_stage completed, taking 27.91s
[INFO] stable-diffusion.cpp:3407 - txt2img completed in 219.22s, use 2271.63MB of memory: peak params memory 1454.64MB, peak runtime memory 2177.12MB
save result image to 'output.png'
Any chance we could get OpenVINO support? Would help a lot!