Closed (n00mkrad closed this issue 1 year ago)
See my tests here https://github.com/leejet/stable-diffusion.cpp/issues/6
Yes, I'm trying to modify GGML to make it run faster. Could you add the -v parameter to print out your System Info and Options so I can take a look?
Option:
n_threads: 32
mode: txt2img
model_path: models/sd-1.5-ggml-model-q4_1.bin
output_path: output.png
init_img:
prompt: photo of a lovely cat, high quality
negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
cfg_scale: 7.50
width: 512
height: 512
sample_method: eular a
sample_steps: 20
strength: 0.75
seed: 1
System Info:
BLAS = 0
SSE3 = 0
AVX = 0
AVX2 = 0
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 0
NEON = 0
ARM_FMA = 0
F16C = 0
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[INFO] stable-diffusion.cpp:2500 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO] stable-diffusion.cpp:2525 - ftype: q4_1
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO] stable-diffusion.cpp:2570 - params ctx size = 1454.75 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size = 1454.34MB
[INFO] stable-diffusion.cpp:2715 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin' completed, taking 1.03s
[DEBUG] stable-diffusion.cpp:353 - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.72s
[INFO] stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353 - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.37s
[INFO] stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:3243 - get_learned_condition completed, taking 1.10s
[INFO] stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2848 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO] stable-diffusion.cpp:2989 - step 1 sampling completed, taking 42.43s
[DEBUG] stable-diffusion.cpp:2993 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
System Info: BLAS = 0 SSE3 = 0 AVX = 0 AVX2 = 0 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 FMA = 0 NEON = 0 ARM_FMA = 0 F16C = 0 FP16_VA = 0
How did you build your sd? Some of the features here should be enabled on almost any platform (AVX2 is available on almost all x86 CPUs out there).
Installed cmake and ran the commands from the readme.
I'm trying it again right now after installing CUDA and using cmake .. -DGGML_CUBLAS=ON.
Also, your number of threads seems excessive; try reducing it to match the physical core count.
The default only gave me around 60% utilization. But yeah I think 32 is too much. Didn't impact performance either way though.
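A rough way to pick a starting thread count near the physical core count is sketched below. The function name `suggest_thread_count` and the `smt_ways` parameter are hypothetical, and the sketch assumes each physical core exposes two logical CPUs (2-way SMT/Hyper-Threading), which is not true of every machine; Python's standard library only reports logical CPUs.

```python
import os

def suggest_thread_count(logical_cpus=None, smt_ways=2):
    """Estimate a thread count close to the number of physical cores.

    Assumption (not verified at runtime): every physical core exposes
    `smt_ways` logical CPUs. Adjust smt_ways=1 for CPUs without SMT.
    """
    if logical_cpus is None:
        # os.cpu_count() returns the number of logical CPUs, or None if unknown.
        logical_cpus = os.cpu_count() or 1
    return max(1, logical_cpus // smt_ways)

# Estimate for the machine running this script.
print(suggest_thread_count())
```

The result is only a starting point for the -t option; as noted above, the thread count made little difference to performance in this particular case.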
My compile log:
MSBuild version 17.6.3+07e294721 for .NET Framework
1>Checking Build System
Building Custom Rule stable-diffusion.cpp/ggml/src/CMakeLists.txt
Compiling CUDA source file ..\..\..\ggml\src\ggml-cuda.cu...
stable-diffusion.cpp\build\ggml\src>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe" --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu -I"stable-diffusion.cpp\ggml\src\." -I"stable-diffusion.cpp\ggml\src\..\include" -I"stable-diffusion.cpp\ggml\src\..\include\ggml" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xcompiler="/EHsc -Ob2" -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W3 /nologo /O2 /Fdstable-diffusion.cpp\build\ggml\src\Release\ggml.pdb /FS /MD /GR" -o ggml.dir\Release\ggml-cuda.obj "stable-diffusion.cpp\ggml\src\ggml-cuda.cu"
ggml-cuda.cu
cl : command line warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
ggml.c
cl : command line warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
ggml.vcxproj -> stable-diffusion.cpp\build\ggml\src\Release\ggml.lib
Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
stable-diffusion.cpp
stable-diffusion.vcxproj -> stable-diffusion.cpp\build\Release\stable-diffusion.lib
Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
main.cpp
stable-diffusion.cpp\stb_image_write.h(776,13): warning C4996: 'sprintf': This function or variable may be unsafe. Consider using sprintf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [stable-diffusion.cpp\build\sd.vcxproj]
sd.vcxproj -> stable-diffusion.cpp\build\Release\sd.exe
Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
It's at 30 seconds per sampling step now. I wonder about this part:
cl : command line warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
ggml.c
cl : command line warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
Does that imply it failed to enable AVX/AVX2 stuff?
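One way to see exactly which options the compiler dropped is to scan the build log for D9002 warnings. The sketch below is a hypothetical helper (the name `ignored_options` is made up for illustration), not part of the project; it only collects the option names and says nothing about why they were ignored.

```python
import re

# Matches MSVC's D9002 warning and captures the option it ignored.
D9002 = re.compile(r"warning D9002\s*:\s*ignoring unknown option '([^']+)'")

def ignored_options(build_log):
    """Return the options reported as ignored, in first-seen order, no repeats."""
    seen = []
    for match in D9002.finditer(build_log):
        option = match.group(1)
        if option not in seen:
            seen.append(option)
    return seen

# The four warning lines from the compile log above.
sample_log = "\n".join([
    r"cl : command line warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]",
    "ggml.c",
    r"cl : command line warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]",
    r"cl : command line warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]",
    r"cl : command line warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]",
])

print(ignored_options(sample_log))
```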
No, I think that's the CUDA compiler.
Or maybe not? Hm, what is your platform, and what platform are you building for?
I ran it with my build and set the threads to 10 (I have 12 physical cores):
$ ./sd -t 10 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v
Option:
n_threads: 10
mode: txt2img
model_path: ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin
output_path: output.png
init_img:
prompt: photo of a lovely cat, high quality
negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
cfg_scale: 7.00
width: 512
height: 512
sample_method: eular a
sample_steps: 20
strength: 0.75
seed: 42
System Info:
BLAS = 0
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[INFO] stable-diffusion.cpp:2500 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO] stable-diffusion.cpp:2525 - ftype: q8_0
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO] stable-diffusion.cpp:2570 - params ctx size = 1618.72 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size = 1618.31MB
[INFO] stable-diffusion.cpp:2715 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.44s
[DEBUG] stable-diffusion.cpp:353 - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO] stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353 - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO] stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:3243 - get_learned_condition completed, taking 0.10s
[INFO] stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2846 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO] stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 2 sampling completed, taking 15.68s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 3 sampling completed, taking 15.83s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 4 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 5 sampling completed, taking 15.93s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 6 sampling completed, taking 15.79s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 7 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 8 sampling completed, taking 15.66s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 9 sampling completed, taking 15.71s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 10 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 11 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 12 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 13 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 14 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 15 sampling completed, taking 15.84s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 16 sampling completed, taking 16.07s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 17 sampling completed, taking 15.88s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 18 sampling completed, taking 15.98s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 19 sampling completed, taking 15.89s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:2989 - step 20 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:3001 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:3005 - 65536 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:3258 - sampling completed, taking 316.81s
[DEBUG] stable-diffusion.cpp:3153 - vae context need 1153.12MB static memory, with work_size needing 1152.00MB
[DEBUG] stable-diffusion.cpp:3179 - computing vae graph completed, taking 50.49s
[INFO] stable-diffusion.cpp:3188 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[DEBUG] stable-diffusion.cpp:3192 - 3145728 bytes of dynamic memory has not been released yet
[INFO] stable-diffusion.cpp:3265 - decode_first_stage completed, taking 50.53s
[INFO] stable-diffusion.cpp:3266 - txt2img completed in 367.45s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'
Edit: also, I used q8_0 instead of q4_1.
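The per-step times in the log above add up to roughly 316.8 s, which matches the reported sampling total of 316.81s. A small sketch for pulling those numbers out of a verbose log follows; the helper name `step_times` is hypothetical, and the regular expression assumes the exact "step N sampling completed, taking X.XXs" wording shown in this thread.

```python
import re

# Assumes the log wording used in this thread's output.
STEP_RE = re.compile(r"step (\d+) sampling completed, taking ([\d.]+)s")

def step_times(log_text):
    """Extract per-step sampling times (in seconds) from a verbose sd log."""
    return [float(seconds) for _, seconds in STEP_RE.findall(log_text)]

# First three steps copied from the log above.
sample = "\n".join([
    "[INFO] stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s",
    "[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet",
    "[INFO] stable-diffusion.cpp:2989 - step 2 sampling completed, taking 15.68s",
    "[INFO] stable-diffusion.cpp:2989 - step 3 sampling completed, taking 15.83s",
])

times = step_times(sample)
mean = sum(times) / len(times)
print(f"{len(times)} steps, mean {mean:.2f}s per step")
```

Comparing the mean per-step time between two builds is usually more informative than comparing a single step, since individual steps vary slightly.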
Well this part is definitely off:
System Info:
BLAS = 1
SSE3 = 0
AVX = 0
AVX2 = 0
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 0
NEON = 0
ARM_FMA = 0
F16C = 0
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
I assume the lack of AVX is a compiler issue. But no idea how to fix that, it seems to be up to date.
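To compare System Info blocks between builds, the flags can be parsed into a dictionary. This is a minimal sketch with hypothetical helper names; the list of "expected" x86 features is an assumption taken from the five flags that the working build earlier in the thread reports as 1, not an official requirement.

```python
import re

def parse_system_info(text):
    """Parse 'NAME = 0' / 'NAME = 1' pairs from sd's System Info output."""
    pairs = re.findall(r"(\w+)\s*=\s*([01])\b", text)
    return {name: value == "1" for name, value in pairs}

# Assumption: flags the working x86 build in this thread reports as enabled.
X86_SIMD = ("SSE3", "AVX", "AVX2", "FMA", "F16C")

def missing_x86_features(info):
    """List the assumed-expected x86 flags that are reported as disabled."""
    return [name for name in X86_SIMD if not info.get(name, False)]

# Values copied from the System Info block above (abbreviated to one line).
sample = "BLAS = 1 SSE3 = 0 AVX = 0 AVX2 = 0 AVX512 = 0 FMA = 0 NEON = 0 F16C = 0"
info = parse_system_info(sample)
print(missing_x86_features(info))
```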
OH, tell me more about your build environment/process, because it is trying to enable gcc/clang avx2 (-mavx2) on an msvc compiler.
If you poke around in your build directory, you should find a CMakeCache.txt; inside there you can add /arch:AVX2 to CMAKE_CXX_FLAGS:STRING= and CMAKE_C_FLAGS:STRING=
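The cache edit described above could be scripted roughly as follows. This is a sketch, not a tested fix: the function `add_flag_to_cache` is hypothetical, it works on the text of CMakeCache.txt rather than touching any file, and the sample cache values are illustrative rather than copied from a real build.

```python
def add_flag_to_cache(cache_text, flag="/arch:AVX2",
                      keys=("CMAKE_C_FLAGS", "CMAKE_CXX_FLAGS")):
    """Append `flag` to the given KEY:STRING= entries of a CMakeCache.txt body.

    Entries that already contain the flag are left unchanged, so the
    function can be applied more than once without duplicating it.
    """
    out = []
    for line in cache_text.splitlines():
        for key in keys:
            prefix = f"{key}:STRING="
            if line.startswith(prefix) and flag not in line[len(prefix):].split():
                # An empty value gets the flag directly; otherwise append after a space.
                line = prefix + flag if line == prefix else line + " " + flag
        out.append(line)
    return "\n".join(out) + "\n"

# Illustrative cache excerpt (values are assumptions, not from the thread).
sample = (
    "CMAKE_CXX_FLAGS:STRING=/DWIN32 /D_WINDOWS\n"
    "CMAKE_C_FLAGS:STRING=\n"
    "CMAKE_C_FLAGS_RELEASE:STRING=/O2\n"
)
print(add_flag_to_cache(sample))
```

Only the exact CMAKE_C_FLAGS and CMAKE_CXX_FLAGS entries are touched; per-configuration entries such as CMAKE_C_FLAGS_RELEASE are left alone.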
Windows 10 22H2, VS 2022 with Build Tools installed, CUDA Toolkit 11.8 installed, cmake installed using their setup.
That didn't seem to change anything. I ran cmake --build . --config Release again and got the same result.
Very funky. @leejet, I will probably make a PR later with improved CMake (by copying from llama.cpp).
The latest GGML code has already fixed this issue. I will rebase my code onto the latest GGML code.
@n00mkrad the issue has been fixed. You can pull the latest code and give it a try. Don't forget to update the submodule as well.
git pull origin master
git submodule update
Works. Still very slow, but I guess that's expected. About 7 sec per step with CuBLAS, 30 sec without.
Running the line from the readme, I get this:
step 1 sampling completed, taking 50.97s
Compiled with cmake on Windows. Shouldn't it be a little bit faster?