leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License

Add vulkan backend #291

Closed: sohzm closed this PR 1 month ago

sohzm commented 3 months ago

Issue: https://github.com/leejet/stable-diffusion.cpp/issues/256

Looks like they're making some changes to Vulkan shader generation in the ggml repo, and it's currently broken. I'll keep an eye on it and update the PR accordingly.

sohzm commented 3 months ago

Related issue: https://github.com/ggerganov/llama.cpp/issues/5356

(I'm new to this, so I might have made some mistakes. I'd be grateful for any guidance or feedback.)

0cc4m commented 3 months ago

Hey, nice to see someone working on this. I'd like to get this to work. There are probably some ops that need to be supported by Vulkan upstream, right? I can help with that.

sohzm commented 3 months ago

@0cc4m Thanks for offering to help.

Currently the hpp file generated by ggml_vk_generate_shaders.py does not define symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.

Some symbols were also renamed, e.g. dequant_q5_k_len is referenced in ggml/src/ggml-vulkan.cpp but the header file has dequant_q5_K_len.

I'm assuming these issues will be solved by your work in llama.cpp? Please correct me if I'm wrong.

Also let me know if I can help with anything.

0cc4m commented 3 months ago

> @0cc4m Thanks for offering to help.
>
> Currently the hpp file generated by ggml_vk_generate_shaders.py does not define symbols like mul_mat_vec_id_q3_k_f32_len, div_f32_len, etc.
>
> Some symbols were also renamed, e.g. dequant_q5_k_len is referenced in ggml/src/ggml-vulkan.cpp but the header file has dequant_q5_K_len.
>
> I'm assuming these issues will be solved by your work in llama.cpp? Please correct me if I'm wrong.
>
> Also let me know if I can help with anything.

It is working in llama.cpp. I'll take a look at the status in ggml; maybe that needs an update.

Cloudwalk9 commented 2 months ago

I manually wired up Vulkan and compiled sd.cpp against the latest ggml, patched with llama.cpp's Vulkan changes. It runs and loads a model, but their Vulkan shaders do not implement CONCAT, so it fails.

./sd -m ~/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors --prompt "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash" -W 1024 -H 1024 -v
Option: 
    n_threads:         8
    mode:              txt2img
    model_path:        /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
    wtype:             unspecified
    vae_path:          
    taesd_path:        
    esrgan_path:       
    controlnet_path:   
    embeddings_path:   
    stacked_id_embeddings_path:   
    input_id_images_path:   
    style ratio:       20.00
    normzalize input image :  false
    output_path:       output.png
    init_img:          
    control_image:     
    clip on cpu:       false
    controlnet cpu:    false
    vae decoder on cpu:false
    strength(control): 0.90
    prompt:            score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash
    negative_prompt:   
    min_cfg:           1.00
    cfg_scale:         7.00
    clip_skip:         -1
    width:             1024
    height:            1024
    sample_method:     euler_a
    schedule:          default
    sample_steps:      20
    strength(img2img): 0.75
    rng:               cuda
    seed:              42
    batch_count:       1
    vae_tiling:        false
    upscale_repeats:   1
System Info: 
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 1
    AVX512_VBMI = 1
    AVX512_VNNI = 1
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:158  - Using Vulkan backend
ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: NVIDIA RTX A4000 Laptop GPU (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32
[INFO ] stable-diffusion.cpp:178  - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] model.cpp:737  - load /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors using safetensors format
[DEBUG] model.cpp:803  - init from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors'
[INFO ] stable-diffusion.cpp:201  - Stable Diffusion XL 
[INFO ] stable-diffusion.cpp:207  - Stable Diffusion weight type: f16
[DEBUG] stable-diffusion.cpp:208  - ggml tensor size = 400 bytes
[WARN ] stable-diffusion.cpp:213  - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[DEBUG] ggml_extend.hpp:884  - clip params backend buffer size =  1564.36 MB(VRAM) (713 tensors)
[DEBUG] ggml_extend.hpp:884  - unet params backend buffer size =  4900.07 MB(VRAM) (1680 tensors)
[DEBUG] ggml_extend.hpp:884  - vae params backend buffer size =  94.47 MB(VRAM) (140 tensors)
[DEBUG] stable-diffusion.cpp:309  - loading vocab
[DEBUG] clip.hpp:164  - vocab size: 49408
[DEBUG] clip.hpp:175  -  trigger word img already in vocab
[DEBUG] stable-diffusion.cpp:329  - loading weights
[DEBUG] model.cpp:1380 - loading tensors from /home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors
[INFO ] stable-diffusion.cpp:413  - total params memory size = 6558.89MB (VRAM 6558.89MB, RAM 0.00MB): clip 1564.36MB(VRAM), unet 4900.07MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:432  - loading model from '/home/david/Desktop/Misc/stable_diffusion/a1111/models/Stable-diffusion/ponyDiffusionV6XL_v6.safetensors' completed, taking 4.34s
[INFO ] stable-diffusion.cpp:449  - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:482  - finished loaded file
[DEBUG] stable-diffusion.cpp:1452 - txt2img 1024x1024
[DEBUG] stable-diffusion.cpp:1207 - prompt after extract and remove lora: "score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash"
[INFO ] stable-diffusion.cpp:565  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1212 - apply_loras completed, taking 0.00s
[DEBUG] clip.hpp:1312 - parse 'score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash' to [['score_9, score_8_up, score_7_up, score_6_up, score_5_up, rainbow dash', 1], ]
[DEBUG] clip.hpp:1152 - token length: 77
[DEBUG] ggml_extend.hpp:838  - clip compute buffer size: 2.56 MB(VRAM)
ggml_vulkan: Error: Missing op: CONCAT
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:5533: false
Aborted (core dumped)

Cloudwalk9 commented 2 months ago

After adding CONCAT in the relevant place (probably not the proper fix), it gets a little further but still fails here:

ggml_backend_vk_graph_compute: error: op not supported  (view) (UNARY)
GGML_ASSERT: /home/david/Desktop/Dev/ggml/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:6227: ok

At this point it's beyond my knowledge/skill.

0cc4m commented 2 months ago

@Cloudwalk9 Thank you for trying it, I can add the missing ops. Can you upload your progress to a branch that I can access?

Cloudwalk9 commented 2 months ago

@0cc4m Done, but it's pretty crude. I updated the submodule to point to my fork of ggml with the imported Vulkan code, and I also had to fix some headers. https://github.com/Cloudwalk9/stable-diffusion.cpp

Cloudwalk9 commented 2 months ago

@0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.

0cc4m commented 2 months ago

> @0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.

Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops

I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.

SkutteOleg commented 2 months ago

> > @0cc4m They just synced the newer Vulkan shader code (split into individual files) from llama.cpp to upstream ggml, so you could probably target ggml directly, instead of my forked submodule.
>
> Yeah, my WIP branch is here: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops
>
> I implemented all the ops, but there's still some bug that makes the image not adhere to the prompt. I'll investigate that later.

Great work, thank you!

Some ops appear to still be missing when I try to use LoRA (res-adapter):

lora.hpp:67   - finished loaded lora
lora.hpp:175  - (18 / 18) LoRA tensors applied successfully
ggml_extend.hpp:841  - lora compute buffer size: 112.85 MB(VRAM)
lora.hpp:175  - (18 / 18) LoRA tensors applied successfully
ggml_vulkan: Error: Missing op: ADD for f16 and f32 to f16
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:4149: fatal error

A different error occurs when I try to use TAESD:

stable-diffusion.cpp:1398 - generating 1 latent images completed, taking 46.07s
stable-diffusion.cpp:1401 - decoding 1 latents
ggml_extend.hpp:841  - taesd compute buffer size: 480.00 MB(VRAM)
ggml_backend_vk_graph_compute: error: op not supported  (view) (UNARY)
D:\a\stable-diffusion.cpp\stable-diffusion.cpp\ggml\src\ggml-vulkan.cpp:6432: GGML_ASSERT(ok) failed

Cloudwalk9 commented 2 months ago

We're finally about to see Stable Diffusion where the only major dependency is your graphics driver...

0cc4m commented 2 months ago

@SkutteOleg Thank you, those should be easy to add. I fixed the first bug that caused issues, but I ran into another matmul bug that I have to find in the shader code. I hope I can find it soon.

0cc4m commented 2 months ago

LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

SkutteOleg commented 2 months ago

> LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure).

0cc4m commented 2 months ago

> > LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.
>
> It is amazing, actually. It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure).

On which hardware?

SkutteOleg commented 2 months ago

> On which hardware?

NVIDIA GeForce GTX 1660 SUPER

EDIT: Also confirmed working reasonably fast on Steam Deck.

SkutteOleg commented 2 months ago

> It's 2.5 times faster than CUDA12 on my end 😲 (perhaps due to lower memory usage, but I'm not sure).

I had time to do some further testing. Apparently I was comparing against an older build of sd.cpp; CUDA12 image generation also got faster after the ggml update. Even so, Vulkan is 20% faster. However, I was wrong about memory: Vulkan appears to use more of it, as I can no longer fit both llama.cpp and stable-diffusion.cpp on the GPU at the same time.

UPD: I was testing at 512x512 before. At 1024x1024, Vulkan is actually 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.

maxargy commented 2 months ago

> LORA and TAESD should work now. I also fixed the matmul bug. It's generating images correctly in my tests, but not that fast yet.

Excellent work, it works fine for me. Tested with an Intel Arc A580.

0cc4m commented 2 months ago

> UPD: I was testing at 512x512 before. At 1024x1024, Vulkan is actually 15% slower for me. Also, at 1024x1024 it produces broken outputs on my hardware.

This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
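To give a rough sense of scale (an illustrative back-of-the-envelope figure; the 512-channel count here is an assumption for illustration, not sd.cpp's exact allocation): a single f32 activation map at 1024x1024 resolution with 512 channels is 1024 * 1024 * 512 * 4 bytes = 2 GiB, so a compute buffer holding just a couple of such tensors already exceeds a 4 GB per-buffer limit.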

Green-Sky commented 2 months ago

There should be VAE tiling available, or a fallback to CPU (not exposed as a CLI option, AFAIK).

SkutteOleg commented 2 months ago

> This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.

Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.

JohnArlow commented 2 months ago

Excellent work, well done. Pictures are generated at 384x384 on my Intel i5-1035G1.

JohnArlow commented 2 months ago

Using the --vae-on-cpu option, it will do 512x512 images. I don't understand why the VAE should be such a problem; the compute buffer size is 1.6 GB in RAM.

offbeat-stuff commented 2 months ago

Tried the Vulkan repo from SkutteOleg:

Vulkan sd.cpp -> 2.12 it/s
CUDA sd.cpp -> 3.95 it/s
ComfyUI -> 1.27 it/s

NVIDIA GTX 1650 Ti mobile, Fedora 40

Nearly identical images, though why are some patches different between CUDA and Vulkan?

0cc4m commented 2 months ago

> > This is a problem with a very large buffer that sd.cpp requests for VAE decoding (?). I cannot fix that on the Vulkan side, but I am throwing an exception now so that it crashes instead of just generating garbage output. Maybe @leejet can think of a solution? Vulkan has a restriction on how large VRAM buffers can be (usually 4GB), and 1024x1024 VAE decoding requests a buffer larger than that.
>
> Shouldn't VAE tiling help with that? This occurs for me even with VAE tiling enabled.

It should, and it does in my tests. I can generate 1024x1024 images with SDXL by using --vae-tiling or --vae-on-cpu.
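For example (model filename and prompt illustrative, using the flags named above):

./sd -m sd_xl_base_1.0.safetensors -p "a photo of a cat" -W 1024 -H 1024 --vae-tiling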

> Why are some patches different between CUDA and Vulkan?

There are slight differences in how the CUDA and Vulkan backends calculate. For example, the CUDA backend uses tensor cores for matrix multiplication, while the Vulkan backend (on Nvidia GPUs) uses the regular CUDA cores. That can change the results slightly. There might also be some minor differences in other operations that contribute to that, too.

maxargy commented 2 months ago

I tried the img2img mode, but it immediately raises an error: ggml_vulkan: Error: Missing op: PAD

0cc4m commented 2 months ago

> I tried the img2img mode, but it immediately raises an error: ggml_vulkan: Error: Missing op: PAD

Thank you for reporting that, I forgot to check img2img. It should work now.

daniandtheweb commented 2 months ago

When trying to load any embedding, I get this missing Vulkan operator:

ggml_vulkan: Error: Missing op: CONCAT for f16 and f16 to f16

0cc4m commented 2 months ago

> When trying to load any embedding, I get this missing Vulkan operator:
>
> ggml_vulkan: Error: Missing op: CONCAT for f16 and f16 to f16

I can implement that, but it's odd considering that f16 CONCAT is not even implemented for CPU or CUDA. Do embeddings work with those?

daniandtheweb commented 2 months ago

At least with CPU it doesn't work; I'm unable to test CUDA.

0cc4m commented 2 months ago

> At least with CPU it doesn't work; I'm unable to test CUDA.

You can try to build with this branch: https://github.com/0cc4m/ggml/tree/vulkan-stable-diffusion-ops-concat-f16

daniandtheweb commented 2 months ago

I've cherry-picked the changes from your branch and embeddings now work fine. I'm not sure why the program crashed on CPU.

Cloudwalk9 commented 2 months ago

1.5 it/s on an SD 1.5 model at 512x512, Euler A, 20 steps. Performance drops quadratically with resolution. I get 7 it/s in ComfyUI with the same settings.

BUT the CUDA backend here gets 3 it/s with the same settings, and its performance also tanks with resolution in the same way. RTX A4000 Mobile, which is roughly equivalent to a desktop 3060 Ti.

There's still a LOT of room for optimization.

Also, I found a bug I intend to report here and upstream: one of my models has weights interpreted as "F64" (?!). It works in standard web UIs but not in ggml/sd.cpp, regardless of the --type setting.

daniandtheweb commented 2 months ago

@0cc4m, sorry to bother you again, but I've found another missing op, this time related to upscaling:

upscaler.cpp:49   - upscaling from (512 x 512) to (2048 x 2048)
Errors: ggml_vulkan: Found 1 Vulkan devices:
Vulkan0: AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64
ggml_vulkan: Error: Missing op: LEAKY_RELU

0cc4m commented 1 month ago

> Sorry to bother you again, but I've found another missing op, this time related to upscaling:

No problem, I added LEAKY_RELU. Please try it.

daniandtheweb commented 1 month ago

> No problem, I added LEAKY_RELU. Please try it.

Upscaling works fine now. However, I've also just found out that quantized models don't seem to work on my PC:

Vulkan0: AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64
/home/daniandtheweb/Applications/stable-diffusion/test/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:4145: GGML_ASSERT(op == GGML_OP_GET_ROWS || (!ggml_is_quantized(src0->type) && (src1 == nullptr || !ggml_is_quantized(src1->type)))) failed
ptrace: Operation not permitted.
No stack.
The program is not being run.

Quantized models fail only when an embedding or a LoRA is used; without either, the program runs just fine.

0cc4m commented 1 month ago

> > No problem, I added LEAKY_RELU. Please try it.
>
> Upscaling works fine now. However, I've also just found out that quantized models don't seem to work on my PC:
>
> Vulkan0: AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | warp size: 64
> /home/daniandtheweb/Applications/stable-diffusion/test/stable-diffusion.cpp/ggml/src/ggml-vulkan.cpp:4145: GGML_ASSERT(op == GGML_OP_GET_ROWS || (!ggml_is_quantized(src0->type) && (src1 == nullptr || !ggml_is_quantized(src1->type)))) failed
> ptrace: Operation not permitted.
> No stack.
> The program is not being run.
>
> Quantized models fail only when an embedding or a LoRA is used; without either, the program runs just fine.

Yeah, I think embeddings and LoRAs only working on non-quantized models is expected. Does that work on CPU/CUDA?

daniandtheweb commented 1 month ago

On CPU, LoRA + quantized model works (I can't test embeddings, as for some reason I can't get them to work at all on CPU).

Green-Sky commented 1 month ago

The required ops were merged upstream: https://github.com/ggerganov/ggml/pull/904

0cc4m commented 1 month ago

> I got an error when using an fp8 model:
>
> ggml/src/ggml-vulkan.cpp:2111: GGML_ASSERT(idx < vk_instance.device_indices.size()) failed

That's a generic error meaning the Vulkan device you requested doesn't exist. If you didn't request one manually (via the environment variable GGML_VK_VISIBLE_DEVICES), then it means you don't have a Vulkan device. Missing driver, maybe? Try running vulkaninfo --summary to check which devices are available.
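For example (the device index 0 and the model filename are illustrative):

vulkaninfo --summary
GGML_VK_VISIBLE_DEVICES=0 ./sd -m model.safetensors -p "a photo of a cat"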

rhjdvsgsgks commented 1 month ago

> > I got an error when using an fp8 model:
> >
> > ggml/src/ggml-vulkan.cpp:2111: GGML_ASSERT(idx < vk_instance.device_indices.size()) failed
>
> That's a generic error meaning the Vulkan device you requested doesn't exist. If you didn't request one manually (via the environment variable GGML_VK_VISIBLE_DEVICES), then it means you don't have a Vulkan device. Missing driver, maybe? Try running vulkaninfo --summary to check which devices are available.

Sorry for the confusion. I forgot to pick https://github.com/Cloudwalk9/stable-diffusion.cpp/commit/a4f071a3188f6fc967ab3180235f2096fe1a02d8. After applying that, I got an error from sd.cpp itself (https://github.com/leejet/stable-diffusion.cpp/issues/329#issuecomment-2271714386), which is not an error in the ggml Vulkan backend.

msglm commented 1 month ago

Attempting to compile this, I'm getting quite a lot of errors, all of a similar type:

/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/source/ggml/src/ggml-vulkan.cpp:1108:109: error: ‘matmul_f32_f16_aligned_len’ was not declared in this scope; did you mean ‘matmul_f32_aligned_len’?
 1108 |         ggml_vk_create_pipeline(ctx, ctx->device->pipeline_matmul_f32_f16->a_l, "matmul_f32_f16_aligned_l", matmul_f32_f16_aligned_len, matmul_f32_f16_aligned_data, "main", 3, sizeof(vk_mat_mat_push_constants), l_wg_denoms, warptile_l, l_align);
      |                                                                                                             ^~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                                                             matmul_f32_aligned_len
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/source/ggml/src/ggml-vulkan.cpp:1108:137: error: ‘matmul_f32_f16_aligned_data’ was not declared in this scope; did you mean ‘matmul_f32_aligned_data’?
 1108 |         ggml_vk_create_pipeline(ctx, ctx->device->pipeline_matmul_f32_f16->a_l, "matmul_f32_f16_aligned_l", matmul_f32_f16_aligned_len, matmul_f32_f16_aligned_data, "main", 3, sizeof(vk_mat_mat_push_constants), l_wg_denoms, warptile_l, l_align);
      |                                                                                                                                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                                                                                                                         matmul_f32_aligned_data
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/source/ggml/src/ggml-vulkan.cpp:1196:97: error: ‘matmul_id_f32_f32_len’ was not declared in this scope; did you mean ‘matmul_f32_fp32_len’?
 1196 |         ggml_vk_create_pipeline(ctx, ctx->device->pipeline_matmul_id_f32->l, "matmul_id_f32_l", matmul_id_f32_f32_len, matmul_id_f32_f32_data, "main", 4, sizeof(vk_mat_mat_id_push_constants), l_wg_denoms, warptile_l, 1);
      |                                                                                                 ^~~~~~~~~~~~~~~~~~~~~
      |                                                                                                 matmul_f32_fp32_len
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/source/ggml/src/ggml-vulkan.cpp:1196:120: error: ‘matmul_id_f32_f32_data’ was not declared in this scope; did you mean ‘matmul_f32_fp32_data’?
 1196 |         ggml_vk_create_pipeline(ctx, ctx->device->pipeline_matmul_id_f32->l, "matmul_id_f32_l", matmul_id_f32_f32_len, matmul_id_f32_f32_data, "main", 4, sizeof(vk_mat_mat_id_push_constants), l_wg_denoms, warptile_l, 1);
      |                                                                                                                        ^~~~~~~~~~~~~~~~~~~~~~
      |                                                                                                                        matmul_f32_fp32_data

I did run the Python script to generate the shader headers, so those are in ggml's src/ directory, as expected. Perhaps the script isn't generating the right shaders for my device? I'm using an Nvidia GTX 1050 Ti with the proprietary drivers (I hope to try this with Nouveau using NVK at a later date as well).

Full build log using Guix

Any help to get this working would be appreciated.

0cc4m commented 1 month ago

@msglm It looks like the ggml version you're using is outdated. You have to use an up-to-date one that contains the Vulkan operators I've implemented for running stable diffusion. As part of that update, the Python script that generated shaders was deleted and replaced with a C++ program that Make/CMake calls automatically during the build.
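For reference, the build invokes the generator roughly like this (paths illustrative; the exact command appears in the CMake output further down in this thread):

vulkan-shaders-gen --glslc /usr/bin/glslc --input-dir ggml/src/vulkan-shaders --output-dir build/vulkan-shaders.spv --target-hpp build/ggml-vulkan-shaders.hpp --target-cpp build/ggml-vulkan-shaders.cpp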

msglm commented 1 month ago

@0cc4m Your suggestion didn't fix the core of the issue; the build still fails, seemingly for similar reasons. The problem appears to lie in the actual generation of the shaders.

The failed build log in full: guix-build-vanilla.log

The version of glslc I have on my machine is 2024.01.3.280.0 (Target: SPIR-V 1.0), since that's the tool doing the generating and heavy lifting. I decided to change FILE* spv = fopen(path.c_str(), "rb"); to FILE* spv = fopen(path.c_str(), "wb+"); in vulkan-shaders-gen.cpp to see if I could force the files to be generated if they don't exist, but the result was the following warnings:

/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:660:41: warning: ISO C++ forbids zero-size array ‘get_rows_q4_1_data’ [-Wpedantic]
  660 | extern unsigned char get_rows_q4_1_data[0];
      |                                         ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:663:36: warning: ISO C++ forbids zero-size array ‘gelu_f32_data’ [-Wpedantic]
  663 | extern unsigned char gelu_f32_data[0];
      |                                    ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:666:41: warning: ISO C++ forbids zero-size array ‘get_rows_q5_0_data’ [-Wpedantic]
  666 | extern unsigned char get_rows_q5_0_data[0];
      |                                         ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:669:41: warning: ISO C++ forbids zero-size array ‘get_rows_q5_1_data’ [-Wpedantic]
  669 | extern unsigned char get_rows_q5_1_data[0];
      |                                         ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:672:39: warning: ISO C++ forbids zero-size array ‘cpy_f16_f16_data’ [-Wpedantic]
  672 | extern unsigned char cpy_f16_f16_data[0];
      |                                       ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:675:54: warning: ISO C++ forbids zero-size array ‘matmul_id_q4_0_f32_aligned_data’ [-Wpedantic]
  675 | extern unsigned char matmul_id_q4_0_f32_aligned_data[0];
      |                                                      ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:678:50: warning: ISO C++ forbids zero-size array ‘mul_mat_vec_nc_f16_f32_data’ [-Wpedantic]
  678 | extern unsigned char mul_mat_vec_nc_f16_f32_data[0];
      |                                                  ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:681:51: warning: ISO C++ forbids zero-size array ‘matmul_id_q4_1_f32_fp32_data’ [-Wpedantic]
  681 | extern unsigned char matmul_id_q4_1_f32_fp32_data[0];
      |                                                   ^
/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp:684:46: warning: ISO C++ forbids zero-size array ‘matmul_id_q5_0_f32_data’ [-Wpedantic]
  684 | extern unsigned char matmul_id_q5_0_f32_data[0];

These warnings lead me to believe that vulkan-shaders-gen.cpp is simply not generating the shaders properly on my machine (since they end up as zero-sized arrays).

Full build log for the attempt with removed error-checking can be found here: guix-build-modded.log

EDIT: Ignore everything after "I decided to change". I forgot that rb is for reading; I thought it was for writing.

0cc4m commented 1 month ago

These are problems arising from the package manager you are using, which I'm unfamiliar with. You'd have to look into why vulkan-shaders-gen fails to compile the SPIR-V files correctly (an issue with glslc?) or to place them in the correct location on your system:

[ 15%] Built target vulkan-shaders-gen
make  -f ggml/src/CMakeFiles/ggml.dir/build.make ggml/src/CMakeFiles/ggml.dir/depend
make[2]: Entering directory '/tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build'
[ 21%] Generate vulkan shaders
cd /tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src && ../../bin/vulkan-shaders-gen --glslc /gnu/store/hkkwk7dhcsfy9r43zqi82yi3blhdb4zf-shaderc-2024.0/bin/glslc --input-dir /tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/source/ggml/src/vulkan-shaders --output-dir /tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/vulkan-shaders.spv --target-hpp /tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.hpp --target-cpp /tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/ggml-vulkan-shaders.cpp --no-clean
ggml_vulkan: Generating and compiling shaders to SPIR-V
Error opening SPIR-V file: /tmp/guix-build-stable-diffusion-cpp-vulkan-15553084e5d1a7d6abd3aac176bee95b0247c379.drv-0/build/ggml/src/vulkan-shaders.spv/matmul_f32_f16_fp32.spv (No such file or directory)
0cc4m commented 1 month ago

I also see that this branch is out of date. @sohzm Do you want to update it (@SkutteOleg has opened a PR in your fork)? Otherwise this should be closed and a new PR opened (maybe by @SkutteOleg).

sohzm commented 1 month ago

Thanks @0cc4m for pointing out the PR, I had missed it.

I've merged it and also added changes to make it work with the recent ggml commit https://github.com/ggerganov/ggml/commit/fc31d407910e0ebeee5f1c0f65cada2ad211d1da in this commit: https://github.com/leejet/stable-diffusion.cpp/pull/291/commits/41ca4d509d005213b6be95ebd46f6b9bee90263b

I'm fixing the merge conflicts to get this PR ready, but can you confirm whether the changes related to passing epsilon as a parameter to group_norm are correct, as I don't know what that's about? (The changes were in the master branch; I didn't look it up earlier, sorry to bother you.)

I've fixed the PR, and I think it can be merged now?

msglm commented 1 month ago

I was able to fix my issue with my package manager by editing ggml's vulkan-shaders-gen.cpp to use popen(command.c_str(), "r"); rather than execl("/bin/sh", "sh", "-c", command.c_str(), (char*) nullptr); to start glslc for compiling the shaders. It now builds, recognizes my GPU, and runs properly. There may be no need to upstream this fix, since Guix supports custom patches when building projects (it's a source-based package manager), though popen may be preferred since it supports forwarding command output to stdout (useful for debugging errors related to glslc).
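A minimal sketch of that substitution (surrounding context from vulkan-shaders-gen.cpp omitted; the buffer size and output forwarding are my own choices, not the upstream code):

#include <cstdio>

// original call, replaced as described above:
//   execl("/bin/sh", "sh", "-c", command.c_str(), (char*) nullptr);
FILE* pipe = popen(command.c_str(), "r");
if (pipe) {
    char buf[256];
    // forward glslc's output to stdout, which helps debug shader compile errors
    while (fgets(buf, sizeof(buf), pipe)) {
        fputs(buf, stdout);
    }
    pclose(pipe);
}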

Example of a generation: /gnu/store/nb7a2af09zayajxkj3mx2d4mpyp0bfhh-profile/bin/sd -m "./v1-5-pruned-emaonly.safetensors" --vae ./sdxl_vae.safetensors -p "Spanish Monarch Painting, queen, elegant" -o /tmp/queen.png

The Vulkan device used was: Vulkan0: NVIDIA GeForce GTX 1050 Ti (NVIDIA) | uma: 0 | fp16: 0 | warp size: 32

The generation is at a decent speed: 3 iterations per second at 512x512 and 19 iterations per second at 1024x1024 (used PonyDiffusion for this, quantized to q5_0).

Stable Diffusion v2 is very broken on my machine and created a mess using the following command: /gnu/store/nb7a2af09zayajxkj3mx2d4mpyp0bfhh-profile/bin/sd -m ./v2-1_768-ema-pruned.ckpt -p "Spanish Monarch Painting, queen, elegant" -o /tmp/queen.png

Furthermore, img2img is broken due to using way too much memory on my machine.

Still, this PR is looking promising for bringing speedy image generation to most GPUs.

sohzm commented 1 month ago

@leejet could you please review and merge this PR? If there's anything you'd like me to change or improve, please let me know. Thanks!