leejet / stable-diffusion.cpp

Stable Diffusion in pure C/C++
MIT License
2.91k stars · 233 forks

CUDA cannot generate images #95

Closed wailovet closed 6 months ago

wailovet commented 7 months ago

I encountered a strange problem: after switching to CUDA, I get a pure green picture when running, but it works fine on another computer.

sd_cuda.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 0.92s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 455 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 415 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 876 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 7.42s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 157.10s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 157.12s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.65s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.66s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.66s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 164.66s
save result image to 'output.png'

(attached image: output.png)

Green-Sky commented 7 months ago

Make sure everything is the same; for example, you can check the model's hashes. You can also try different parameters to help corner the bug.
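One concrete way to run the hash check suggested here (sha256sum ships with GNU coreutils; the filename is the model from this thread):

```shell
# Run on both machines; identical files produce identical digests.
sha256sum meinamix_meinaV11-f16.gguf
```

On Windows, `certutil -hashfile meinamix_meinaV11-f16.gguf SHA256` is the equivalent.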

FSSRepo commented 7 months ago

I think it could be an issue related to the Tensor Cores.

Try uncommenting line 99 of ggml-cuda.cu:


#define GGML_CUDA_MAX_NODES 8192

// define this if you want to always fallback to MMQ kernels and not use cuBLAS for matrix multiplication
// on modern hardware, using cuBLAS is recommended as it utilizes F16 tensor cores which are very performant
// for large computational tasks. the drawback is that this requires some extra amount of VRAM:
// -  7B quantum model: +100-200 MB
// - 13B quantum model: +200-400 MB
//

#define GGML_CUDA_FORCE_MMQ  // uncomment this line and try again

// TODO: improve this to be correct for more hardware
//       for example, currently fails for GeForce GTX 1660 which is TURING arch (> VOLTA) but does not have tensor cores
//       probably other such cases, and not sure what happens on AMD hardware
#if !defined(GGML_CUDA_FORCE_MMQ)
#define CUDA_USE_TENSOR_CORES
#endif
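If editing the source is inconvenient, the same macro can usually be injected at configure time instead. A sketch, assuming stable-diffusion.cpp's CMake CUDA build (the SD_CUBLAS option name comes from the project's README of that era; verify it against your checkout):

```shell
# Define GGML_CUDA_FORCE_MMQ for every CUDA translation unit
# instead of uncommenting it in ggml-cuda.cu.
cmake -B build -DSD_CUBLAS=ON -DCMAKE_CUDA_FLAGS="-DGGML_CUDA_FORCE_MMQ"
cmake --build build --config Release
```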
wailovet commented 7 months ago

> I think it could be an issue related to the Tensor Cores.
>
> Try uncommenting line 99 of ggml-cuda.cu: […]

The running speed is much faster, but the generated image is still pure green.

sd-cuda-mmq.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 7.03s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 18 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 16 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 40 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 1.83s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 37.21s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 37.23s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.58s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.58s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.58s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 43.85s
save result image to 'output.png'
FSSRepo commented 7 months ago

Try another model and a different prompt; also try generating on the CPU to see whether it produces a coherent image. I can't think of any ideas with the limited information I have.

wailovet commented 7 months ago

> Make sure everything is the same; for example, you can check the model's hashes. You can also try different parameters to help corner the bug.

The program and model are stored on my portable hard drive. I tried f32 and f16, and also changed sample_method to LCM, and got the same output image.

wailovet commented 7 months ago

> Try another model and a different prompt; also try generating on the CPU to see whether it produces a coherent image. I can't think of any ideas with the limited information I have.

(attached image) Here are all the models I've tried; everything works fine with the CPU. Can you tell me where to add logs in the code to provide more information to help locate this problem?

wailovet commented 7 months ago

(attached images) I tried hooking here, replacing ggml_conv_2d with ggml_conv_2d_test. Then I compared the first few values of the output of the CPU version and the CUDA version.

Differences in "result"

CPU result
-0.304439 0.422194 0.0867985 -0.19692 0.223478 0.438775 -0.0987804 0.0194783 -0.650625 0.692133 -0.734613 -0.017556 1.1144 0.0192951 -0.619648 -0.0158069 -0.333611 0.840091 -1.09174 0.428399 0.341398 0.275071 -0.269062 -0.170968 -0.28541 -0.251124 0.208278 -0.29216 0.314511 -0.10386 0.0744066 0.141419

CUDA result
7.46726e-13 -2.17465e-09 8.88322e-10 -2.58623e-09 -4.62539e-10 4.44896e-09 8.31899e-10 -1.79477e-10 2.92877e-10 -1.62645e-09 1.65822e-09 -3.76766e-09 2.47073e-09 2.52588e-09 -2.88589e-10 2.86937e-10 4.25023e-11 2.44531e-10 3.76198e-10 -1.71427e-09 -1.19091e-09 2.9318e-09 -1.80849e-09 4.23559e-10 -1.52254e-09 3.87822e-09 3.07924e-10 -7.92377e-10 -1.29449e-09 -7.56863e-10 1.57558e-11 -1.30719e-09 
FSSRepo commented 7 months ago

> Differences in "result" […]

Could you compare the clip outputs (hidden state) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?

FSSRepo commented 7 months ago

You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.

wailovet commented 7 months ago

> Could you compare the clip outputs (hidden state) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?

ggml_tensor* postive = sd->get_learned_condition(work_ctx, prompt);
CPU output:  -0.387249 0.0171568 -0.054192 -0.183599 -0.0261911 -0.338466 -0.0235674 -0.187387 0.186605 -0.0903851
CUDA output: -0.387245 0.0171541 -0.0541848 -0.18359 -0.026197 -0.338474 -0.0235705 -0.187385 0.186602 -0.0903773

There may be some subtle differences, but I think the impact should be minimal.

Here is my check of the output

(attached images) positive: ✔️, negative: ✔️

struct ggml_tensor* im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true);  // im2col: ✔️
struct ggml_tensor* mma = ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]);  // mma: ✔️
struct ggml_tensor* mmb = ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]);  // mmb: ✔️
struct ggml_tensor* result = ggml_mul_mat(ctx, mma, mmb);  // result: ❌

wailovet commented 7 months ago

> You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.

I tried enabling taesd and got the attached result.

Running it on another laptop of mine generates normal images, and the efficiency is significantly improved.

FSSRepo commented 7 months ago

That seems quite challenging to debug, as it is the matrix multiplication kernel, and I can't think of a solution since I wasn't the one who wrote it.

Cyberhan123 commented 7 months ago

This is usually caused by insufficient GPU memory.

FSSRepo commented 7 months ago

> This is usually caused by insufficient GPU memory.

The user has a GTX 1070 with 8 GB of VRAM, and I can run without issues on an RTX 3050 laptop with 4 GB of VRAM.

Cyberhan123 commented 7 months ago

> This is usually caused by insufficient GPU memory.
>
> The user has a GTX 1070 with 8 GB of VRAM, and I can run without issues on an RTX 3050 laptop with 4 GB of VRAM.

Can we get the CUDA version?

wailovet commented 7 months ago

> This is usually caused by insufficient GPU memory.
>
> The user has a GTX 1070 with 8 GB of VRAM, and I can run without issues on an RTX 3050 laptop with 4 GB of VRAM.
>
> Can we get the CUDA version?

v11.8

Cyberhan123 commented 7 months ago

It doesn't look wrong. How could this happen?

wailovet commented 7 months ago

> It doesn't look wrong. How could this happen?

I'm confused too; I tried llama.cpp and it worked fine as well. Maybe I should buy a new GPU.

olumideolufy29 commented 7 months ago

I had the same issue as @wailovet. I ran inference on an NVIDIA MX150 GPU, CUDA v11.7.

Could there be some compatibility issue with Pascal GPUs?

FSSRepo commented 7 months ago

I'm not very experienced in CUDA; in fact, I'm struggling to add some features that could significantly accelerate image generation speed in CUDA. However, I'm facing many issues due to my lack of understanding in GPU engineering, so I can't shed light on the matter. I'm sorry that it's not working for some people. If I had equivalent hardware for testing, perhaps I could be of assistance.

bssrdf commented 6 months ago

Just to provide another data point and a potential fix.

I have a GTX 1070 and also got images with all green pixels. The CUDA version is 12.1.

As @wailovet showed above, the problem seems to come from the CUDA version of mul_mat. One observation: if you run ggml's test-conv2d case, it will most likely fail if your GPU has compute capability <= 7.5.

I suspect the culprit is in https://github.com/FSSRepo/ggml/blob/70474c6890c015b53dc10a2300ae35246cc73589/src/ggml-cuda.cu#L6953-L6979. Here src0 is converted to FP32 if it is not already, but src1 is never checked and converted. If you add a similar section of code to convert src1 to FP32, test-conv2d passes. Unfortunately, my first fix crashed sd even though it made test-conv2d pass; I lack the skill to make a bullet-proof fix and leave that to those who can do it robustly.

I have got a fix that works. Here is the patch.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 0d8b8d1..13e443f 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -6952,7 +6952,9 @@ inline void ggml_cuda_op_mul_mat_cublas(
     }
     else {
         float * src0_ddq_as_f32 = nullptr;
+        float * src1_ddq_as_f32 = nullptr;
         size_t src0_as = 0;
+        size_t src1_as = 0;

         if (src0->type != GGML_TYPE_F32) {
             const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src0->type);
@@ -6960,7 +6962,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             src0_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(row_diff*ne00 * sizeof(float), &src0_as); // NOLINT
             to_fp32_cuda(src0_dd_i, src0_ddq_as_f32, row_diff*ne00, stream);
         }
+        if (src1->type != GGML_TYPE_F32) {
+            // printf(" src1 is not FP32 \n");
+            const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src1->type);
+            GGML_ASSERT(to_fp32_cuda != nullptr);
+            src1_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(src1_ncols*ne10 * sizeof(float), &src1_as); // NOLINT
+            to_fp32_cuda(src1_ddf_i, src1_ddq_as_f32, src1_ncols*ne10, stream);
+        }
         const float * src0_ddf_i = src0->type == GGML_TYPE_F32 ? (const float *) src0_dd_i : src0_ddq_as_f32;
+        const float * src1_ddf1_i = src1->type == GGML_TYPE_F32 ? (const float *) src1_ddf_i : src1_ddq_as_f32;

         const float alpha = 1.0f;
         const float beta = 0.0f;
@@ -6970,12 +6980,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
                     row_diff, src1_ncols, ne10,
                     &alpha, src0_ddf_i, ne00,
-                            src1_ddf_i, ne10,
+                            src1_ddf1_i, ne10,
                     &beta,  dst_dd_i,   ldc));

         if (src0_as != 0) {
             ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
         }
+        if (src1_as != 0) {
+            ggml_cuda_pool_free(src1_ddq_as_f32, src1_as);
+        }
     }

Anyone with an old NVIDIA GPU can give it a try. It also fixes two test cases: test-conv1d and test-conv2d.
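To try the patch before it lands upstream, save it to a file (the name below is arbitrary) and apply it inside the ggml checkout:

```shell
# Dry-run first; --check reports whether the patch applies cleanly.
git apply --check fix-mul-mat-cublas.patch
git apply fix-mul-mat-cublas.patch
```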

leejet commented 6 months ago

Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

bssrdf commented 6 months ago

> Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

That'll be great! Glad I can finally try SD on the generations-old 1070. Still, it is much faster than the CPU 😄

SmallAndSoft commented 6 months ago

> Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

It has been merged now, and it fixed whisper.cpp for older GPUs. Time to fix this issue too? Thanks!

leejet commented 6 months ago

I've attempted to update this branch https://github.com/leejet/stable-diffusion.cpp/pull/134 to the latest ggml, but encountered some issues when generating images larger than 512x512. I haven't had time to pinpoint the exact cause yet.

leejet commented 6 months ago

@wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

SmallAndSoft commented 6 months ago

@leejet That fixed the issue for my GTX 1060. Thank you very much!

bssrdf commented 6 months ago

Thank you, @leejet, for bringing in this update. For some reason, SD runs much faster on the CUDA backend with this update, especially the latent-decoding step.

wailovet commented 6 months ago

> @wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

I tried running with CUDA and everything works fine now. Thank you very much!