Closed wailovet closed 6 months ago

I encountered a strange problem. After using CUDA, I got a pure green picture when running, but it works fine on another computer.
Make sure everything is the same; you can check the hashes of the model files, for example. Also, you can try different parameters to help narrow down the bug.
I think it could be an issue related to the Tensor Cores. Try uncommenting line 99 of ggml-cuda.cu:
#define GGML_CUDA_MAX_NODES 8192
// define this if you want to always fallback to MMQ kernels and not use cuBLAS for matrix multiplication
// on modern hardware, using cuBLAS is recommended as it utilizes F16 tensor cores which are very performant
// for large computational tasks. the drawback is that this requires some extra amount of VRAM:
// - 7B quantum model: +100-200 MB
// - 13B quantum model: +200-400 MB
//
#define GGML_CUDA_FORCE_MMQ // uncomment this line and try again
// TODO: improve this to be correct for more hardware
// for example, currently fails for GeForce GTX 1660 which is TURING arch (> VOLTA) but does not have tensor cores
// probably other such cases, and not sure what happens on AMD hardware
#if !defined(GGML_CUDA_FORCE_MMQ)
#define CUDA_USE_TENSOR_CORES
#endif
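As the TODO above notes, the compute capability alone cannot identify tensor cores: the GTX 16xx series reports capability 7.5, the same as the RTX 20xx cards, yet ships without them. Below is a hypothetical sketch of the kind of device check the TODO is asking for (device_has_tensor_cores is an illustrative name, not an existing ggml function):

#include <stdbool.h>
#include <string.h>
#include <cuda_runtime.h>

// Sketch of a CUDA_USE_TENSOR_CORES gate that consults the device name as
// well as the compute capability, since Turing GTX 16xx parts report 7.5
// but have no tensor cores.
static bool device_has_tensor_cores(int device) {
    struct cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        return false;
    }
    if (prop.major < 7) {
        return false; // pre-Volta: no tensor cores at all
    }
    if (strstr(prop.name, "GTX 16") != NULL) {
        return false; // TU11x: Turing without tensor cores
    }
    return true;
}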
> I think it could be an issue related to the Tensor Cores. Try uncommenting line 99 of ggml-cuda.cu: […]
The running speed is much faster, but the generated image is still pure green.
sd-cuda-mmq.exe -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
n_threads: 6
mode: txt2img
model_path: meinamix_meinaV11-f16.gguf
output_path: output.png
init_img:
prompt: 1girl
negative_prompt:
cfg_scale: 7.00
width: 512
height: 512
sample_method: euler_a
schedule: default
sample_steps: 20
strength: 0.75
rng: cuda
seed: 42
batch_count: 1
System Info:
BLAS = 1
SSE3 = 1
AVX = 1
AVX2 = 1
AVX512 = 0
AVX512_VBMI = 0
AVX512_VNNI = 0
FMA = 1
NEON = 0
ARM_FMA = 0
F16C = 1
FP16_VA = 0
WASM_SIMD = 0
VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO] stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 0: sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 1: sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 2: sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv 3: sd.vocab.tokens arr
[INFO] stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO] stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887 - clip params backend buffer size = 236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size = 1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size = 95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO] stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO] stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 7.03s
[INFO] stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO] stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521 - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 18 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521 - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 16 ms
[INFO] stable-diffusion.cpp:4681 - get_learned_condition completed, taking 40 ms
[INFO] stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO] stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
|==================================================| 20/20 - 1.83s/it
[INFO] stable-diffusion.cpp:4706 - sampling completed, taking 37.21s
[INFO] stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 37.23s
[INFO] stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.58s
[INFO] stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.58s
[INFO] stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.58s
[INFO] stable-diffusion.cpp:4735 - txt2img completed in 43.85s
save result image to 'output.png'
Try another model and a different prompt, and try generating on the CPU to see whether it produces a coherent image. I can't think of anything else with the limited information I have.
> Make sure everything is the same; you can check the hashes of the model files, for example. Also, you can try different parameters to help narrow down the bug.
The program and model are stored on my portable hard drive, so the files are identical on both machines. I tried f32 and f16, and changed sample_method to LCM, and got the same output image.
> Try another model and a different prompt, and try generating on the CPU to see whether it produces a coherent image. I can't think of anything else with the limited information I have.
Here are all the models I've tried; everything works fine on the CPU. Can you tell me where to add logging in the code to provide more information for locating this problem?
I tried hooking in here, replacing ggml_conv_2d with ggml_conv_2d_test, then compared the first few values of the output between the CPU version and the CUDA version.
Differences in "result"
CPU result
-0.304439 0.422194 0.0867985 -0.19692 0.223478 0.438775 -0.0987804 0.0194783 -0.650625 0.692133 -0.734613 -0.017556 1.1144 0.0192951 -0.619648 -0.0158069 -0.333611 0.840091 -1.09174 0.428399 0.341398 0.275071 -0.269062 -0.170968 -0.28541 -0.251124 0.208278 -0.29216 0.314511 -0.10386 0.0744066 0.141419
CUDA result
7.46726e-13 -2.17465e-09 8.88322e-10 -2.58623e-09 -4.62539e-10 4.44896e-09 8.31899e-10 -1.79477e-10 2.92877e-10 -1.62645e-09 1.65822e-09 -3.76766e-09 2.47073e-09 2.52588e-09 -2.88589e-10 2.86937e-10 4.25023e-11 2.44531e-10 3.76198e-10 -1.71427e-09 -1.19091e-09 2.9318e-09 -1.80849e-09 4.23559e-10 -1.52254e-09 3.87822e-09 3.07924e-10 -7.92377e-10 -1.29449e-09 -7.56863e-10 1.57558e-11 -1.30719e-09
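For reference, dumps like the ones above can be produced with a few lines of hook code. This is a minimal sketch, assuming the tensor holds F32 data that has already been copied back to host memory; debug_print_tensor is an illustrative name:

#include <stdio.h>
#include "ggml.h"

// Print the first n values of an F32 tensor so the CPU and CUDA runs
// can be diffed side by side.
static void debug_print_tensor(const char * label, const struct ggml_tensor * t, int n) {
    const float * data = (const float *) t->data;
    printf("%s:", label);
    for (int i = 0; i < n && i < (int) ggml_nelements(t); ++i) {
        printf(" %g", data[i]);
    }
    printf("\n");
}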
> Differences in "result" […]
Could you compare the CLIP outputs (hidden states) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?
You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.
> Could you compare the CLIP outputs (hidden states) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?
ggml_tensor* positive = sd->get_learned_condition(work_ctx, prompt);
CPU output: -0.387249 0.0171568 -0.054192 -0.183599 -0.0261911 -0.338466 -0.0235674 -0.187387 0.186605 -0.0903851
CUDA output: -0.387245 0.0171541 -0.0541848 -0.18359 -0.026197 -0.338474 -0.0235705 -0.187385 0.186602 -0.0903773
Maybe there are some subtle differences, but I think the impact should be minimal.
Here is my check of the outputs:
positive: ✔️ negative: ✔️
struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true); // im2col: ✔️
struct ggml_tensor * mma = ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]); // mma: ✔️
struct ggml_tensor * mmb = ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]); // mmb: ✔️
struct ggml_tensor * result = ggml_mul_mat(ctx, mma, mmb); // result: ❌
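Note the scale of the two discrepancies above: the CLIP outputs differ only around the fifth decimal place (ordinary f16 rounding noise), while the CUDA mul_mat output is on the order of 1e-9 to 1e-13, i.e. garbage rather than an imprecise version of the CPU result. A small comparison helper makes that distinction explicit (a sketch; buffer names and the length parameter are illustrative):

#include <math.h>

// Largest absolute difference between two host-side float buffers,
// e.g. a CPU-computed graph output vs. the CUDA-computed one.
static float max_abs_diff(const float * cpu, const float * cuda, int n) {
    float max_d = 0.0f;
    for (int i = 0; i < n; ++i) {
        float d = fabsf(cpu[i] - cuda[i]);
        if (d > max_d) {
            max_d = d;
        }
    }
    return max_d;
}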
> You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.
I tried enabling taesd and got this result.
Running it on another laptop of mine generates normal images, and the efficiency is significantly improved.
That seems quite challenging to debug as it is the matrix multiplication kernel, and I can't think of a solution since I wasn't the one who created it.
This is usually caused by insufficient GPU memory.
> This is usually caused by insufficient GPU memory.
The user has a GTX 1070, which has 8 GB of VRAM, and I can run without issues on an RTX 3050 laptop with 4 GB of VRAM.
> The user has a GTX 1070, which has 8 GB of VRAM, and I can run without issues on an RTX 3050 laptop with 4 GB of VRAM.
Can we get the CUDA version?
> Can we get the CUDA version?
v11.8
It doesn't look wrong. How could this happen?
> It doesn't look wrong. How could this happen?
I'm confused too; I tried using llama.cpp and it worked fine as well. Maybe I should buy a new GPU.
I had the same issue as @wailovet. I ran the inference on an MX150 NVIDIA GPU, CUDA v11.7. Could there be some compatibility issue with Pascal GPUs?
I'm not very experienced in CUDA; in fact, I'm struggling to add some features that could significantly accelerate image generation speed in CUDA. However, I'm facing many issues due to my lack of understanding in GPU engineering, so I can't shed light on the matter. I'm sorry that it's not working for some people. If I had equivalent hardware for testing, perhaps I could be of assistance.
Just to provide another data point and a potential fix.
I have a GTX 1070 and also got images with all green pixels. The CUDA version is 12.1. As @wailovet showed above, the problem seems to be coming from the CUDA version of mul_mat. One observation: if you run ggml's test-conv2d case, it will most likely fail if your GPU has compute capability <= 7.5.
I suspect the culprit is in
https://github.com/FSSRepo/ggml/blob/70474c6890c015b53dc10a2300ae35246cc73589/src/ggml-cuda.cu#L6953-L6979
Here src0 is converted to FP32 if it is not already, but src1 is not checked and converted. If you add a similar section of code to convert src1 to FP32, test-conv2d will pass. Unfortunately my fix crashed sd, although it made test-conv2d pass. I lack the skill to make a bullet-proof fix and will leave that to someone who can do it robustly.
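That diagnosis would also explain why the CUDA dump above has magnitudes wildly different from the CPU result: on pre-Volta devices the multiplication presumably goes through cublasSgemm (see the patch below), which reads both inputs as FP32, so unconverted F16 bit patterns get reinterpreted as floats. A standalone demonstration of the reinterpretation, with an illustrative value:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    // Two consecutive F16 values of 1.0 (bit pattern 0x3C00) occupy the
    // same four bytes that an FP32 GEMM would read as a single float
    // (assuming little-endian hardware).
    uint16_t halves[2] = {0x3C00, 0x3C00};
    float f;
    memcpy(&f, halves, sizeof f);
    printf("%g\n", f); // prints ~0.00782681, nothing like 1.0
    return 0;
}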
I have got a fix that works. Here is the patch.
diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 0d8b8d1..13e443f 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -6952,7 +6952,9 @@ inline void ggml_cuda_op_mul_mat_cublas(
}
else {
float * src0_ddq_as_f32 = nullptr;
+ float * src1_ddq_as_f32 = nullptr;
size_t src0_as = 0;
+ size_t src1_as = 0;
if (src0->type != GGML_TYPE_F32) {
const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src0->type);
@@ -6960,7 +6962,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
src0_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(row_diff*ne00 * sizeof(float), &src0_as); // NOLINT
to_fp32_cuda(src0_dd_i, src0_ddq_as_f32, row_diff*ne00, stream);
}
+ if (src1->type != GGML_TYPE_F32) {
+ // printf(" src1 is not FP32 \n");
+ const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src1->type);
+ GGML_ASSERT(to_fp32_cuda != nullptr);
+ src1_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(src1_ncols*ne10 * sizeof(float), &src1_as); // NOLINT
+ to_fp32_cuda(src1_ddf_i, src1_ddq_as_f32, src1_ncols*ne10, stream);
+ }
const float * src0_ddf_i = src0->type == GGML_TYPE_F32 ? (const float *) src0_dd_i : src0_ddq_as_f32;
+ const float * src1_ddf1_i = src1->type == GGML_TYPE_F32 ? (const float *) src1_ddf_i : src1_ddq_as_f32;
const float alpha = 1.0f;
const float beta = 0.0f;
@@ -6970,12 +6980,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
row_diff, src1_ncols, ne10,
&alpha, src0_ddf_i, ne00,
- src1_ddf_i, ne10,
+ src1_ddf1_i, ne10,
&beta, dst_dd_i, ldc));
if (src0_as != 0) {
ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
}
+ if (src1_as != 0) {
+ ggml_cuda_pool_free(src1_ddq_as_f32, src1_as);
+ }
}
Anyone with an older NVIDIA GPU can give it a try. It also fixes two test cases: test-conv1d and test-conv2d.
Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.
> Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.
That'll be great! Glad I can finally try SD on the generations-old 1070. Still, it is much faster than the CPU 😄
> Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.
It has been merged now and has fixed whisper.cpp for older GPUs. Time to fix this issue too? Thanks!
I've attempted to update this branch https://github.com/leejet/stable-diffusion.cpp/pull/134 to the latest ggml, but encountered some issues when generating images larger than 512x512. I haven't had time to pinpoint the exact cause yet.
@wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.
@leejet That fixed the issue for my GTX 1060. Thank you very much!
Thank you, @leejet, for bringing in this update. For some reason, SD runs much faster on the CUDA backend with this update, especially the latent decoding step.
> @wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.
I tried running with CUDA and everything is fine. Thank you very much!