leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License

"CUDA error" when set resolution higher than 1280 x 1280 #156

Open XienXX opened 8 months ago

XienXX commented 8 months ago

CUDA version: 12.3, GPU: RTX 4080 16G

The model works fine at 1024 x 1024, but if I set the resolution to 1280x1280 or above, the launch fails. See below.

1280x1280 resolution, failed:

PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.67s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 28 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
  |> | 0/20 - 0.00it/s
CUDA error: the function failed to launch on the GPU
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:7650
  cublasSgemm_v2(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
GGML_ASSERT: D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:226: !"CUDA error"

1280x1024 resolution, worked:

PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.69s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 30 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 1.08it/s
[INFO ] stable-diffusion.cpp:1247 - sampling completed, taking 19.60s
[INFO ] stable-diffusion.cpp:1255 - generating 1 latent images completed, taking 19.61s
[INFO ] stable-diffusion.cpp:1257 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1267 - latent 1 decoded, taking 1.45s
[INFO ] stable-diffusion.cpp:1271 - decode_first_stage completed, taking 1.45s
[INFO ] stable-diffusion.cpp:1290 - txt2img completed in 21.09s
save result image to 'output.png'

image (Screenshot 2024-01-23 102337)

XienXX commented 8 months ago

I switched to an RTX 5000 Ada (48G) and the model fails in the same way. Please help!

FSSRepo commented 8 months ago

It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

XienXX commented 8 months ago

> It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

image

It seems not. Should I rerun cmake?

FSSRepo commented 8 months ago

@XienXX cmake .. -DSD_CUBLAS=OFF

XienXX commented 8 months ago

> @XienXX cmake .. -DSD_CUBLAS=OFF

image

Yep, thanks, that works, but it is far too slow XD. How can I run it on the GPU?

errnoh commented 6 months ago

Can replicate this with HIPBLAS. 768x768 works, 768x1024 works, 1024x1024 fails, 1280x1280 fails. Interesting also how 1024x1024 and 1280x1280 fail in different ways.

EDIT: Actually, that seems to happen only with the v1.5 model. SDXL works fine at 1280x1280.

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 768 -W 768
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 1.96it/s
[INFO ] stable-diffusion.cpp:1775 - sampling completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1783 - generating 1 latent images completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1785 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1795 - latent 1 decoded, taking 1.51s
[INFO ] stable-diffusion.cpp:1799 - decode_first_stage completed, taking 1.51s
[INFO ] stable-diffusion.cpp:1818 - txt2img completed in 11.86s
save result image to 'output.png'

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1024 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.11s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_op_scale at /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:10030
  hipGetLastError()
GGML_ASSERT: /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:255: !"CUDA error"
Aborted (core dumped)

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 25 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
Memory access fault by GPU node-1 (Agent handle: 0x557c4d24b7d0) on address 0x7fc5fdc8b000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

DGdev91 commented 4 months ago

Same here with HIPBLAS on an RX 7900XT. The largest image I managed to generate with SD 1.5 is 960x1024, while with SDXL I managed a 1920x1920 picture before running into the same issue.
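
A rough way to sanity-check the resolutions reported in this thread is to count the elements in the largest self-attention score tensor (heads x tokens x tokens) and compare that against the signed 32-bit integer limit. This is only a sketch of one possible explanation, not a confirmed diagnosis. The assumptions are: the failure is tied to an element count or index overflowing a 32-bit int somewhere in the GPU path; the batch size is 1; and the head counts at the first attention level are 8 for SD 1.5, 5 for SD 2.1 (320 channels, 64 per head), and 10 for SDXL (640 channels, 64 per head, after one extra 2x downsample). The function name `attention_score_elements` is made up for this illustration.

```python
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit integer can hold


def attention_score_elements(width, height, heads, extra_downsample=1):
    """Elements in a self-attention score tensor of shape heads x tokens x tokens.

    The VAE shrinks each image side by 8. extra_downsample is the additional
    UNet downsampling applied before the first attention layer (assumed to be
    1 for SD 1.5 / SD 2.1 and 2 for SDXL).
    """
    factor = 8 * extra_downsample
    tokens = (width // factor) * (height // factor)
    return heads * tokens * tokens


# (label, width, height, heads, extra_downsample, outcome reported in this thread)
reports = [
    ("SD 2.1 1280x1024", 1024, 1280, 5, 1, "works"),
    ("SD 2.1 1280x1280", 1280, 1280, 5, 1, "fails"),
    ("SD 1.5 768x1024", 768, 1024, 8, 1, "works"),
    ("SD 1.5 960x1024", 960, 1024, 8, 1, "works"),
    ("SD 1.5 1024x1024", 1024, 1024, 8, 1, "fails"),
    ("SDXL 1920x1920", 1920, 1920, 10, 2, "works"),
]

for label, w, h, heads, ds, outcome in reports:
    n = attention_score_elements(w, h, heads, ds)
    fits = n <= INT32_MAX
    print(f"{label}: {n:,} elements, fits in int32: {fits}, reported: {outcome}")
```

Under these assumptions, every configuration reported as working stays at or below the int32 limit and every one reported as failing exceeds it (SD 1.5 at 1024x1024 lands on exactly 2^31, one past the limit). That fits the pattern of resolutions above, but it does not prove where the overflow occurs, or that an overflow is the cause at all.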