"CUDA error" when set resolution higher than 1280 x 1280

XienXX commented 8 months ago

CUDA version: 12.3, GPU: RTX 4080 16G

The model works fine at 1024 x 1024, but if I set the resolution to 1280x1280 or above, the launch fails. See below:

1280x1280 resolution, failed:

PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.67s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 28 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
  |> | 0/20 - 0.00it/s
CUDA error: the function failed to launch on the GPU
  current device: 0, in function ggml_cuda_op_mul_mat_cublas at D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:7650
  cublasSgemm_v2(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N, row_diff, src1_ncols, ne10, &alpha, src0_ddf_i, ne00, src1_ddf1_i, ne10, &beta, dst_dd_i, ldc)
GGML_ASSERT: D:\xien\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:226: !"CUDA error"

1280x1024 resolution, worked:

PS D:\xien\stable-diffusion.cpp\build\bin\Release> .\sd.exe -m ../v2-1_768-nonema-pruned.safetensors --type f16 -p "a lovely cat" -H 1280 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 4080, compute capability 8.9, VMM: yes
[INFO ] stable-diffusion.cpp:137 - loading model from '../v2-1_768-nonema-pruned.safetensors'
[INFO ] model.cpp:641 - load ../v2-1_768-nonema-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:163 - Stable Diffusion 2.x
[INFO ] stable-diffusion.cpp:169 - Stable Diffusion weight type: f16
[INFO ] stable-diffusion.cpp:268 - total memory buffer size = 2450.99MB (clip 684.18MB, unet 1662.34MB, vae 104.47MB)
[INFO ] stable-diffusion.cpp:270 - loading model from '../v2-1_768-nonema-pruned.safetensors' completed, taking 2.69s
[INFO ] stable-diffusion.cpp:282 - running in v-prediction mode
[INFO ] stable-diffusion.cpp:1182 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1221 - get_learned_condition completed, taking 30 ms
[INFO ] stable-diffusion.cpp:1231 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1235 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 1.08it/s
[INFO ] stable-diffusion.cpp:1247 - sampling completed, taking 19.60s
[INFO ] stable-diffusion.cpp:1255 - generating 1 latent images completed, taking 19.61s
[INFO ] stable-diffusion.cpp:1257 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1267 - latent 1 decoded, taking 1.45s
[INFO ] stable-diffusion.cpp:1271 - decode_first_stage completed, taking 1.45s
[INFO ] stable-diffusion.cpp:1290 - txt2img completed in 21.09s
save result image to 'output.png'

Screenshot 2024-01-23 102337

XienXX commented 8 months ago

I switched to an RTX 5000 Ada (48G) and the model fails the same way. Please help!!

FSSRepo commented 8 months ago

It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

XienXX commented 8 months ago

> It seems to be an error in the way matrix multiplications are performed in ggml. Does it work if you do it only with CPU?

It seems not. Should I re-run cmake?

FSSRepo commented 8 months ago

@XienXX cmake .. -DSD_CUBLAS=OFF

XienXX commented 8 months ago

> @XienXX cmake .. -DSD_CUBLAS=OFF

Yep, thanks, that works, but it is way too slow XD. How can I run it on the GPU?

errnoh commented 6 months ago

Can replicate this with HIPBLAS. 768x768 works, 768x1024 works, 1024x1024 fails, 1280x1280 fails. Interesting also how 1024x1024 and 1280x1280 fail in different ways.

EDIT: Actually, that seems to happen only with the v1.5 model. SDXL works fine at 1280x1280.

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 768 -W 768
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
  |==================================================| 20/20 - 1.96it/s
[INFO ] stable-diffusion.cpp:1775 - sampling completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1783 - generating 1 latent images completed, taking 10.33s
[INFO ] stable-diffusion.cpp:1785 - decoding 1 latents
[INFO ] stable-diffusion.cpp:1795 - latent 1 decoded, taking 1.51s
[INFO ] stable-diffusion.cpp:1799 - decode_first_stage completed, taking 1.51s
[INFO ] stable-diffusion.cpp:1818 - txt2img completed in 11.86s
save result image to 'output.png'

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1024 -W 1024
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.11s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 24 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_op_scale at /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:10030
  hipGetLastError()
GGML_ASSERT: /build/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:255: !"CUDA error"
Aborted (core dumped)

[errnoh@desk:~/dev/AI/stable-diffusion.cpp]$ ./result/bin/sd -m /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors   -p "a lovely cat" -H 1280 -W 1280
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7900 XTX, compute capability 11.0, VMM: no
[INFO ] stable-diffusion.cpp:171  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors'
[INFO ] model.cpp:726  - load /storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:194  - Stable Diffusion 1.x 
[INFO ] stable-diffusion.cpp:200  - Stable Diffusion weight type: f32
[INFO ] stable-diffusion.cpp:421  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:425  - loading model from '/storage/data1/ML/StableDiffusion/models/hub/v1-5-pruned.safetensors' completed, taking 1.09s
[INFO ] stable-diffusion.cpp:442  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:553  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1608 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1718 - get_learned_condition completed, taking 25 ms
[INFO ] stable-diffusion.cpp:1734 - sampling using Euler A method
[INFO ] stable-diffusion.cpp:1738 - generating image: 1/1 - seed 42
Memory access fault by GPU node-1 (Agent handle: 0x557c4d24b7d0) on address 0x7fc5fdc8b000. Reason: Page not present or supervisor privilege.
Aborted (core dumped)

DGdev91 commented 4 months ago

Same here with HIPBLAS on an RX 7900 XT. The largest image I managed to generate with SD 1.5 is 960x1024, while with SDXL I got as far as 1920x1920 before hitting the same issue.
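
The resolutions reported in this thread can be checked against one possible explanation: the self-attention score tensor outgrowing a signed 32-bit element count. Nothing in the thread confirms this; the sketch below is only a size estimate under stated assumptions. It assumes the VAE downscales by 8, that the scores for all heads of one layer sit in a single tensor, and that the first attention level uses 5 heads for SD 2.x, 8 heads for SD 1.x, and 10 heads at half the latent resolution for SDXL. The function names and the table are made up for this illustration.

```python
# Rough size check for the largest self-attention score tensor in the UNet.
# Assumptions (NOT confirmed in this thread): VAE downscale factor of 8,
# one (tokens x tokens x heads) tensor per attention layer, and a signed
# 32-bit element count or offset somewhere in the GPU code path.

INT32_MAX = 2**31 - 1


def attention_score_elements(width, height, heads, extra_downscale=1):
    """Number of elements in a (tokens x tokens x heads) score tensor."""
    factor = 8 * extra_downscale
    tokens = (width // factor) * (height // factor)
    return tokens * tokens * heads


def predict(width, height, heads, extra_downscale=1):
    """Return 'works' if the element count fits in int32, else 'fails'."""
    n = attention_score_elements(width, height, heads, extra_downscale)
    return "works" if n <= INT32_MAX else "fails"


# (model, width, height, assumed heads, assumed extra downscale, reported outcome)
REPORTS = [
    ("SD 2.1", 1024, 1024, 5, 1, "works"),
    ("SD 2.1", 1024, 1280, 5, 1, "works"),
    ("SD 2.1", 1280, 1280, 5, 1, "fails"),
    ("SD 1.5", 768, 768, 8, 1, "works"),
    ("SD 1.5", 1024, 768, 8, 1, "works"),
    ("SD 1.5", 1024, 960, 8, 1, "works"),
    ("SD 1.5", 1024, 1024, 8, 1, "fails"),
    ("SD 1.5", 1280, 1280, 8, 1, "fails"),
    ("SDXL", 1280, 1280, 10, 2, "works"),
    ("SDXL", 1920, 1920, 10, 2, "works"),
]

for model, w, h, heads, ds, reported in REPORTS:
    n = attention_score_elements(w, h, heads, ds)
    print(f"{model:7s} {w}x{h}: {n:>13,d} elements, "
          f"predicted {predict(w, h, heads, ds)}, reported {reported}")
```

Under these assumptions the estimate agrees with every outcome reported above, which would also fit failures that do not depend on how much VRAM the card has. It does not show where in ggml such a limit would be hit, so treat it as a lead to investigate rather than a diagnosis.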

leejet / stable-diffusion.cpp

"CUDA error" when set resolution higher than 1280 x 1280 #156