ggml_cuda_compute_forward: SCALE failed. CUDA error: invalid configuration argument

./bin/sd -m ../models/sd-v1-4.ckpt --cfg-scale 5 --steps 30 --sampling-method euler  -H 1024 -W 1024 --seed 42 -p "fantasy medieval village world inside a glass sphere , high detail, fantasy, realistic, light effect, hyper detail, volumetric lighting, cinematic, macro, depth of field, blur, red light and clouds from the back, highly detailed epic cinematic concept art cg render made in maya, blender and photoshop, octane render, excellent composition, dynamic dramatic cinematic lighting, aesthetic, very inspirational, world inside a glass sphere by james gurney by artgerm with james jean, joe fenton and tristan eaton by ross tran, fine details, 4k resolution"

The output is:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 8 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 4: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 5: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 6: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
  Device 7: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
[INFO ] stable-diffusion.cpp:195  - loading model from '../models/sd-v1-4.ckpt'
[INFO ] model.cpp:796  - load ../models/sd-v1-4.ckpt using checkpoint format
[INFO ] stable-diffusion.cpp:235  - Version: SD 1.x 
[INFO ] stable-diffusion.cpp:266  - Weight type:                 f32
[INFO ] stable-diffusion.cpp:267  - Conditioner weight type:     f32
[INFO ] stable-diffusion.cpp:268  - Diffusion model weight type: f32
[INFO ] stable-diffusion.cpp:269  - VAE weight type:             f32
[INFO ] stable-diffusion.cpp:482  - total params memory size = 2719.24MB (VRAM 2719.24MB, RAM 0.00MB): clip 469.44MB(VRAM), unet 2155.33MB(VRAM), vae 94.47MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
[INFO ] stable-diffusion.cpp:501  - loading model from '../models/sd-v1-4.ckpt' completed, taking 11.04s
[INFO ] stable-diffusion.cpp:528  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:655  - Attempting to apply 0 LoRAs
[INFO ] stable-diffusion.cpp:1127 - apply_loras completed, taking 0.00s
[INFO ] stable-diffusion.cpp:1251 - get_learned_condition completed, taking 65 ms
[INFO ] stable-diffusion.cpp:1274 - sampling using Euler method
[INFO ] stable-diffusion.cpp:1278 - generating image: 1/1 - seed 42
ggml_cuda_compute_forward: SCALE failed
CUDA error: invalid configuration argument
  current device: 0, in function ggml_cuda_compute_forward at /root/share/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:2326
  err
/root/share/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:102: CUDA error
Aborted (core dumped)

I am using CUDA 12.2 on NVIDIA GeForce RTX 2080 Ti GPUs (8 devices, as listed in the log above).
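
A possible explanation, offered as a guess rather than a confirmed diagnosis: at 1024x1024 the attention score tensor that gets scaled may reach 2^31 elements, which no longer fits in a signed 32-bit count, so a kernel launch size derived from it could come out invalid. The sketch below only does the arithmetic. The VAE downscale factor of 8, the 8 attention heads, and the idea that the element count is held in a 32-bit int are all assumptions about SD 1.x and the SCALE kernel, not values taken from the log.

```python
# Back-of-the-envelope check. All constants are assumptions, not read from the log.
INT32_MAX = 2**31 - 1


def attention_score_elements(height, width, vae_factor=8, heads=8):
    """Element count of a heads x tokens x tokens attention score tensor.

    vae_factor=8 and heads=8 are assumed values for SD 1.x.
    """
    tokens = (height // vae_factor) * (width // vae_factor)
    return heads * tokens * tokens


def to_int32(n):
    """Wrap n the way a signed 32-bit integer would."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n >= 2**31 else n


for size in (512, 1024):
    n = attention_score_elements(size, size)
    wrapped = to_int32(n)
    print(f"{size}x{size}: elements={n}, "
          f"exceeds int32={n > INT32_MAX}, wrapped negative={wrapped < 0}")
```

If these assumptions hold, 512x512 stays well inside the 32-bit range while 1024x1024 lands exactly one past it, which would fit the failure appearing only at the larger size.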