leejet / stable-diffusion.cpp

Stable Diffusion in pure C/C++
MIT License

GGML_ASSERT error on ROCm RX 6800 #192

Closed. nonetrix closed this issue 4 months ago.

nonetrix commented 4 months ago

I am trying to run this on my RX 6800 in an Arch Linux distrobox (running inside Arch Linux), but I get this error whenever I try to generate anything:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 6800, compute capability 10.3, VMM: no
[INFO ] stable-diffusion.cpp:142  - loading model from './models/PonyXL.safetensors'
[INFO ] model.cpp:676  - load ./models/PonyXL.safetensors using safetensors format
[INFO ] stable-diffusion.cpp:164  - Stable Diffusion XL
[INFO ] stable-diffusion.cpp:170  - Stable Diffusion weight type: f16
[WARN ] stable-diffusion.cpp:180  - !!!It looks like you are using SDXL model. If you find that the generated images are completely black, try specifying SDXL VAE FP16 Fix with the --vae parameter. You can find it here: https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/sdxl_vae.safetensors
[INFO ] stable-diffusion.cpp:306  - total params memory size = 4693.07MB (clip 1564.36MB, unet 4900.07MB, vae 94.47MB, controlnet 0.00MB)
[INFO ] stable-diffusion.cpp:310  - loading model from './models/PonyXL.safetensors' completed, taking 1.80s
[INFO ] stable-diffusion.cpp:327  - running in eps-prediction mode
[INFO ] stable-diffusion.cpp:1374 - apply_loras completed, taking 0.00s
CUDA error: shared object initialization failed
  current device: 0, in function ggml_cuda_op_flatten at /home/noah/Documents/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:10106
  hipGetLastError()
GGML_ASSERT: /home/noah/Documents/stable-diffusion.cpp/ggml/src/ggml-cuda.cu:255: !"CUDA error"
zsh: IOT instruction (core dumped)  ./build/bin/sd -M txt2img -m ./models/PonyXL.safetensors -t 16 -p "test"

It looks like I am missing some dependency, but I have installed pretty much all of the ROCm packages, hipBLAS, etc.:

extra/comgr 6.0.0-1 [installed]
    Compiler support library for ROCm LLVM
extra/hip-runtime-amd 6.0.0-1 [installed]
    Heterogeneous Interface for Portability ROCm
extra/hipblas 6.0.0-1 [installed]
    ROCm BLAS marshalling library
extra/hsa-rocr 6.0.0-2 [installed]
    HSA Runtime API and runtime for ROCm
extra/magma-hip 2.7.2-3
    Matrix Algebra on GPU and Multicore Architectures (with ROCm/HIP)
extra/onnxruntime-opt-rocm 1.16.3-6
    Cross-platform, high performance scoring engine for ML models (with ROCm and AVX2 CPU optimizations)
extra/onnxruntime-rocm 1.16.3-6 [installed]
    Cross-platform, high performance scoring engine for ML models (with ROCm)
extra/python-onnxruntime-opt-rocm 1.16.3-6
    Cross-platform, high performance scoring engine for ML models (with ROCm and AVX2 CPU optimizations)
extra/python-onnxruntime-rocm 1.16.3-6
    Cross-platform, high performance scoring engine for ML models (with ROCm)
extra/python-pytorch-opt-rocm 2.2.0-2
    Tensors and Dynamic neural networks in Python with strong GPU acceleration (with ROCm and AVX2 CPU optimizations)
extra/python-pytorch-rocm 2.2.0-2
    Tensors and Dynamic neural networks in Python with strong GPU acceleration (with ROCm)
extra/rccl 6.0.0-1 [installed]
    ROCm Communication Collectives Library
extra/rocalution 6.0.0-2 [installed]
    Next generation library for iterative sparse solvers for ROCm platform
extra/rocblas 6.0.0-1 [installed]
    Next generation BLAS implementation for ROCm platform
extra/rocfft 6.0.0-1 [installed]
    Next generation FFT implementation for ROCm
extra/rocm-clang-ocl 6.0.0-1 [installed]
    OpenCL compilation with clang compiler
extra/rocm-cmake 6.0.0-1 [installed]
    CMake modules for common build tasks needed for the ROCm software stack
extra/rocm-core 6.0.0-2 [installed]
    AMD ROCm core package (version files)
extra/rocm-dbgapi 6.0.0-1
    Support library necessary for a debugger of AMD's GPUs
extra/rocm-device-libs 6.0.0-1 [installed]
    ROCm Device Libraries
extra/rocm-hip-libraries 6.0.0-1 [installed]
    Develop certain applications using HIP and libraries for AMD platforms
extra/rocm-hip-runtime 6.0.0-1 [installed]
    Packages to run HIP applications on the AMD platform
extra/rocm-hip-sdk 6.0.0-1 [installed]
    Develop applications using HIP and libraries for AMD platforms
extra/rocm-language-runtime 6.0.0-1 [installed]
    ROCm runtime
extra/rocm-llvm 6.0.0-2 [installed]
    Radeon Open Compute - LLVM toolchain (llvm, clang, lld)
extra/rocm-ml-libraries 6.0.0-1 [installed]
    Packages for key Machine Learning libraries
extra/rocm-ml-sdk 6.0.0-1 [installed]
    develop and run Machine Learning applications optimized for AMD platforms
extra/rocm-opencl-runtime 6.0.0-1 [installed]
    OpenCL implementation for AMD
extra/rocm-opencl-sdk 6.0.0-1 [installed]
    Develop OpenCL-based applications for AMD platforms
extra/rocm-smi-lib 6.0.0-1 [installed]
    ROCm System Management Interface Library
extra/rocminfo 6.0.0-1 [installed]
    ROCm Application for Reporting System Info
extra/rocrand 6.0.0-1 [installed]
    Pseudo-random and quasi-random number generator on ROCm
extra/rocsolver 6.0.0-1 [installed]
    Subset of LAPACK functionality on the ROCm platform
extra/rocsparse 6.0.0-1 [installed]
    BLAS for sparse computation on top of ROCm
extra/rocthrust 6.0.0-1 [installed]
    Port of the Thrust parallel algorithm library atop HIP/ROCm
extra/roctracer 6.0.0-1 [installed]
    ROCm tracer library for performance tracing
nonetrix commented 4 months ago

I have also reproduced this with SD 1.5 and with an SDXL model that is not based on PonyXL (which is apparently slightly different). It does allocate VRAM, but then it just crashes.

DGdev91 commented 4 months ago

Try this before launching: export HSA_OVERRIDE_GFX_VERSION=10.3.0

It's a common workaround for the Python version of SD; maybe it works the same way here.
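
Something like this, reusing the exact invocation from your log (nothing else changed):

export HSA_OVERRIDE_GFX_VERSION=10.3.0   # tell the ROCm runtime to treat the card as gfx1030
./build/bin/sd -M txt2img -m ./models/PonyXL.safetensors -t 16 -p "test"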

nonetrix commented 4 months ago

Hm, never had to do that tbh; maybe it's handled automatically most of the time, though. I'll try that.

DGdev91 commented 4 months ago

> Hm, never had to do that tbh; maybe it's handled automatically most of the time, though. I'll try that.

In AUTOMATIC1111's webui there are some lines in the start script that do that for you, and the same goes for some apps like koboldcpp... but in theory it shouldn't be needed for your GPU.

Also, if you followed the guide on the homepage, you probably have to recompile with a different -DAMDGPU_TARGETS parameter: gfx1100 is for the 7000 series, while the codename for your GPU is gfx1030 (you can look the codenames up here; it's a bit outdated but still useful: https://llvm.org/docs/AMDGPUUsage.html).
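
If you want to double-check the name locally, rocminfo (which is already in your package list) reports the target of each agent; something like

rocminfo | grep gfx   # on an RX 6800 this should list gfx1030 among the agent names

should confirm it.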

Then, try to do

cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1030
cmake --build . --config Release

...Or maybe it's a completely different issue, but trying doesn't hurt.

nonetrix commented 4 months ago

>> Hm, never had to do that tbh; maybe it's handled automatically most of the time, though. I'll try that.
>
> In AUTOMATIC1111's webui there are some lines in the start script that do that for you, and the same goes for some apps like koboldcpp... but in theory it shouldn't be needed for your GPU.
>
> Also, if you followed the guide on the homepage, you probably have to recompile with a different -DAMDGPU_TARGETS parameter: gfx1100 is for the 7000 series, while the codename for your GPU is gfx1030 (you can look the codenames up here; it's a bit outdated but still useful: https://llvm.org/docs/AMDGPUUsage.html).
>
> Then, try to do
>
> cmake .. -G "Ninja" -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++ -DSD_HIPBLAS=ON -DCMAKE_BUILD_TYPE=Release -DAMDGPU_TARGETS=gfx1030
> cmake --build . --config Release
>
> ...Or maybe it's a completely different issue, but trying doesn't hurt.

That worked for me, and it's a lot faster, thanks. Maybe this should be added to the README, though? As a side effect, it also gets around my issue with the PyTorch backend where my GPU resets annoyingly due to a firmware blob bug that has been fixed but not yet released. That's mostly unrelated, but it only happens with ROCm or Vulkan compute, which is rather annoying; I've only triggered it with Vulkan llama.cpp and with any Stable Diffusion UI using PyTorch, yet it's fine with PyTorch LLMs. It's an annoying mess, I just wanted to rant lol.

I think I might make a terminal client for this if it isn't too hard. I'd use the iTerm image protocol, which WezTerm also supports, and maybe add Kitty support too, or other terminals, not sure. Is there some kind of HTTP API I could plug into, btw?

nonetrix commented 4 months ago

Also, my GPU is so quiet when using this compared to PyTorch: the fans barely spin up, whereas with PyTorch they max out and it's much slower.

phudtran commented 4 months ago

> Also, my GPU is so quiet when using this compared to PyTorch: the fans barely spin up, whereas with PyTorch they max out and it's much slower.

Is the cpp version much slower, or the PyTorch version?

nonetrix commented 4 months ago

PyTorch, sorry for my garbage wording and grammar.