leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License
Extremely slow performance on Ryzen 7950X3D #7

Closed: n00mkrad closed this issue 1 year ago

n00mkrad commented 1 year ago

Running the line from the readme, I get this:

step 1 sampling completed, taking 50.97s

Compiled with cmake on Windows. Shouldn't it be a little bit faster?

klosax commented 1 year ago

See my tests here https://github.com/leejet/stable-diffusion.cpp/issues/6

leejet commented 1 year ago

Yes, I'm trying to modify GGML to make it run faster. Could you add the -v parameter to print out your System Info and Options so I can take a look?

n00mkrad commented 1 year ago

> Yes, I'm trying to modify GGML to make it run faster. Could you add the -v parameter to print out your System Info and Options so I can take a look?

Option:
    n_threads:       32
    mode:            txt2img
    model_path:      models/sd-1.5-ggml-model-q4_1.bin
    output_path:     output.png
    init_img:
    prompt:          photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale:       7.50
    width:           512
    height:          512
    sample_method:   eular a
    sample_steps:    20
    strength:        0.75
    seed:            1
System Info:
    BLAS = 0
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 0
    ARM_FMA = 0
    F16C = 0
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2500 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO]  stable-diffusion.cpp:2525 - ftype: q4_1
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO]  stable-diffusion.cpp:2570 - params ctx size =  1454.75 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size =  1454.34MB
[INFO]  stable-diffusion.cpp:2715 - loading model from 'models/sd-1.5-ggml-model-q4_1.bin' completed, taking 1.03s
[DEBUG] stable-diffusion.cpp:353  - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.72s
[INFO]  stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353  - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2752 - condition context need 1.46MB static memory, with work_size needing 0.28MB
[DEBUG] stable-diffusion.cpp:2776 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.37s
[INFO]  stable-diffusion.cpp:2796 - condition graph use 4.39MB of memory: static 1.46MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3243 - get_learned_condition completed, taking 1.10s
[INFO]  stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2848 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 42.43s
[DEBUG] stable-diffusion.cpp:2993 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet

Green-Sky commented 1 year ago

> System Info: BLAS = 0 SSE3 = 0 AVX = 0 AVX2 = 0 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 FMA = 0 NEON = 0 ARM_FMA = 0 F16C = 0 FP16_VA = 0

How did you build your sd? Some of these features should be enabled on any platform (AVX2 is available on almost all x86 CPUs out there).

n00mkrad commented 1 year ago

> > System Info: BLAS = 0 SSE3 = 0 AVX = 0 AVX2 = 0 AVX512 = 0 AVX512_VBMI = 0 AVX512_VNNI = 0 FMA = 0 NEON = 0 ARM_FMA = 0 F16C = 0 FP16_VA = 0
>
> How did you build your sd? Some of these features should be enabled on any platform (AVX2 is available on almost all x86 CPUs out there).

Installed cmake and ran the commands from the readme. I'm trying it again right now after installing CUDA and using cmake .. -DGGML_CUBLAS=ON.

Green-Sky commented 1 year ago

Also, your thread count seems excessive. Try reducing it to match the physical core count.

n00mkrad commented 1 year ago

> Also, your thread count seems excessive. Try reducing it to match the physical core count.

The default only gave me around 60% utilization. But yeah I think 32 is too much. Didn't impact performance either way though.
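
As a rough illustration of the advice above, the sketch below derives a thread count from the logical CPU count. It assumes two hardware threads per core (SMT), which matches the 7950X3D's 16 cores and 32 threads; `suggest_threads` and its `reserve` parameter are hypothetical names, not part of stable-diffusion.cpp. The result would be passed to `sd` with `-t`.

```python
import os


def suggest_threads(logical_cpus, threads_per_core=2, reserve=0):
    """Return a thread count near the physical core count.

    Assumption: every core exposes `threads_per_core` logical CPUs.
    `reserve` leaves some cores free for the rest of the system.
    """
    physical = max(1, logical_cpus // threads_per_core)
    return max(1, physical - reserve)


if __name__ == "__main__":
    # os.cpu_count() reports logical CPUs, not physical cores.
    logical = os.cpu_count() or 1
    print(f"logical CPUs: {logical}, suggested -t value: {suggest_threads(logical)}")
```

With 32 logical CPUs this suggests 16 threads. With 24 logical CPUs and `reserve=2` it suggests 10, the value Green-Sky uses later in the thread on a 12-core machine.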

n00mkrad commented 1 year ago

My compile log:

MSBuild version 17.6.3+07e294721 for .NET Framework

  1>Checking Build System
  Building Custom Rule stable-diffusion.cpp/ggml/src/CMakeLists.txt
  Compiling CUDA source file ..\..\..\ggml\src\ggml-cuda.cu...

  stable-diffusion.cpp\build\ggml\src>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\bin\nvcc.exe"  --use-local-env -ccbin "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64" -x cu   -I"stable-diffusion.cpp\ggml\src\." -I"stable-diffusion.cpp\ggml\src\..\include" -I"stable-diffusion.cpp\ggml\src\..\include\ggml" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include" -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.8\include"     --keep-dir x64\Release  -maxrregcount=0  --machine 64 --compile -cudart static --generate-code=arch=compute_52,code=[compute_52,sm_52] --generate-code=arch=compute_61,code=[compute_61,sm_61] -Xcompiler="/EHsc -Ob2"   -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release\"" -D_MBCS -DWIN32 -D_WINDOWS -DNDEBUG -DGGML_USE_CUBLAS -D"CMAKE_INTDIR=\"Release\"" -Xcompiler "/EHsc /W3 /nologo /O2 /Fdstable-diffusion.cpp\build\ggml\src\Release\ggml.pdb /FS   /MD /GR" -o ggml.dir\Release\ggml-cuda.obj "stable-diffusion.cpp\ggml\src\ggml-cuda.cu"
  ggml-cuda.cu
cl : command line  warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.c
cl : command line  warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.vcxproj -> stable-diffusion.cpp\build\ggml\src\Release\ggml.lib
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
  stable-diffusion.cpp
  stable-diffusion.vcxproj -> stable-diffusion.cpp\build\Release\stable-diffusion.lib
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt
  main.cpp
stable-diffusion.cpp\stb_image_write.h(776,13): warning C4996: 'sprintf': This function or variable may be unsafe. Consider using sprintf_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [stable-diffusion.cpp\build\sd.vcxproj]
  sd.vcxproj -> stable-diffusion.cpp\build\Release\sd.exe
  Building Custom Rule stable-diffusion.cpp/CMakeLists.txt

It's at 30 seconds per sampling step now. I wonder about this part:

 cl : command line  warning D9002: ignoring unknown option '-mfma' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
  ggml.c
cl : command line  warning D9002: ignoring unknown option '-mf16c' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]
cl : command line  warning D9002: ignoring unknown option '-mavx2' [stable-diffusion.cpp\build\ggml\src\ggml.vcxproj]

Does that imply it failed to enable AVX/AVX2 stuff?

Green-Sky commented 1 year ago

> It's at 30 seconds per sampling step now. I wonder about this part:
>
> ....
>
> Does that imply it failed to enable AVX/AVX2 stuff?

No, I think that's the CUDA compiler.

Green-Sky commented 1 year ago

Or maybe not? Hm, what is your platform, and what platform are you building for?

Green-Sky commented 1 year ago

Ran it with my build and set the thread count to 10 (I have 12 physical cores):

$ ./sd -t 10 -m ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin -p "photo of a lovely cat, high quality" -n "blurry, ugly, jpeg compression, artifacts, unsharp" -v
Option:
    n_threads:       10
    mode:            txt2img
    model_path:      ../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin
    output_path:     output.png
    init_img:
    prompt:          photo of a lovely cat, high quality
    negative_prompt: blurry, ugly, jpeg compression, artifacts, unsharp
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   eular a
    sample_steps:    20
    strength:        0.75
    seed:            42
System Info:
    BLAS = 0
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[INFO]  stable-diffusion.cpp:2500 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin'
[DEBUG] stable-diffusion.cpp:2508 - verifying magic
[DEBUG] stable-diffusion.cpp:2519 - loading hparams
[INFO]  stable-diffusion.cpp:2525 - ftype: q8_0
[DEBUG] stable-diffusion.cpp:2531 - loading vocab
[DEBUG] stable-diffusion.cpp:2569 - ggml tensor size = 240 bytes
[INFO]  stable-diffusion.cpp:2570 - params ctx size =  1618.72 MB
[DEBUG] stable-diffusion.cpp:2587 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:2602 - loading weights
[DEBUG] stable-diffusion.cpp:2712 - model size =  1618.31MB
[INFO]  stable-diffusion.cpp:2715 - loading model from '../models/v1-5-pruned-emaonly-ggml-model-q8_0.bin' completed, taking 0.44s
[DEBUG] stable-diffusion.cpp:353  - split prompt "photo of a lovely cat, high quality" to tokens ["photo</w>", "of</w>", "a</w>", "lovely</w>", "cat</w>", ",</w>", "high</w>", "quality</w>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO]  stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[DEBUG] stable-diffusion.cpp:353  - split prompt "blurry, ugly, jpeg compression, artifacts, unsharp" to tokens ["blurry</w>", ",</w>", "ugly</w>", ",</w>", "<|endoftext|>", "compression</w>", ",</w>", "artifacts</w>", ",</w>", "<|endoftext|>", ]
[DEBUG] stable-diffusion.cpp:2750 - condition context need 1.41MB static memory, with work_size needing 0.24MB
[DEBUG] stable-diffusion.cpp:2775 - building condition graph completed: 633 nodes, 223 leafs
[DEBUG] stable-diffusion.cpp:2783 - computing condition graph completed, taking 0.05s
[INFO]  stable-diffusion.cpp:2793 - condition graph use 4.34MB of memory: static 1.41MB, dynamic = 2.93MB
[DEBUG] stable-diffusion.cpp:2797 - 236544 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3243 - get_learned_condition completed, taking 0.10s
[INFO]  stable-diffusion.cpp:3253 - start sampling
[DEBUG] stable-diffusion.cpp:2846 - diffusion context need 69.53MB static memory, with work_size needing 67.50MB
[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 2 sampling completed, taking 15.68s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 3 sampling completed, taking 15.83s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 4 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 5 sampling completed, taking 15.93s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 6 sampling completed, taking 15.79s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 7 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 8 sampling completed, taking 15.66s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 9 sampling completed, taking 15.71s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 10 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 11 sampling completed, taking 15.78s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 12 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 13 sampling completed, taking 15.85s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 14 sampling completed, taking 15.90s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 15 sampling completed, taking 15.84s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 16 sampling completed, taking 16.07s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 17 sampling completed, taking 15.88s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 18 sampling completed, taking 15.98s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 19 sampling completed, taking 15.89s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:2989 - step 20 sampling completed, taking 15.76s
[DEBUG] stable-diffusion.cpp:2990 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:2994 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3001 - diffusion graph use 623.74MB of memory: static 69.53MB, dynamic = 554.21MB
[DEBUG] stable-diffusion.cpp:3005 - 65536 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3258 - sampling completed, taking 316.81s
[DEBUG] stable-diffusion.cpp:3153 - vae context need 1153.12MB static memory, with work_size needing 1152.00MB
[DEBUG] stable-diffusion.cpp:3179 - computing vae graph completed, taking 50.49s
[INFO]  stable-diffusion.cpp:3188 - vae graph use 2177.12MB of memory: static 1153.12MB, dynamic = 1024.00MB
[DEBUG] stable-diffusion.cpp:3192 - 3145728 bytes of dynamic memory has not been released yet
[INFO]  stable-diffusion.cpp:3265 - decode_first_stage completed, taking 50.53s
[INFO]  stable-diffusion.cpp:3266 - txt2img completed in 367.45s, with a runtime memory usage of 2177.12MB and parameter memory usage of 1618.58MB
save result image to 'output.png'

[attached image: output]

Edit: also, I used q8_0 instead of q4_1.
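
To compare runs like the two in this thread, the per-step timings can be pulled out of the `-v` log. The sketch below is a hypothetical helper, not part of stable-diffusion.cpp; it only assumes the log line format shown above (`step N sampling completed, taking X.XXs`).

```python
import re

# Matches lines such as:
# [INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s
STEP_LINE = re.compile(r"step (\d+) sampling completed, taking ([\d.]+)s")


def step_times(log_text):
    """Return the per-step sampling times (in seconds) found in a log."""
    return [float(seconds) for _, seconds in STEP_LINE.findall(log_text)]


def mean(values):
    """Arithmetic mean, or 0.0 for an empty list."""
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    sample = (
        "[INFO]  stable-diffusion.cpp:2989 - step 1 sampling completed, taking 15.96s\n"
        "[INFO]  stable-diffusion.cpp:2989 - step 2 sampling completed, taking 15.68s\n"
    )
    times = step_times(sample)
    print(f"{len(times)} steps, mean {mean(times):.2f}s per step")
```

For the run above, the log reports 316.81s for 20 steps, which works out to about 15.8s per step with AVX2 enabled, against 42.43s for the first step of the build without it.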

n00mkrad commented 1 year ago

Well this part is definitely off:

System Info:
    BLAS = 1
    SSE3 = 0
    AVX = 0
    AVX2 = 0
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 0
    NEON = 0
    ARM_FMA = 0
    F16C = 0
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0

I assume the lack of AVX is a compiler issue. But no idea how to fix that, it seems to be up to date.

Green-Sky commented 1 year ago

Oh, tell me more about your build environment and process, because it is trying to pass the GCC/Clang AVX2 flag (-mavx2) to an MSVC compiler.
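
The D9002 warnings in the compile log are how this mismatch shows up: `cl` reports each GCC-style option it drops. A small sketch of scanning a build log for them follows. The helper names are hypothetical, and the flag table is an assumption based on MSVC's documented `/arch` switches (MSVC has no separate FMA or F16C switch, so those entries map to nothing).

```python
import re

# Matches the MSVC "unknown option" warning as it appears in the compile log.
D9002 = re.compile(r"warning D9002\s*:\s*ignoring unknown option '([^']+)'")

# GCC/Clang flag -> closest MSVC switch. None means MSVC has no direct
# equivalent switch for that flag.
MSVC_EQUIVALENT = {
    "-mavx": "/arch:AVX",
    "-mavx2": "/arch:AVX2",
    "-mfma": None,
    "-mf16c": None,
}


def ignored_flags(build_log):
    """Return the flags cl reported as ignored, in order, without duplicates."""
    seen = []
    for flag in D9002.findall(build_log):
        if flag not in seen:
            seen.append(flag)
    return seen


def suggest_msvc_flags(flags):
    """Return the MSVC switches that correspond to the ignored flags."""
    return sorted({MSVC_EQUIVALENT[f] for f in flags if MSVC_EQUIVALENT.get(f)})


if __name__ == "__main__":
    log = (
        "cl : command line  warning D9002: ignoring unknown option '-mfma'\n"
        "cl : command line  warning D9002: ignoring unknown option '-mavx2'\n"
    )
    flags = ignored_flags(log)
    print("ignored:", flags)
    print("MSVC switches to consider:", suggest_msvc_flags(flags))
```

Since `/arch:AVX2` is a superset of `/arch:AVX`, passing only the former is normally enough when both are suggested.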

Green-Sky commented 1 year ago

If you poke around in your build directory, you should find a CMakeCache.txt. Inside it you can add /arch:AVX2 to CMAKE_CXX_FLAGS:STRING= and CMAKE_C_FLAGS:STRING=
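
To confirm that a hand edit like this is still present after rebuilding, the cache file can be read back. The sketch below relies only on the `NAME:TYPE=VALUE` line format of CMakeCache.txt; the function names and the sample flag values are hypothetical.

```python
def parse_cmake_cache(text):
    """Parse CMakeCache.txt content into a {name: value} dict."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' and '//' comment lines CMake writes.
        if not line or line.startswith(("#", "//")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue
        name = key.split(":", 1)[0]  # drop the ':TYPE' suffix
        entries[name] = value
    return entries


def has_flag(entries, variable, flag):
    """True if `flag` is one of the space-separated flags in `variable`."""
    return flag in entries.get(variable, "").split()


if __name__ == "__main__":
    # Hypothetical cache excerpt; real default flag values may differ.
    sample = (
        "//Flags used by the C compiler during all build types.\n"
        "CMAKE_C_FLAGS:STRING=/DWIN32 /D_WINDOWS /W3 /arch:AVX2\n"
        "CMAKE_CXX_FLAGS:STRING=/DWIN32 /D_WINDOWS /W3 /GR /EHsc\n"
    )
    cache = parse_cmake_cache(sample)
    for var in ("CMAKE_C_FLAGS", "CMAKE_CXX_FLAGS"):
        print(var, "has /arch:AVX2:", has_flag(cache, var, "/arch:AVX2"))
```

In practice the text would come from reading `build/CMakeCache.txt`; the path depends on where the build directory was created.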

n00mkrad commented 1 year ago

> Oh, tell me more about your build environment and process, because it is trying to pass the GCC/Clang AVX2 flag (-mavx2) to an MSVC compiler.

Windows 10 22H2, VS 2022 with Build Tools installed, CUDA Toolkit 11.8 installed, cmake installed using their setup.

> If you poke around in your build directory, you should find a CMakeCache.txt. Inside it you can add /arch:AVX2 to CMAKE_CXX_FLAGS:STRING= and CMAKE_C_FLAGS:STRING=

That didn't seem to change anything. I ran cmake --build . --config Release again and same result.

Green-Sky commented 1 year ago

Very funky. @leejet, I will probably make a PR later with improved CMake (by copying from llama.cpp).

leejet commented 1 year ago

> Very funky. @leejet, I will probably make a PR later with improved CMake (by copying from llama.cpp).

The latest GGML code has already fixed this issue. I will rebase my code onto the latest GGML code.

leejet commented 1 year ago

> Does that imply it failed to enable AVX/AVX2 stuff?

@n00mkrad the issue has been fixed. You can pull the latest code and give it a try. Don't forget to update the submodule as well.

git pull origin master
git submodule update

n00mkrad commented 1 year ago

Works. Still very slow, but I guess that's expected. About 7 sec per step with CuBLAS, 30 sec without.
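
For a sense of scale, the per-step figures reported here can be turned into a speedup and a total sampling time. This is plain arithmetic on the numbers from the thread, assuming the 20 sampling steps shown in the options dump; the function names are hypothetical.

```python
def speedup(before_s, after_s):
    """How many times faster `after_s` is than `before_s`."""
    return before_s / after_s


def total_sampling_time(seconds_per_step, steps=20):
    """Estimated sampling time for a run, excluding model load and VAE decode."""
    return seconds_per_step * steps


if __name__ == "__main__":
    cpu, cublas = 30.0, 7.0  # seconds per step, as reported above
    print(f"speedup: {speedup(cpu, cublas):.1f}x")
    print(f"20 steps: {total_sampling_time(cublas):.0f}s vs {total_sampling_time(cpu):.0f}s")
```

That is roughly a 4.3x speedup from cuBLAS, or about 140s against 600s of sampling for a 20-step image.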