leejet / stable-diffusion.cpp

Stable Diffusion and Flux in pure C/C++
MIT License
3.38k stars · 287 forks

OpenCL seems to almost work #48

Open · Happenedtostumblein opened this issue 1 year ago

Happenedtostumblein commented 1 year ago

@leejet @Green-Sky @ggerganov

I do not know C++ and do not have a solid grasp of how ggml works, but building the repo with cmake -DGGML_CLBLAST=ON seems to work: GPU utilization goes up and it's very fast (10s vs 80s per step on a higher-end CPU). It completes all the steps and finishes sampling too, but then crashes at line 1505 of ggml-opencl.

If it is just a matter of spending time to make this work, is it simple enough for one of you to explain what needs to be done? If so, I'd be happy to give it a shot, but I don't know where to start.

My limited understanding is that sampling is what takes all the effort, so is there a way to switch from the GPU to the CPU just to save the file? Or am I missing some context/knowledge?

Edit: fixed a typo. The flag used is CLBLAST, not OPENBLAS.

ggerganov commented 1 year ago

Try this patch: https://github.com/ggerganov/llama.cpp/commit/6460f758dbd472653296044d36bed8c4554988f5

Happenedtostumblein commented 1 year ago

@ggerganov That worked, thank you!

Is it proper protocol to submit a pull request for a one-liner?

Edit: FYI, it allows the entire process to complete, but does not actually make use of the GPU.

Happenedtostumblein commented 1 year ago

FYI: It does work, but GPU utilization is very low. Got any more simple speedups in your pocket? @ggerganov

daniandtheweb commented 1 year ago

I'm sorry to disappoint you, but OpenBLAS doesn't use the GPU to accelerate processing; it runs on the CPU itself. If anything, you should try -DGGML_CLBLAST=ON in order to use OpenCL, but even that wouldn't fully work, as the developer hasn't integrated any GPU acceleration into the program yet.
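
For context, here is a rough sketch of why utilization can look low even with CLBlast enabled: in ggml of this era, only matrix multiplications that pass a size/contiguity check get offloaded to the BLAS backend, and everything else stays on the CPU. This is reconstructed from memory of ggml's BLAS dispatch, not this repo's code; the exact names and thresholds may differ.

```c
// Hedged reconstruction (from memory of ggml's BLAS dispatch at the time;
// exact names and thresholds may differ) of the check that decides whether
// a mul_mat is worth handing to a BLAS backend such as CLBlast.
#include "ggml.h"
#include <stdbool.h>

static bool mul_mat_use_blas(const struct ggml_tensor * src0,
                             const struct ggml_tensor * src1,
                             const struct ggml_tensor * dst) {
    const int64_t ne10 = src1->ne[0];
    const int64_t ne0  = dst->ne[0];
    const int64_t ne1  = dst->ne[1];

    // Only large, contiguous matrices amortize the host<->device copies;
    // every other op stays on the CPU, which keeps GPU utilization low.
    return ggml_is_contiguous(src0) && ggml_is_contiguous(src1) &&
           ne0 >= 32 && ne1 >= 32 && ne10 >= 32;
}
```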

Happenedtostumblein commented 1 year ago

@DaniAndTheWeb Thanks for pointing that out… it was a typo; the CLBLAST flag is what I was referring to.

How difficult/time-intensive a task is incorporating OpenCL going to be? With that flag the GPU does get some kind of signal, because utilization increases.

Just wondering if it's a very involved process, or if we just need to copy/paste something from llama.cpp and/or ggml?

daniandtheweb commented 1 year ago

I'm no expert in OpenCL, but it will require some time; it's not just a copy/paste job. The good news is that, given the current RAM usage, the GPU acceleration will probably be one of the more memory-efficient ones around.

Happenedtostumblein commented 1 year ago

@DaniAndTheWeb Can you tell me, broadly speaking, what tasks need to be completed, like I'm 5?

Maybe CodeLlama can help me contribute a pull request to get it done, but I need a thread to grab onto. (Not sure if tagging is necessary; I'm new to GitHub.)

daniandtheweb commented 1 year ago

As I told you, I don't know a lot about how the OpenCL implementation works, but you would probably have to implement each compute kernel of the stock CPU code in OpenCL. You can take a look at llama.cpp's implementation, but you will need to make lots of tweaks to the code to get it working with this project.
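
To make "implement each compute kernel in OpenCL" concrete, here is a minimal, self-contained illustration (not this project's code; error checking omitted): an element-wise scale op executed on the GPU. Each ggml op that should run on the GPU needs a kernel plus host-side dispatch roughly like this, which is why the port is more than copy/paste.

```c
// Minimal illustration (not this project's code; error checking omitted):
// an element-wise scale op executed on the GPU via OpenCL.
#include <CL/cl.h>
#include <stdio.h>

static const char * kSrc =
    "__kernel void scale_f32(__global float * x, const float s) {\n"
    "    x[get_global_id(0)] *= s;\n"
    "}\n";

int main(void) {
    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    // Compile the kernel at runtime and bind its arguments.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale_f32", NULL);

    float data[4] = {1, 2, 3, 4};
    const float s = 0.5f;
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                sizeof(data), data, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(float), &s);

    // Launch one work-item per element, then copy the result back.
    size_t global = 4;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
    for (int i = 0; i < 4; i++) printf("%g\n", data[i]); // prints 0.5 1 1.5 2
    return 0;
}
```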

Happenedtostumblein commented 1 year ago

No problem, hold my beer.

<<only really knows Python>>

FNsi commented 1 year ago

Try this patch: https://github.com/ggerganov/llama.cpp/commit/6460f758dbd472653296044d36bed8c4554988f5

I can confirm it really works!

rayrayraykk commented 11 months ago

> [quotes the original post above]

Using OpenCL on Android, it actually gets slower. What device are you using?

[image]

superkuh commented 9 months ago

I applied the patch, added some #ifdef SD_USE_CLBLAST / #include "ggml-opencl.h" guards, edited the CMakeLists.txt with the CLBlast bits ported over from llama.cpp and renamed/re-pointed, then configured with cmake .. -DGGML_OPENBLAS=ON -DGGML_CLBLAST=ON. The compiled ./sd now recognizes my AMD RX 580 GPU and I get about a 30% speed-up. Not a huge increase, since that's the same number of CPU threads plus the GPU, but my GPU is pretty old too. And it does seem to take some load off the CPU, which is nice. Thanks!
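
For anyone following along, the guard pattern described above looks roughly like this. This is a sketch, not the actual diff: SD_USE_CLBLAST is the define superkuh mentions, ggml_cl_init is the setup hook the ported ggml-opencl.h exposes in llama.cpp of that era, and sd_init_backend is a hypothetical stand-in for wherever startup happens in this project.

```c
// Sketch of the guard pattern described above, mirroring llama.cpp's CLBlast
// gating; SD_USE_CLBLAST and ggml_cl_init come from the ported llama.cpp code.
#ifdef SD_USE_CLBLAST
#include "ggml-opencl.h"
#endif

static void sd_init_backend(void) {  // hypothetical startup hook
#ifdef SD_USE_CLBLAST
    ggml_cl_init(); // pick an OpenCL platform/device once, before any offload
#endif
    // ... the rest of startup runs on the CPU as before
}
```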