marella / ctransformers

Python bindings for Transformer models implemented in C/C++ using the GGML library.
MIT License

Few questions / issues #27

Closed: Mradr closed this issue 1 year ago

Mradr commented 1 year ago

1) I just wanted to ask if you are planning to add MPT GPU support as well sometime? I see it's supported for LLAMA models.
2) The real reason for the ticket: I am having trouble getting it to actually use the GPU. Sometimes it works and sometimes it doesn't. Not really sure how to explain it, but:
   A) Windows 11, 32 GB of RAM, Ryzen 5800X, RTX 2080, 13B-HyperMantis.ggmlv3.q5_1.bin, model type set to LLAMA, 12 threads (give or take what I set here, for testing reasons).
   B) I installed the model the other day. Testing on the CPU, I was getting results back in 20 to 25 seconds. I saw there was GPU support, so I uninstalled and reinstalled with CUDA support. I tested GPU offloading and it didn't seem to do much in my first round of testing; I had set it to 25 layers at the time. I didn't see any improvement in speed, but I could see that the GPU was being used, with higher memory usage and GPU utilization spiking, though never capping at max. I lowered the count to 15 layers and tested again. This time I was able to hit 5 to 10 seconds. I went crazy and tested it as much as I could, getting really good results. Today I rebooted my machine and it's acting like it did the other day at 25 layers. I tried lowering it from 15 to 10 or below, but it doesn't seem to be using the GPU: it "acts" like it's setting up GPU usage, as I can see the memory and a flicker of usage, but it never fully tops out.

I could be totally using it wrong, but the fact that it was working the other day and stopped today tells me something changed on my computer, though I honestly couldn't tell you what. I didn't perform any updates, and it's also weird that it didn't work at first and then worked all at once. Not sure if there is some type of supporting module it needs or not. CUDA is supported according to a torch check. Any help or information is welcome :) I understand this is not a common issue. Any places I can check or get values from, to see if it's really working, would be great. It just seems odd.
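For reference, the kind of check mentioned above might look like this (a minimal sketch assuming torch is installed; it only confirms that the CUDA driver/runtime is visible, not that ctransformers itself is offloading):

```python
import torch

# Confirms the CUDA driver/runtime is visible to Python; it does not
# prove that ctransformers is actually offloading layers to the GPU.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the RTX 2080
```

Watching `nvidia-smi -l 1` in a separate terminal while generating is another way to see whether VRAM is allocated and the GPU is actually doing work.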

marella commented 1 year ago
  1. It depends on the ggml/llama.cpp projects. There were some discussions in ggml (e.g. https://github.com/ggerganov/ggml/pull/145#issuecomment-1544733902) about adding llama.cpp features to other models, but I'm not sure when it will be done. Recently, GPU support was added for Falcon models in ctransformers 0.2.10 using a new fork of llama.cpp: cmp-nct/ggllm.cpp
  2. The Ryzen 5800X has only 8 cores, so you should use 7 or 8 threads. Using more threads than available cores might slow it down. Try to run as many layers as you can on the GPU until you get an out-of-memory error. Also try the q4_0.bin file, as it might fit entirely in your GPU memory. If the entire model fits on the GPU, you should set threads=1 to avoid the overhead of creating CPU threads. When comparing speed, use a fixed seed (e.g. llm(..., seed=100)) so that it generates the same amount of text every time; see the sketch after this list. Some other processes running on your system might also be causing the performance to vary by consuming more or less memory.
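Putting those suggestions together, a minimal sketch might look like the following (the model path and layer count are placeholders to adjust for your setup):

```python
from ctransformers import AutoModelForCausalLM

# Load a local GGML model, offloading layers to the GPU.
llm = AutoModelForCausalLM.from_pretrained(
    "./13B-HyperMantis.ggmlv3.q4_0.bin",  # placeholder path
    model_type="llama",
    gpu_layers=25,  # raise until you hit out-of-memory, then back off
    threads=8,      # match physical cores; use 1 if the whole model fits on GPU
)

# A fixed seed makes repeated runs generate the same text,
# so timings are comparable across configurations.
print(llm("AI is going to", seed=100, max_new_tokens=64))
```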
Mradr commented 1 year ago

Thanks! The fix seems to be increasing batch_size from the default to 256. I was then able to set threads=1 and get more performance. (Between 7-8 and 10-12 threads, having more threads seems to improve performance rather than degrade it.) I'm also not getting an out-of-memory error; it just "ooms" quietly, or rather stops trying to use the GPU and starts slowing down, after a bit more testing and playing with the numbers. In that case it still takes up the VRAM, but it seems like it just doesn't do anything with the GPU at that point.
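For reference, the configuration described here would look something like this (values taken from this thread; the path is a placeholder):

```python
from ctransformers import AutoModelForCausalLM

# Raising batch_size from the default to 256 is what resolved the slowdown
# here; threads=1 once the GPU is doing the bulk of the work.
llm = AutoModelForCausalLM.from_pretrained(
    "./13B-HyperMantis.ggmlv3.q5_1.bin",  # placeholder path
    model_type="llama",
    gpu_layers=25,
    threads=1,
    batch_size=256,
)
```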