llmware-ai / llmware

Unified framework for building enterprise RAG pipelines with small, specialized models
https://llmware-ai.github.io/llmware/
Apache License 2.0

GGUF models not utilising GPU on Windows #511

Closed by shneeba 6 months ago

shneeba commented 6 months ago

As discussed in this issue, it appears the GGUF models are not utilising the GPU.

Environment setup:

GPU

RTX 3090 Ti
Driver Version (Nvidia Control Panel) - 551.76
Driver Version (Device Manager) - 31.0.15.5176

CUDA version:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0

OS

Windows 10 Pro
Version - 22H2
Build version - 19045.4046
Windows Feature Experience Pack - 1000.19053.1000.0
Python - 3.11.0

I was actually looking through the improving gguf cuda exception handling pull request and noticed the mention of using nvidia-smi to get the version info. I nearly went down a rabbit hole thinking I had some old 12.4 versions lying around, but the CUDA version reported by nvidia-smi is the highest version the installed driver supports, not the version of the toolkit that is installed:

Wed Mar 13 19:47:55 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 551.76                 Driver Version: 551.76         CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                     TCC/WDDM  | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3090 Ti   WDDM  |   00000000:0A:00.0  On |                  Off |
|  0%   45C    P0            100W /  450W |    2434MiB /  24564MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

I then noticed:

        #   (4) cuda found via:  torch.cuda.is_available() -> note: torch not used in GGUF processing, but is used
        #       here as a safety check to confirm that CUDA is found and linked correctly to Python - over time,
        #       we may identify a better way to make this check outside of Torch

This was the smoking gun I needed. torch.cuda.is_available() was returning False, so I needed to run the following command (found here):

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
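
After the reinstall, a quick sanity check along these lines (a minimal sketch, not from the original thread) confirms the interpreter is linked against a CUDA-enabled torch build:

import torch

# False here means the CPU-only torch wheel is installed, and llmware's GGUF
# loading will fall back to CPU.
print("cuda available:", torch.cuda.is_available())

if torch.cuda.is_available():
    # CUDA runtime version the torch wheel was built against, e.g. "12.1"
    print("torch built with CUDA:", torch.version.cuda)
    # Name of the first visible GPU, e.g. "NVIDIA GeForce RTX 3090 Ti"
    print("device 0:", torch.cuda.get_device_name(0))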

As soon as I did this reinstall, I was full steam ahead:

(.venv) PS C:\Users\USER\Documents\projects\MYPROJECT> python ..\llmware\examples\SLIM-Agents\agent-multistep-analysis.py
update: Starting example - agent-multistep-analysis
update: Loading sample files
update: attempting to create source input folder at path:  C:\Users\USER\llmware_data\tmp\example_microsoft
update: creating library and parsing source document
update: executing query to filter to key passages - ibm - results found - 20
update: Launching LLMfx process
step -  1 -     creating object - ready to start processing.
step -  2 -     loading tool - sentiment
update: confirmed CUDA drivers recognized- will load model on CUDA - 12.1
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
step -  3 -     loading tool - emotions
update: confirmed CUDA drivers recognized- will load model on CUDA - 12.1
step -  4 -     loading tool - tags
update: confirmed CUDA drivers recognized- will load model on CUDA - 12.1
step -  5 -     loading tool - ner
update: confirmed CUDA drivers recognized- will load model on CUDA - 12.1
step -  6 -     loading tool - answer
update: confirmed CUDA drivers recognized- will load model on CUDA - 12.1
step -  7 -     loading new processing text - 20 new entries
step -  8 -     executing function call - deploying - sentiment 
step -  9 -     executing function call - getting response - sentiment
                                 -- llm_response - {'sentiment': ['positive']}

You weren't lying about that speed difference; it's seriously quick!

It may be worth having something in the README about installing torch via this method for other Windows users.

Thanks again for your help @doberst. Very much appreciated!

I'm happy for you to close this issue straight away (user error after all) but thought you may like to know the root cause.

doberst commented 6 months ago

@shneeba - thrilled that it is up and running (and FAST!) ... I had to use the exact same line to install torch - pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121 - and will add it to the README ... (BTW, if you know a better way to validate that the local Python interpreter is connected to the CUDA drivers, we can switch from the pytorch check - but that was the best way that I could find ...)

shneeba commented 6 months ago

Thanks for getting that added in! I think using the pytorch check for now is a good enough validation but I'll have a think. I'll close this issue now.
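
For what it's worth, one torch-free direction could be to probe the CUDA driver directly through ctypes using cuInit / cuDeviceGetCount from the driver API - just a rough sketch of the idea, not something I've tested with llmware:

import ctypes
import sys

# The CUDA driver library ships with the GPU driver itself:
# nvcuda.dll on Windows, libcuda.so.1 on Linux.
lib_name = "nvcuda.dll" if sys.platform == "win32" else "libcuda.so.1"

try:
    cuda = ctypes.CDLL(lib_name)
except OSError:
    print("CUDA driver library not found - no usable GPU driver")
else:
    # cuInit(0) returns 0 (CUDA_SUCCESS) when the driver initialises correctly.
    if cuda.cuInit(0) == 0:
        count = ctypes.c_int(0)
        cuda.cuDeviceGetCount(ctypes.byref(count))
        print(f"CUDA driver OK - {count.value} device(s) visible")
    else:
        print("CUDA driver present but failed to initialise")

This only proves the driver is reachable from Python, though - it wouldn't catch a CPU-only torch install, so the existing check may still be the more useful signal for llmware.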