snexus / llm-search

Querying local documents, powered by LLM
MIT License

GPU not being used during inference #112

Open StevenK-vz opened 3 months ago

StevenK-vz commented 3 months ago

Version affected: current version v0.7.1 (main)

I initially assumed the issue was with my system (outdated NVIDIA drivers, CUDA, etc.), but after trying on 4 separate machines running different mixes of Debian 11 and 12 and Ubuntu 20.04 and 22.04, I haven't been able to get it working properly.

The GPU is being used during indexing but not during interaction/inference. The GPU is used fine by other projects like https://github.com/oobabooga/text-generation-webui, YOLO, CVAT, etc.

I've tried different combinations of Python 3.10 and 3.11 using venv and conda environments. I've installed PyTorch manually. I've tried several CUDA versions from 11.8 to 12.4.

Older installs of llmsearch work fine and use the GPU; I believe my first install was v0.4.x. So it seems the issue may be a specific environment variable I've missed, or possibly the pyllmsearch package itself.

I've installed llmsearch via pip and built it from the GitHub repo; there was no difference. I'm not seeing any warnings or notices that might indicate an issue with my machines' configuration.

To reiterate, the 4 machines I've used all have working CUDA installs and are able to use the GPU when running projects like YOLO, CVAT, text-generation-webui, etc.
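
One quick way to narrow this down is to check, from the same virtual environment, whether PyTorch sees the GPUs and whether llama-cpp-python itself was compiled with GPU offload support. A minimal sketch (llama_supports_gpu_offload is only exposed by reasonably recent llama-cpp-python builds, so treat that check as an assumption):

  # Does PyTorch see the GPUs from this venv?
  python -c "import torch; print('CUDA available:', torch.cuda.is_available(), '| devices:', torch.cuda.device_count())"
  # Was llama-cpp-python built with GPU (CUDA) offload support?
  python -c "from llama_cpp import llama_supports_gpu_offload; print('GPU offload built in:', llama_supports_gpu_offload())"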

Here is the output of one of the interaction sessions:

Collecting usage statistics. To deactivate, set browser.gatherUsageStats to false.

  You can now view your Streamlit app in your browser.

  Local URL: http://localhost:8501
  Network URL: http://10.2.18.6:8501
  External URL: http://208.249.143.90:8501

2024-07-03 08:59:02.691 | DEBUG    | __main__:<module>:243 - CONFIG FILE: configs/test.yaml
2024-07-03 08:59:04.574 | INFO     | __main__:reload_model:192 - Clearing state and re-loading model...
2024-07-03 08:59:05.000 | DEBUG    | __main__:reload_model:195 - Reload model got DOC CONFIG FILE NAME: configs/test.yaml
2024-07-03 08:59:05.000 | DEBUG    | __main__:reload_model:196 - Reload model got MODEL CONFIG FILE NAME: phi-3-mini-q4.yaml
2024-07-03 08:59:05.003 | INFO     | __main__:load_yaml_file:129 - Loading doc config from a file: configs/test.yaml
2024-07-03 08:59:05.011 | INFO     | __main__:load_yaml_file:129 - Loading doc config from a file: phi-3-mini-q4.yaml
2024-07-03 08:59:05.013 | INFO     | llmsearch.config:validate_params:283 - Loading model paramaters in configuration class LlamaModelConfig
2024-07-03 08:59:05.015 | INFO     | llmsearch.utils:set_cache_folder:47 - Setting SENTENCE_TRANSFORMERS_HOME folder: /home/steven/code/llmsearch-4/data/cache
2024-07-03 08:59:05.015 | INFO     | llmsearch.utils:set_cache_folder:50 - Setting TRANSFORMERS_CACHE folder: /home/steven/code/llmsearch-4/data/cache/transformers
2024-07-03 08:59:05.015 | INFO     | llmsearch.utils:set_cache_folder:51 - Setting HF_HOME: /home/steven/code/llmsearch-4/data/cache/hf_home
2024-07-03 08:59:05.015 | INFO     | llmsearch.utils:set_cache_folder:52 - Setting MODELS_CACHE_FOLDER: /home/steven/code/llmsearch-4/data/cache
2024-07-03 08:59:05.015 | INFO     | llmsearch.models.llama:model:131 - Loading model...
2024-07-03 08:59:05.015 | INFO     | llmsearch.models.llama:model:134 - Initializing LLAmaCPP model...
2024-07-03 08:59:05.015 | INFO     | llmsearch.models.llama:model:135 - {'n_ctx': 4196, 'n_batch': 4196, 'n_gpu_layers': 90}
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /data/language_models/airoboros-l2-13b-gpt4-1.4.1.Q4_K_M.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 5120
llama_model_loader: - kv   4:                          llama.block_count u32              = 40
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 13824
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 40
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 40
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 15
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type q4_K:  241 tensors
llama_model_loader: - type q6_K:   41 tensors
llm_load_vocab: special tokens cache size = 259
llm_load_vocab: token to piece cache size = 0.1684 MB
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 5120
llm_load_print_meta: n_head           = 40
llm_load_print_meta: n_head_kv        = 40
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 5120
llm_load_print_meta: n_embd_v_gqa     = 5120
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 13824
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 13B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 13.02 B
llm_load_print_meta: model size       = 7.33 GiB (4.83 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.18 MiB
llm_load_tensors:        CPU buffer size =  7500.85 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 4224
llama_new_context_with_model: n_batch    = 4196
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =  3300.00 MiB
llama_new_context_with_model: KV self size  = 3300.00 MiB, K (f16): 1650.00 MiB, V (f16): 1650.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.12 MiB
llama_new_context_with_model:        CPU compute buffer size =   378.26 MiB
llama_new_context_with_model: graph nodes  = 1286
llama_new_context_with_model: graph splits = 1
AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Model metadata: {'tokenizer.ggml.unknown_token_id': '0', 'tokenizer.ggml.eos_token_id': '2', 'general.architecture': 'llama', 'llama.context_length': '4096', 'general.name': 'LLaMA v2', 'llama.embedding_length': '5120', 'llama.feed_forward_length': '13824', 'llama.attention.layer_norm_rms_epsilon': '0.000010', 'llama.rope.dimension_count': '128', 'llama.attention.head_count': '40', 'tokenizer.ggml.bos_token_id': '1', 'llama.block_count': '40', 'llama.attention.head_count_kv': '40', 'general.quantization_version': '2', 'tokenizer.ggml.model': 'llama', 'general.file_type': '15'}
Using fallback chat format: llama-2

You can see that only the CPU is being used; there is no mention of GPU buffers or offloaded layers. Here is the nvidia-smi output while the app is running:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07              Driver Version: 550.90.07      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  Tesla T4                       Off |   00000000:12:00.0 Off |                    0 |
| N/A   33C    P0             27W /   70W |    1045MiB /  15360MiB |     32%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  Tesla T4                       Off |   00000000:37:00.0 Off |                    0 |
| N/A   30C    P8             10W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  Tesla T4                       Off |   00000000:86:00.0 Off |                    0 |
| N/A   31C    P8              9W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  Tesla T4                       Off |   00000000:AF:00.0 Off |                    0 |
| N/A   32C    P8              9W /   70W |       3MiB /  15360MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    198907      C   .../code/llmsearch-4/.venv/bin/python3       1130MiB |
+-----------------------------------------------------------------------------------------+

I am open to any ideas or suggestions, thanks!

StevenK-vz commented 3 months ago

The problem appears to be related to llama-cpp-python not being built with CUDA support.

Adding -DLLAMA_CUDA=ON to the CMAKE_ARGS in the setvars.sh file fixed the issue for me:

  export CMAKE_ARGS="-DCMAKE_CUDA_COMPILER=$(which nvcc) -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON"
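
If you'd rather not rerun the whole install script, rebuilding only llama-cpp-python with those flags should also work; roughly (the --force-reinstall and --no-cache-dir flags make pip recompile the wheel instead of reusing a cached CPU-only build):

  CMAKE_ARGS="-DCMAKE_CUDA_COMPILER=$(which nvcc) -DLLAMA_CUBLAS=ON -DLLAMA_CUDA=ON" \
    pip install --force-reinstall --no-cache-dir llama-cpp-python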

To help others experiencing the same issue this was my path to success:

  1. Create and activate a venv with Python 3.10: python3.10 -m venv .venv && source .venv/bin/activate (Python 3.11 doesn't work for some reason).
  2. Add -DLLAMA_CUDA=ON to the end of CMAKE_ARGS in setvars.sh (shown above).
  3. Run ./install_linux.sh (I removed --index-url https://download.pytorch.org/whl/cu118 so CUDA 12.x would be used).
  4. Run llmsearch as usual.

snexus commented 3 months ago

Thanks for taking the time to investigate it. I will update the setvars.sh and the documentation based on your findings.

You are right - llama.cpp is compiled during the installation and uses these flags to enable GPU support. Can you describe the issue with Python 3.11?

In general, the project will continue to support llama.cpp going forward, but I would encourage people to use external frameworks that support the OpenAI API for inference (e.g. LiteLLM + Ollama). That will allow me to focus on the core RAG functionality, which is the main purpose of this project.
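
For reference, anything that exposes an OpenAI-compatible endpoint works. For example, with Ollama serving locally on its default port you can hit its OpenAI-compatible route directly (sketch; "llama3" is just a placeholder for whatever model you have pulled):

  curl http://localhost:11434/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "llama3", "messages": [{"role": "user", "content": "Hello"}]}'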

StevenK-vz commented 2 months ago

There wasn't an error when using Python 3.11; it just didn't compile with CUDA support, apparently. I didn't dig into it - it could have been a faulty Python env or similar. I was just happy to finally have CUDA working.

Do you plan on adding example install & config instructions for using LiteLLM / Ollama? I usually just copy and paste from the documentation to get the project up and running as quickly as possible to kick the tires, etc.

snexus commented 2 months ago

The relevant config is here - https://llm-search.readthedocs.io/en/latest/configure_model.html#ollama-litellm

But I can add more elaborate instructions.

StevenK-vz commented 2 months ago

Oh nice, I skipped right over that. Thanks!
