umbertogriffo / rag-chatbot

RAG (Retrieval-augmented generation) ChatBot that provides answers based on contextual information extracted from a collection of Markdown files.
Apache License 2.0

`make setup_cuda` getting error #6

Closed: leowenlu closed this issue 2 months ago

leowenlu commented 3 months ago

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Python 3.10.14

Poetry (version 1.7.0)

I am getting the following errors when running make setup_cuda. Any clue?

  collect2: error: ld returned 1 exit status
  ninja: build stopped: subcommand failed.

  *** CMake build failed
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python

Failed to build llama-cpp-python
ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama-cpp-python)

umbertogriffo commented 3 months ago

Hi @leowenlu!

Unfortunately, I have never gotten this error so far, but we can try two things.

First, try pinning llama_cpp_python to 0.2.76 while keeping CMAKE_ARGS="-DGGML_CUDA=on", since the build failure might be specific to the latest version.

Also, some system packages may be required for the build process. On Ubuntu, for example, you might need to install the following:

sudo apt-get update
sudo apt-get install build-essential cmake libopenblas-dev
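
If you want to try the pinned build by hand outside the Makefile, a rough equivalent (assuming the Makefile ultimately drives a pip install of llama-cpp-python with that flag, which may not match the exact target) would be:

# hypothetical manual equivalent of make setup_cuda with the pinned version
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.76
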
leowenlu commented 3 months ago

Hi @umbertogriffo

Following your instructions, I pinned the llama_cpp version to 0.2.76 while keeping CMAKE_ARGS="-DGGML_CUDA=on":

llama_cpp_python==0.2.76
pyllamacpp==1.0.7

Both make setup_cuda and make update completed successfully.

But with streamlit run chatbot/chatbot_app.py -- --model llama-3 --max-new-tokens 1024, I am getting the following error:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291,

details:


llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from /data/leoprojects/github/rag-chatbot/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.58 GiB (4.89 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 ''
llm_load_print_meta: EOS token        = 128009 ''
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 ''
llm_load_tensors: ggml ctx size =    0.15 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
[134786771306176] 2024-08-02 10:05:33,861 - __main__ - ERROR - An error occurred: Failed to load model from file: /data/leoprojects/github/rag-chatbot/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
Traceback (most recent call last):
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 264, in _get_or_create_cached_value
    cached_result = cache.read_result(value_key)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 500, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 312, in _handle_cache_miss
    cached_result = cache.read_result(value_key)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 500, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 165, in <module>
    main(args)
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 86, in main
    llm = load_llm(client, model, model_folder)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 212, in wrapper
    return cached_func(*args, **kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 241, in __call__
    return self._get_or_create_cached_value(args, kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 267, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 321, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 25, in load_llm
    llm = get_client(llm_client, model_folder=model_folder, model_settings=model_settings)
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/client_settings.py", line 40, in get_client
    return client(**kwargs)
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/lama_cpp_client.py", line 16, in __init__
    super().__init__(model_folder, model_settings)
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/llm_client.py", line 50, in __init__
    self.llm = self._load_llm()
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/lama_cpp_client.py", line 19, in _load_llm
    llm = Llama(model_path=str(self.model_path), **self.model_settings.config)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/llama_cpp/llama.py", line 338, in __init__
    self._model = _LlamaModel(
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/llama_cpp/_internals.py", line 57, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: /data/leoprojects/github/rag-chatbot/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
Stack (most recent call last):
  File "/data/systems/miniconda3/envs/chat-box-poc/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/data/systems/miniconda3/envs/chat-box-poc/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/data/systems/miniconda3/envs/chat-box-poc/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 286, in _run_script_thread
    self._run_script(request.rerun_data)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 534, in _run_script
    exec(code, module.__dict__)
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 167, in <module>
    logger.error(f"An error occurred: {str(error)}", exc_info=True, stack_info=True)
leowenlu commented 3 months ago

By the way, streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 1024 worked as expected, so is it llama3 causing the issue?

umbertogriffo commented 3 months ago

Both make setup_cuda and make update completed successfully.

Why did you run make update? Can you try cleaning the environment by running make clean and then just make setup_cuda?
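
In other words, something like this (assuming the Makefile targets do what their names suggest):

# remove the current environment/build artifacts, then rebuild with CUDA enabled
make clean
make setup_cuda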

it worked as expected, so is it llama3 causing the issue?

As far as I remember, llama3 was working with that llama_cpp version. Let me try on my side.

BTW, it is still interesting that the installation fails with the latest llama_cpp version on your side.

umbertogriffo commented 3 months ago

By the way, streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 1024 worked as expected, so is it llama3 causing the issue?

Yeah, I confirm that llama_cpp_python==0.2.76 supports llama 3.1. I do think that running make update screwed up your environment.
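
A quick way to double-check which llama_cpp_python version actually ended up in the virtualenv (a generic check, not something the project ships):

# prints the installed llama-cpp-python version from inside the active environment
python -c "import llama_cpp; print(llama_cpp.__version__)"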

umbertogriffo commented 3 months ago

@leowenlu about the initial error you got using the newer llama_cpp version, it seems there is an open issue on the official repo. I decided to roll back to 0.2.76 until newer versions are more stable.

leowenlu commented 2 months ago

I have upgraded to llama_cpp_python==0.2.85, and it looks like I am able to get llama3 working now. Thanks for your help; looking forward to more great code and more releases from this project. Very well done.

bouajajais commented 2 months ago

@leowenlu how did you make it work? Did you upgrade with CMAKE_ARGS="-DGGML_CUDA=on", and are you using the GPU with llama_cpp_python==0.2.85 and llama3.1?

leowenlu commented 2 months ago

CMAKE_ARGS="-DGGML_CUDA=on" and llama_cpp_python==0.2.85 with llama3.1; it now looks like it is working. @bouajajais
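
For anyone else landing here, the reinstall roughly looks like this (a sketch of what is described above; the exact command or Makefile target in this repo may differ):

# upgrade llama-cpp-python to 0.2.85 with the CUDA backend enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python==0.2.85
# if the CUDA build is active, the model-load output should report layers being offloaded to the GPU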