umbertogriffo / rag-chatbot

RAG (Retrieval-augmented generation) ChatBot that provides answers based on contextual information extracted from a collection of Markdown files.
Apache License 2.0

`make setup_cuda` getting error #6

Closed: leowenlu closed this issue 2 months ago

leowenlu commented 3 months ago

nvcc --version

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Python 3.10.14

Poetry (version 1.7.0)

I am getting the following errors when running make setup_cuda. Any clue?

  collect2: error: ld returned 1 exit status
  ninja: build stopped: subcommand failed.

  *** CMake build failed
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python

Failed to build llama-cpp-python
ERROR: Failed to build installable wheels for some pyproject.toml based projects (llama-cpp-python)

umbertogriffo commented 3 months ago

Hi @leowenlu!

Unfortunately, I have never gotten this error so far, but we can try two things.

First, try pinning llama_cpp_python to 0.2.76 while keeping CMAKE_ARGS="-DGGML_CUDA=on", since the build failure might be specific to the latest version.

Also, some system packages may be required for the build process. On Ubuntu, for example, you might need to install the following:

sudo apt-get update
sudo apt-get install build-essential cmake libopenblas-dev
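
If you want to try the pinned build by hand outside the Makefile, a rough equivalent (assuming the Makefile ultimately drives a pip install of llama-cpp-python with that flag, which may not match the exact target) would be:

# hypothetical manual equivalent of make setup_cuda with the pinned version
CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir llama-cpp-python==0.2.76
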
leowenlu commented 3 months ago

Hi @umbertogriffo

Following your instructions, I pinned the llama_cpp version to 0.2.76 while keeping CMAKE_ARGS="-DGGML_CUDA=on":

llama_cpp_python==0.2.76
pyllamacpp==1.0.7

Both make setup_cuda and make update completed successfully.

But with streamlit run chatbot/chatbot_app.py -- --model llama-3 --max-new-tokens 1024, I am getting the following error:

llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291,

details:


llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from /data/leoprojects/github/rag-chatbot/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
llm_load_vocab: special tokens definition check successful ( 256/128256 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: n_ctx_train      = 131072
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 131072
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = Q4_K - Medium
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 4.58 GiB (4.89 BPW) 
llm_load_print_meta: general.name     = Meta Llama 3.1 8B Instruct
llm_load_print_meta: BOS token        = 128000 ''
llm_load_print_meta: EOS token        = 128009 ''
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOT token        = 128009 ''
llm_load_tensors: ggml ctx size =    0.15 MiB
llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 292, got 291
llama_load_model_from_file: failed to load model
[134786771306176] 2024-08-02 10:05:33,861 - __main__ - ERROR - An error occurred: Failed to load model from file: /data/leoprojects/github/rag-chatbot/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
Traceback (most recent call last):
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 264, in _get_or_create_cached_value
    cached_result = cache.read_result(value_key)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 500, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 312, in _handle_cache_miss
    cached_result = cache.read_result(value_key)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_resource_api.py", line 500, in read_result
    raise CacheKeyNotFoundError()
streamlit.runtime.caching.cache_errors.CacheKeyNotFoundError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 165, in <module>
    main(args)
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 86, in main
    llm = load_llm(client, model, model_folder)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 212, in wrapper
    return cached_func(*args, **kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 241, in __call__
    return self._get_or_create_cached_value(args, kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 267, in _get_or_create_cached_value
    return self._handle_cache_miss(cache, value_key, func_args, func_kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/caching/cache_utils.py", line 321, in _handle_cache_miss
    computed_value = self._info.func(*func_args, **func_kwargs)
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 25, in load_llm
    llm = get_client(llm_client, model_folder=model_folder, model_settings=model_settings)
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/client_settings.py", line 40, in get_client
    return client(**kwargs)
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/lama_cpp_client.py", line 16, in __init__
    super().__init__(model_folder, model_settings)
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/llm_client.py", line 50, in __init__
    self.llm = self._load_llm()
  File "/data/leoprojects/github/rag-chatbot/chatbot/bot/client/lama_cpp_client.py", line 19, in _load_llm
    llm = Llama(model_path=str(self.model_path), **self.model_settings.config)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/llama_cpp/llama.py", line 338, in __init__
    self._model = _LlamaModel(
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/llama_cpp/_internals.py", line 57, in __init__
    raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: /data/leoprojects/github/rag-chatbot/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf
Stack (most recent call last):
  File "/data/systems/miniconda3/envs/chat-box-poc/lib/python3.10/threading.py", line 973, in _bootstrap
    self._bootstrap_inner()
  File "/data/systems/miniconda3/envs/chat-box-poc/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
    self.run()
  File "/data/systems/miniconda3/envs/chat-box-poc/lib/python3.10/threading.py", line 953, in run
    self._target(*self._args, **self._kwargs)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 286, in _run_script_thread
    self._run_script(request.rerun_data)
  File "/data/leoprojects/github/rag-chatbot/.venv/lib/python3.10/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 534, in _run_script
    exec(code, module.__dict__)
  File "/data/leoprojects/github/rag-chatbot/chatbot/chatbot_app.py", line 167, in <module>
    logger.error(f"An error occurred: {str(error)}", exc_info=True, stack_info=True)
leowenlu commented 3 months ago

By the way, streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 1024 worked as expected, so is it llama3 causing the issue?

umbertogriffo commented 3 months ago

Both make setup_cuda and make update completed successfully.

Why did you run make update? Can you try cleaning the environment by running make clean and then just make setup_cuda?
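
In other words, something like this (assuming the Makefile targets do what their names suggest):

# remove the current environment/build artifacts, then rebuild with CUDA enabled
make clean
make setup_cuda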

it worked as expected, so is it llama3 causing the issue?

As far as I remember, llama3 was working with that llama_cpp version. Let me try on my side.

BTW, it is still interesting that the installation fails with the latest llama_cpp version on your side.

umbertogriffo commented 3 months ago

By the way, streamlit run chatbot/chatbot_app.py -- --model openchat-3.6 --max-new-tokens 1024 worked as expected, so is it llama3 causing the issue?

Yeah, I confirm that llama_cpp_python==0.2.76 supports llama 3.1. I do think that running make update screwed up your environment.
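
A quick way to double-check which llama_cpp_python version actually ended up in the virtualenv (a generic check, not something the project ships):

# prints the installed llama-cpp-python version from inside the active environment
python -c "import llama_cpp; print(llama_cpp.__version__)"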

umbertogriffo commented 3 months ago

@leowenlu about the initial error you got using the newer llama_cpp version, it seems there is an open issue on the official repo. I decided to roll back to 0.2.76 until newer versions are more stable.

leowenlu commented 2 months ago

I have upgraded to llama_cpp_python==0.2.85, and it looks like I am able to get llama3 working now. Thanks for your help; looking forward to more great code and more releases from this project. Very well done.

bouajajais commented 2 months ago

@leowenlu how did you make it work? Did you upgrade with CMAKE_ARGS="-DGGML_CUDA=on", and are you using the GPU with llama_cpp_python==0.2.85 and llama3.1?

leowenlu commented 2 months ago

CMAKE_ARGS="-DGGML_CUDA=on" and llama_cpp_python==0.2.85 with llama3.1; it now looks like it is working. @bouajajais
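
For anyone else landing here, the reinstall roughly looks like this (a sketch of what is described above; the exact command or Makefile target in this repo may differ):

# upgrade llama-cpp-python to 0.2.85 with the CUDA backend enabled
CMAKE_ARGS="-DGGML_CUDA=on" pip install --upgrade --force-reinstall --no-cache-dir llama-cpp-python==0.2.85
# if the CUDA build is active, the model-load output should report layers being offloaded to the GPU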