h2oai / h2ogpt

Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://gpt-docs.h2o.ai/
http://h2o.ai
Apache License 2.0

Linux install of h2ogpt -- corrections required in install instructions #1628

Open harnalashok opened 5 months ago

harnalashok commented 5 months ago

The h2ogpt Linux installation method, as given here, is as follows:

A. Variable export instructions:

export PIP_EXTRA_INDEX_URL="https://download.pytorch.org/whl/cu118 https://huggingface.github.io/autogptq-index/whl/cu118"
export LLAMA_CUBLAS=1
export CMAKE_ARGS="-DLLAMA_CUBLAS=on -DCMAKE_CUDA_ARCHITECTURES=all"
export FORCE_CMAKE=1

B. Then, one is required to run the following seven instructions (the numbers are mine):

1. git clone https://github.com/h2oai/h2ogpt.git
2. cd h2ogpt
3. pip install -r requirements.txt
4. pip install -r reqs_optional/requirements_optional_langchain.txt
5. pip uninstall llama_cpp_python llama_cpp_python_cuda -y
6. pip install -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt --no-cache-dir
7. pip install -r reqs_optional/requirements_optional_langchain.urls.txt

Executing instructions 6 and 7 results in the following error:

-- Configuring incomplete, errors occurred!

*** CMake configuration failed
[end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects

However, this error is avoided if, after executing instructions 1 to 4, the terminal is closed and then re-opened before instructions 6 and 7 are executed. In other words, the variable exports issued earlier appear to be what causes the error in instructions 6 and 7.
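If the exports are indeed the trigger, a possible workaround (a rough sketch only, not verified here; the maintainer's reply below suggests the real cause is a missing CUDA toolchain) would be to clear them in the same shell instead of closing the terminal:

```
# Hypothetical workaround: remove the variables exported in paragraph A
# from the current shell before retrying instruction 6.
unset PIP_EXTRA_INDEX_URL LLAMA_CUBLAS CMAKE_ARGS FORCE_CMAKE

# Then retry the failing step.
pip install -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt --no-cache-dir
```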

This may please be rechecked at your end and the installation document corrected accordingly. Ashok Kumar Harnal

pseudotensor commented 5 months ago

The terminal open-close can't matter. Probably it's not compiling the CUDA version and you are only getting the CPU version. Can you give an expanded full version of your error from steps 6-7?
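A quick way to confirm whether the CUDA toolchain is visible to the build at all (a sketch, assuming a standard CUDA toolkit install) is:

```
# Check that the CUDA compiler and environment are visible to the build.
which nvcc          # should print a path such as /usr/local/cuda/bin/nvcc
nvcc --version      # should report the installed CUDA toolkit version
echo "$CUDA_HOME"   # should point at the CUDA install directory, if set
```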

harnalashok commented 5 months ago

I have repeated the experiment three times. The behavior is the same as narrated before: the variables exported in paragraph A above need to be unset before I execute instruction 6. Here is the complete trace of execution if I do not close the terminal but continue to work in the same one. (If I close and re-open the terminal, the process succeeds; those results are also given below.)

`(base) ashok@ashok:~/h2ogpt$ pip install -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt --no-cache-dir Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu121, https://huggingface.github.io/autogptq-index/whl/cu121 Collecting gpt4all==1.0.5 (from -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) Downloading gpt4all-1.0.5-py3-none-manylinux1_x86_64.whl.metadata (912 bytes) Collecting llama-cpp-python==0.2.56 (from -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) Downloading llama_cpp_python-0.2.56.tar.gz (36.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.9/36.9 MB 28.8 MB/s eta 0:00:00 Installing build dependencies ... done Getting requirements to build wheel ... done Installing backend dependencies ... done Preparing metadata (pyproject.toml) ... done Requirement already satisfied: requests in /home/ashok/anaconda3/lib/python3.11/site-packages (from gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (2.31.0) Requirement already satisfied: tqdm in /home/ashok/anaconda3/lib/python3.11/site-packages (from gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (4.66.4) Requirement already satisfied: typing-extensions>=4.5.0 in /home/ashok/anaconda3/lib/python3.11/site-packages (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (4.9.0) Requirement already satisfied: numpy>=1.20.0 in /home/ashok/anaconda3/lib/python3.11/site-packages (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (1.26.4) Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB) Requirement already satisfied: jinja2>=2.11.3 in /home/ashok/anaconda3/lib/python3.11/site-packages (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (3.1.3) Requirement already satisfied: MarkupSafe>=2.0 in /home/ashok/anaconda3/lib/python3.11/site-packages (from jinja2>=2.11.3->llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (2.1.3) Requirement already satisfied: charset-normalizer<4,>=2 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (2.0.7) Requirement already satisfied: certifi>=2017.4.17 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (2024.2.2) Downloading gpt4all-1.0.5-py3-none-manylinux1_x86_64.whl (3.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 9.1 MB/s eta 0:00:00 Downloading diskcache-5.6.3-py3-none-any.whl (45 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 34.3 MB/s eta 0:00:00 Building wheels for collected packages: llama-cpp-python Building wheel for llama-cpp-python (pyproject.toml) ... 
error error: subprocess-exited-with-error

× Building wheel for llama-cpp-python (pyproject.toml) did not run successfully. │ exit code: 1 ╰─> [45 lines of output] scikit-build-core 0.9.4 using CMake 3.29.3 (wheel) Configuring CMake... 2024-05-22 09:52:26,643 - scikit_build_core - WARNING - Can't find a Python library, got libdir=/home/ashok/anaconda3/lib, ldlibrary=libpython3.11.a, multiarch=x86_64-linux-gnu, masd=None loading initial cache file /tmp/tmpnyigg61a/build/CMakeInit.txt -- The C compiler identification is GNU 11.4.0 -- The CXX compiler identification is GNU 11.4.0 -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Check for working C compiler: /usr/bin/cc - skipped -- Detecting C compile features -- Detecting C compile features - done -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Check for working CXX compiler: /usr/bin/c++ - skipped -- Detecting CXX compile features -- Detecting CXX compile features - done -- Found Git: /usr/bin/git (found version "2.34.1") -- Performing Test CMAKE_HAVE_LIBC_PTHREAD -- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success -- Found Threads: TRUE -- Could not find nvcc, please set CUDAToolkit_ROOT. CMake Warning at vendor/llama.cpp/CMakeLists.txt:407 (message): cuBLAS not found

  -- CUDA host compiler is GNU
  CMake Error at vendor/llama.cpp/CMakeLists.txt:835 (get_flags):
    get_flags Function invoked with incorrect arguments for function named:
    get_flags

  -- Warning: ccache not found - consider installing it for faster compilation or disable this warning with LLAMA_CCACHE=OFF
  -- CMAKE_SYSTEM_PROCESSOR: x86_64
  -- x86 detected
  CMake Warning (dev) at CMakeLists.txt:21 (install):
    Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
  This warning is for project developers.  Use -Wno-dev to suppress it.

  CMake Warning (dev) at CMakeLists.txt:30 (install):
    Target llama has PUBLIC_HEADER files but no PUBLIC_HEADER DESTINATION.
  This warning is for project developers.  Use -Wno-dev to suppress it.

  -- Configuring incomplete, errors occurred!

  *** CMake configuration failed
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for llama-cpp-python
Failed to build llama-cpp-python
ERROR: Could not build wheels for llama-cpp-python, which is required to install pyproject.toml-based projects
`

Here is what happens if I execute instruction 6 after closing and re-opening the terminal. No error:

pip install -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt --no-cache-dir Collecting gpt4all==1.0.5 (from -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) Downloading gpt4all-1.0.5-py3-none-manylinux1_x86_64.whl.metadata (912 bytes) Collecting llama-cpp-python==0.2.56 (from -r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) Downloading llama_cpp_python-0.2.56.tar.gz (36.9 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 36.9/36.9 MB 23.2 MB/s eta 0:00:00 Installing build dependencies ... done Getting requirements to build wheel ... done Installing backend dependencies ... done Preparing metadata (pyproject.toml) ... done Requirement already satisfied: requests in /home/ashok/anaconda3/lib/python3.11/site-packages (from gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (2.31.0) Requirement already satisfied: tqdm in /home/ashok/anaconda3/lib/python3.11/site-packages (from gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (4.66.4) Requirement already satisfied: typing-extensions>=4.5.0 in /home/ashok/anaconda3/lib/python3.11/site-packages (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (4.9.0) Requirement already satisfied: numpy>=1.20.0 in /home/ashok/anaconda3/lib/python3.11/site-packages (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (1.26.4) Collecting diskcache>=5.6.1 (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB) Requirement already satisfied: jinja2>=2.11.3 in /home/ashok/anaconda3/lib/python3.11/site-packages (from llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (3.1.3) Requirement already satisfied: MarkupSafe>=2.0 in /home/ashok/anaconda3/lib/python3.11/site-packages (from jinja2>=2.11.3->llama-cpp-python==0.2.56->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 4)) (2.1.3) Requirement already satisfied: charset-normalizer<4,>=2 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (3.3.2) Requirement already satisfied: idna<4,>=2.5 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (3.4) Requirement already satisfied: urllib3<3,>=1.21.1 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (2.0.7) Requirement already satisfied: certifi>=2017.4.17 in /home/ashok/anaconda3/lib/python3.11/site-packages (from requests->gpt4all==1.0.5->-r reqs_optional/requirements_optional_llamacpp_gpt4all.txt (line 1)) (2024.2.2) Downloading gpt4all-1.0.5-py3-none-manylinux1_x86_64.whl (3.6 MB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 25.7 MB/s eta 0:00:00 Downloading diskcache-5.6.3-py3-none-any.whl (45 kB) ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 45.5/45.5 kB 31.4 MB/s eta 0:00:00 Building wheels for collected packages: llama-cpp-python Building wheel for llama-cpp-python (pyproject.toml) ... 
done Created wheel for llama-cpp-python: filename=llama_cpp_python-0.2.56-cp311-cp311-linux_x86_64.whl size=2827201 sha256=07293d75ff82ed6104572cae4fae96fc4fbb0f896b05211463ffd296aab81204 Stored in directory: /tmp/pip-ephem-wheel-cache-dxt7ajop/wheels/f5/48/62/014b1a3c38f77df21219f81ed63ca4c09531d52a205b15d8e4 Successfully built llama-cpp-python Installing collected packages: diskcache, llama-cpp-python, gpt4all Successfully installed diskcache-5.6.3 gpt4all-1.0.5 llama-cpp-python-0.2.56

pseudotensor commented 5 months ago

I see the "Could not find nvcc, please set CUDAToolkit_ROOT" and "cuBLAS not found" messages; that means something is wrong with the CUDA installation.

Try installing CUDA 12.1 again and ensure CUDA_HOME is set, etc.
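For reference, a minimal sketch of the environment setup once the CUDA 12.1 toolkit is installed (assuming the default install location /usr/local/cuda-12.1; adjust the path to your installation):

```
# Point the build tools at the CUDA 12.1 toolkit (assumed install path).
export CUDA_HOME=/usr/local/cuda-12.1
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:$LD_LIBRARY_PATH"

# Verify nvcc is found before rebuilding llama-cpp-python.
nvcc --version
```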

harnalashok commented 5 months ago

Installing CUDA 12.1 solves the issue. But then I face another error when I execute python generate.py. Here is the complete trace. Kindly help:

`
python generate.py --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF --prompt_type=mistral --max_seq_len=4096 /home/ashok/anaconda3/lib/python3.11/site-packages/pydub/utils.py:170: RuntimeWarning: Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work warn("Couldn't find ffmpeg or avconv - defaulting to ffmpeg, but may not work", RuntimeWarning) soundfile, librosa, and wavio not installed, disabling STT soundfile, librosa, and wavio not installed, disabling TTS Using Model llama load INSTRUCTOR_Transformer max_seq_length 512 Must install DocTR and LangChain installed if enabled DocTR, disabling Starting get_model: llama Failed to listen to n_gpus: No module named 'llama_cpp_cuda', trying llama_cpp module ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes ggml_init_cublas: found 1 CUDA devices: Device 0: NVIDIA GeForce GTX 1060 with Max-Q Design, compute capability 6.1, VMM: yes llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2 llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 17 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q5_K: 193 tensors llama_model_loader: - type q6_K: 33 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 4.78 GiB (5.67 BPW) llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4807.05 MiB on device 0: cudaMalloc failed: out of memory llama_model_load: error loading model: failed to allocate buffer llama_load_model_from_file: failed to load model Starting get_model: llama Failed to listen to n_gpus: No module named 'llama_cpp_cuda', trying llama_cpp module llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2 llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 17 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... 
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q5_K: 193 tensors llama_model_loader: - type q6_K: 33 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 4.78 GiB (5.67 BPW) llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4807.05 MiB on device 0: cudaMalloc failed: out of memory llama_model_load: error loading model: failed to allocate buffer llama_load_model_from_file: failed to load model Starting get_model: llama Failed to listen to n_gpus: No module named 'llama_cpp_cuda', trying llama_cpp module llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. 
llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2 llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 17 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q5_K: 193 tensors llama_model_loader: - type q6_K: 33 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). 
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 4.78 GiB (5.67 BPW) llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4807.05 MiB on device 0: cudaMalloc failed: out of memory llama_model_load: error loading model: failed to allocate buffer llama_load_model_from_file: failed to load model Starting get_model: llama Failed to listen to n_gpus: No module named 'llama_cpp_cuda', trying llama_cpp module llama_model_loader: loaded meta data with 24 key-value pairs and 291 tensors from llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = llama llama_model_loader: - kv 1: general.name str = mistralai_mistral-7b-instruct-v0.2 llama_model_loader: - kv 2: llama.context_length u32 = 32768 llama_model_loader: - kv 3: llama.embedding_length u32 = 4096 llama_model_loader: - kv 4: llama.block_count u32 = 32 llama_model_loader: - kv 5: llama.feed_forward_length u32 = 14336 llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128 llama_model_loader: - kv 7: llama.attention.head_count u32 = 32 llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 8 llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010 llama_model_loader: - kv 10: llama.rope.freq_base f32 = 1000000.000000 llama_model_loader: - kv 11: general.file_type u32 = 17 llama_model_loader: - kv 12: tokenizer.ggml.model str = llama llama_model_loader: - kv 13: tokenizer.ggml.tokens arr[str,32000] = ["", "", "", "<0x00>", "<... llama_model_loader: - kv 14: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000... 
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ... llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32 = 1 llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32 = 2 llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 0 llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool = true llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool = false llama_model_loader: - kv 22: tokenizer.chat_template str = {{ bos_token }}{% for message in mess... llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 65 tensors llama_model_loader: - type q5_K: 193 tensors llama_model_loader: - type q6_K: 33 tensors llm_load_vocab: special tokens definition check successful ( 259/32000 ). llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = llama llm_load_print_meta: vocab type = SPM llm_load_print_meta: n_vocab = 32000 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 32768 llm_load_print_meta: n_embd = 4096 llm_load_print_meta: n_head = 32 llm_load_print_meta: n_head_kv = 8 llm_load_print_meta: n_layer = 32 llm_load_print_meta: n_rot = 128 llm_load_print_meta: n_embd_head_k = 128 llm_load_print_meta: n_embd_head_v = 128 llm_load_print_meta: n_gqa = 4 llm_load_print_meta: n_embd_k_gqa = 1024 llm_load_print_meta: n_embd_v_gqa = 1024 llm_load_print_meta: f_norm_eps = 0.0e+00 llm_load_print_meta: f_norm_rms_eps = 1.0e-05 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: n_ff = 14336 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: pooling type = 0 llm_load_print_meta: rope type = 0 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 1000000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 32768 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 7B llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 7.24 B llm_load_print_meta: model size = 4.78 GiB (5.67 BPW) llm_load_print_meta: general.name = mistralai_mistral-7b-instruct-v0.2 llm_load_print_meta: BOS token = 1 '' llm_load_print_meta: EOS token = 2 '' llm_load_print_meta: UNK token = 0 '' llm_load_print_meta: PAD token = 0 '' llm_load_print_meta: LF token = 13 '<0x0A>' llm_load_tensors: ggml ctx size = 0.22 MiB ggml_backend_cuda_buffer_type_alloc_buffer: allocating 4807.05 MiB on device 0: cudaMalloc failed: out of memory llama_model_load: error loading model: failed to allocate buffer llama_load_model_from_file: failed to load model Traceback (most recent call last): File "/home/ashok/h2ogpt/generate.py", line 20, in entrypoint_main() File "/home/ashok/h2ogpt/generate.py", line 16, in entrypoint_main H2O_Fire(main) File "/home/ashok/h2ogpt/src/utils.py", line 73, in H2O_Fire fire.Fire(component=component, command=args) File "/home/ashok/anaconda3/lib/python3.11/site-packages/fire/core.py", line 141, in Fire component_trace = _Fire(component, args, parsed_flag_args, context, name) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ashok/anaconda3/lib/python3.11/site-packages/fire/core.py", line 475, in _Fire component, 
remaining_args = _CallAndUpdateTrace( ^^^^^^^^^^^^^^^^^^^^ File "/home/ashok/anaconda3/lib/python3.11/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace component = fn(*varargs, kwargs) ^^^^^^^^^^^^^^^^^^^^^^ File "/home/ashok/h2ogpt/src/gen.py", line 2293, in main model0, tokenizer0, device = get_model_retry(reward_type=False, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ashok/h2ogpt/src/gen.py", line 2652, in get_model_retry model1, tokenizer1, device1 = get_model(kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/ashok/h2ogpt/src/gen.py", line 3303, in get_model model, tokenizer_llamacpp, device = get_model_tokenizer_gpt4all(base_model, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ashok/h2ogpt/src/gpt4all_llm.py", line 34, in get_model_tokenizer_gpt4all model, tokenizer, redo, max_seq_len = get_llm_gpt4all(llama_kwargs) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/ashok/h2ogpt/src/gpt4all_llm.py", line 203, in get_llm_gpt4all llm = cls(model_kwargs) ^^^^^^^^^^^^^^^^^^^ File "/home/ashok/anaconda3/lib/python3.11/site-packages/pydantic/v1/main.py", line 341, in init raise validation_error pydantic.v1.error_wrappers.ValidationError: 1 validation error for H2OLlamaCpp root Could not load Llama model from path: llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf. Received error Failed to load model from file: llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf (type=value_error)

`

pseudotensor commented 5 months ago

It means it can't find the file or the file is corrupt.

This command, which you shared, works for me:

python generate.py --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF --prompt_type=mistral --max_seq_len=4096

I deleted my llamacpp_path folder and tried again; it downloads fine and is then used correctly.

Maybe at some point in the past you got a corrupted, incomplete version of the file.

Please delete the file llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf and try again. Or try to use that file with llama.cpp directly and see if that works. If it does work with llama.cpp, then I'm confused.
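A minimal sketch of that retry (the file path is the one from your trace; h2ogpt should re-download the model on the next run):

```
# Remove the possibly corrupted GGUF so it gets downloaded again.
rm llamacpp_path/mistral-7b-instruct-v0.2.Q5_K_M.gguf

# Re-run the same command; the model should be fetched afresh.
python generate.py --base_model=TheBloke/Mistral-7B-Instruct-v0.2-GGUF --prompt_type=mistral --max_seq_len=4096
```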