Closed: dlippold closed this issue 1 year ago.
I recommend llama.cpp based GGML models instead from TheBloke. For running any such models, see https://github.com/h2oai/h2ogpt/blob/main/docs/FAQ.md#adding-models . Re-open if you have questions.
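For reference, the "GGML v3" that the FAQ mentions appears to be the legacy llama.cpp ggjt v3 container. A downloaded .bin file can be checked for that header with a few lines of Python; this is only a sketch, assuming the legacy header layout (uint32 magic followed by uint32 version) and the 'ggjt' magic constant 0x67676a74 from llama.cpp, and the file path is just an example:

```python
import struct
import sys

# Sketch: inspect the header of a downloaded GGML .bin file.
# Assumes the legacy llama.cpp layout: uint32 magic, then uint32 version (little-endian).
GGJT_MAGIC = 0x67676A74  # 'ggjt', taken from llama.cpp

path = sys.argv[1]  # e.g. "llama-2-7b-chat.ggmlv3.q8_0.bin"
with open(path, "rb") as f:
    magic, version = struct.unpack("<II", f.read(8))

if magic == GGJT_MAGIC and version == 3:
    print("ggjt v3 header found -- this is the 'GGML v3' format the FAQ refers to")
else:
    print(f"unexpected header: magic=0x{magic:08x}, version={version}")
```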
What do you mean by "I recommend llama.cpp based GGML models instead from TheBloke"? Does that mean that the falcon model currently cannot be used at all, or that I should use the falcon model in a different version (e.g. v3)?
Because of the name of the file ggml-model-gpt4all-falcon-q4_0.bin,
I think it is a GGML file. But reading the FAQ from your link, I suppose that by "GGML model" you mean "GGML model v3" (quote from the FAQ: "GGML v3 quantized models are supported").
Therefore I downloaded the model file h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin
from the page https://huggingface.co/TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3-GGML (i.e. GGML v3).
When I execute the command
python generate.py --score_model=None --share=False --local_files_only=True --base_model=llama --model_path_llama=/home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin --max_seq_len 2048
I get the following output:
Auto set langchain_mode=LLM. Could use MyData instead. To allow UserData to pull files from disk, set user_path or langchain_mode_paths, and ensure allow_upload_to_user_data=True
No GPUs detected
Using Model llama
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: llama
/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:995: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
llama.cpp: loading model from /home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 65024
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4544
llama_model_load_internal: n_mult = 71
llama_model_load_internal: n_head = 1
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 7
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 12141
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.05 MB
error loading model: llama.cpp: tensor 'tok_embeddings.weight' is missing from model
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/home/h2ouser/h2ogpt/generate.py", line 16, in <module>
    entrypoint_main()
  File "/home/h2ouser/h2ogpt/generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "/home/h2ouser/h2ogpt/src/utils.py", line 59, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/h2ouser/h2ogpt/src/gen.py", line 1124, in main
    model0, tokenizer0, device = get_model(reward_type=False,
  File "/home/h2ouser/h2ogpt/src/gen.py", line 1579, in get_model
    model, tokenizer, device = get_model_tokenizer_gpt4all(base_model, n_jobs=n_jobs,
  File "/home/h2ouser/h2ogpt/src/gpt4all_llm.py", line 16, in get_model_tokenizer_gpt4all
    model = get_llm_gpt4all(model_name, model=None,
  File "/home/h2ouser/h2ogpt/src/gpt4all_llm.py", line 152, in get_llm_gpt4all
    llm = cls(**model_kwargs)
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/langchain/load/serializable.py", line 74, in __init__
    super().__init__(**kwargs)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for H2OLlamaCpp
__root__
  Could not load Llama model from path: /home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin. Received error (type=value_error)
Exception ignored in: <function Llama.__del__ at 0x7fc33ef27760>
Traceback (most recent call last):
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/llama_cpp/llama.py", line 1502, in __del__
    if self.ctx is not None:
AttributeError: 'Llama' object has no attribute 'ctx'
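One way to isolate whether the failure comes from h2ogpt or from llama.cpp itself is to load the file directly with llama-cpp-python, the library that appears in the traceback above. A minimal sketch, reusing the path and context length from the command above; it should hit the same "tensor 'tok_embeddings.weight' is missing" error:

```python
# Sketch: load the same file directly with llama-cpp-python, bypassing h2ogpt,
# to confirm the problem is in the model file rather than in the h2ogpt wrapper.
from llama_cpp import Llama

try:
    llm = Llama(
        model_path="/home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin",
        n_ctx=2048,
    )
except Exception as err:
    # llama-cpp-python raises here when llama.cpp rejects the file
    print("llama.cpp could not load the file:", err)
```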
What can I do to use a falcon model from TheBloke?
I see the same error with that particular model file. TheBloke must not have made it correctly.
If you try a llama-based file like the one in the readme.md (and many others I've tried), there is no problem.
(h2ogpt) jon@pseudotensor:~/h2ogpt$ python generate.py --score_model=None --share=False --local_files_only=True --base_model=llama --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin --max_seq_len 2048
Auto set langchain_mode=LLM. Could use MyData instead. To allow UserData to pull files from disk, set user_path or langchain_mode_paths, and ensure allow_upload_to_user_data=True
Using Model llama
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: llama
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:995: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 8500.72 MB (+ 1026.00 MB per state)
warning: failed to mlock 139264000-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'load_gptq': '', 'load_exllama': False, 'use_safetensors': False, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': True, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': 'llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}, 'model_path_llama': 'llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}
load INSTRUCTOR_Transformer
max_seq_length 512
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Started Gradio Server and/or GUI: server_name: 0.0.0.0 port: None
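For what it's worth, the llamacpp_dict values shown in the Model line above map more or less directly onto llama-cpp-python's Llama() constructor, which h2ogpt (via the langchain LlamaCpp wrapper visible in the traceback) fills in for you. A rough sketch, not the h2ogpt code itself; the prompt is only an example:

```python
# Sketch: roughly how the llamacpp_dict settings above translate to llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.ggmlv3.q8_0.bin",
    n_ctx=2048,        # --max_seq_len 2048
    n_gpu_layers=100,  # layers to offload when a GPU build is available
    n_batch=1024,
    use_mlock=True,    # source of the "failed to mlock" warning when RLIMIT_MEMLOCK is low
)
out = llm("What is a falcon?", max_tokens=32)
print(out["choices"][0]["text"])
```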
I downloaded the model
ggml-model-gpt4all-falcon-q4_0.bin
from the GPT4All page (https://gpt4all.io/index.html). How can I use that model? Which options for generate.py
and which values do I have to use for these options? If that model is not usable, is there another falcon model that I can download from https://huggingface.co/TheBloke ?
I want to use a falcon model and do not want to use a model from the Llama family. I also want to download the model manually, for offline use.
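As a side note on the original ggml-model-gpt4all-falcon-q4_0.bin file: since it comes from gpt4all.io rather than from a llama.cpp conversion, one way to check it offline, independently of h2ogpt, is the gpt4all Python package, which ships its own backend for those .bin files. A minimal sketch, assuming `pip install gpt4all` and that the file has already been downloaded manually into the current directory; whether h2ogpt's llama.cpp path can use it is a separate question:

```python
# Sketch: try the gpt4all.io falcon file with the gpt4all Python package, fully offline.
from gpt4all import GPT4All

model = GPT4All(
    model_name="ggml-model-gpt4all-falcon-q4_0.bin",
    model_path=".",        # directory containing the manually downloaded file
    allow_download=False,  # stay offline, never contact gpt4all.io
)
print(model.generate("What is a falcon?", max_tokens=64))
```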