Closed: dlippold closed this issue 1 year ago.
I recommend llama.cpp based GGML models instead from TheBloke. For running any such models, see https://github.com/h2oai/h2ogpt/blob/main/docs/FAQ.md#adding-models . Re-open if you have questions.
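For reference, the "GGML v3" that the FAQ mentions appears to be the legacy llama.cpp ggjt v3 container. A downloaded .bin file can be checked for that header with a few lines of Python; this is only a sketch, assuming the legacy header layout (uint32 magic followed by uint32 version) and the 'ggjt' magic constant 0x67676a74 from llama.cpp, and the file path is just an example:

```python
import struct
import sys

# Sketch: inspect the header of a downloaded GGML .bin file.
# Assumes the legacy llama.cpp layout: uint32 magic, then uint32 version (little-endian).
GGJT_MAGIC = 0x67676A74  # 'ggjt', taken from llama.cpp

path = sys.argv[1]  # e.g. "llama-2-7b-chat.ggmlv3.q8_0.bin"
with open(path, "rb") as f:
    magic, version = struct.unpack("<II", f.read(8))

if magic == GGJT_MAGIC and version == 3:
    print("ggjt v3 header found -- this is the 'GGML v3' format the FAQ refers to")
else:
    print(f"unexpected header: magic=0x{magic:08x}, version={version}")
```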
What do you mean by "I recommend llama.cpp based GGML models instead from TheBloke"? Does that mean that the falcon model currently cannot be used at all, or that I should use the falcon model in a different version (e.g. v3)?
Because of the name of the file ggml-model-gpt4all-falcon-q4_0.bin,
I think it is a GGML file. But reading the FAQ from your link, I suppose that by "GGML model" you mean "GGML model v3" (quote from the FAQ: "GGML v3 quantized models are supported").
Therefore I downloaded the model file h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin
from the page https://huggingface.co/TheBloke/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3-GGML (i.e. GGML v3).
When I execute the command
python generate.py --score_model=None --share=False --local_files_only=True --base_model=llama --model_path_llama=/home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin --max_seq_len 2048
I get the following output:
Auto set langchain_mode=LLM. Could use MyData instead. To allow UserData to pull files from disk, set user_path or langchain_mode_paths, and ensure allow_upload_to_user_data=True
No GPUs detected
Using Model llama
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: llama
/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:995: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
llama.cpp: loading model from /home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 65024
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4544
llama_model_load_internal: n_mult = 71
llama_model_load_internal: n_head = 1
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 7
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: n_ff = 12141
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.05 MB
error loading model: llama.cpp: tensor 'tok_embeddings.weight' is missing from model
llama_load_model_from_file: failed to load model
Traceback (most recent call last):
  File "/home/h2ouser/h2ogpt/generate.py", line 16, in <module>
    entrypoint_main()
  File "/home/h2ouser/h2ogpt/generate.py", line 12, in entrypoint_main
    H2O_Fire(main)
  File "/home/h2ouser/h2ogpt/src/utils.py", line 59, in H2O_Fire
    fire.Fire(component=component, command=args)
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/h2ouser/h2ogpt/src/gen.py", line 1124, in main
    model0, tokenizer0, device = get_model(reward_type=False,
  File "/home/h2ouser/h2ogpt/src/gen.py", line 1579, in get_model
    model, tokenizer, device = get_model_tokenizer_gpt4all(base_model, n_jobs=n_jobs,
  File "/home/h2ouser/h2ogpt/src/gpt4all_llm.py", line 16, in get_model_tokenizer_gpt4all
    model = get_llm_gpt4all(model_name, model=None,
  File "/home/h2ouser/h2ogpt/src/gpt4all_llm.py", line 152, in get_llm_gpt4all
    llm = cls(**model_kwargs)
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/langchain/load/serializable.py", line 74, in __init__
    super().__init__(**kwargs)
  File "pydantic/main.py", line 341, in pydantic.main.BaseModel.__init__
pydantic.error_wrappers.ValidationError: 1 validation error for H2OLlamaCpp
__root__
  Could not load Llama model from path: /home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin. Received error (type=value_error)
Exception ignored in: <function Llama.__del__ at 0x7fc33ef27760>
Traceback (most recent call last):
  File "/home/h2ouser/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/llama_cpp/llama.py", line 1502, in __del__
    if self.ctx is not None:
AttributeError: 'Llama' object has no attribute 'ctx'
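One way to isolate whether the failure comes from h2ogpt or from llama.cpp itself is to load the file directly with llama-cpp-python, the library that appears in the traceback above. A minimal sketch, reusing the path and context length from the command above; it should hit the same "tensor 'tok_embeddings.weight' is missing" error:

```python
# Sketch: load the same file directly with llama-cpp-python, bypassing h2ogpt,
# to confirm the problem is in the model file rather than in the h2ogpt wrapper.
from llama_cpp import Llama

try:
    llm = Llama(
        model_path="/home/h2ouser/h2ogpt-gm-oasst1-en-2048-falcon-7b-v3.ggccv1.q4_0.bin",
        n_ctx=2048,
    )
except Exception as err:
    # llama-cpp-python raises here when llama.cpp rejects the file
    print("llama.cpp could not load the file:", err)
```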
What can I do to use a falcon model from TheBloke?
I see the same error with that particular model file. TheBloke must not have made it correctly.
If you try a llama-based file like the one in the readme.md (and many others I've tried), there is no problem.
(h2ogpt) jon@pseudotensor:~/h2ogpt$ python generate.py --score_model=None --share=False --local_files_only=True --base_model=llama --model_path_llama=llama-2-7b-chat.ggmlv3.q8_0.bin --max_seq_len 2048
Auto set langchain_mode=LLM. Could use MyData instead. To allow UserData to pull files from disk, set user_path or langchain_mode_paths, and ensure allow_upload_to_user_data=True
Using Model llama
Prep: persist_directory=db_dir_UserData exists, using
Starting get_model: llama
/home/jon/miniconda3/envs/h2ogpt/lib/python3.10/site-packages/transformers/models/auto/configuration_auto.py:995: FutureWarning: The `use_auth_token` argument is deprecated and will be removed in v5 of Transformers.
warnings.warn(
llama.cpp: loading model from llama-2-7b-chat.ggmlv3.q8_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 2048
llama_model_load_internal: n_embd = 4096
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 32
llama_model_load_internal: n_layer = 32
llama_model_load_internal: n_rot = 128
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 7 (mostly Q8_0)
llama_model_load_internal: n_ff = 11008
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size = 0.08 MB
llama_model_load_internal: mem required = 8500.72 MB (+ 1026.00 MB per state)
warning: failed to mlock 139264000-byte buffer (after previously locking 0 bytes): Cannot allocate memory
Try increasing RLIMIT_MLOCK ('ulimit -l' as root).
llama_new_context_with_model: kv self size = 1024.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
Model {'base_model': 'llama', 'tokenizer_base_model': '', 'lora_weights': '', 'inference_server': '', 'prompt_type': 'llama2', 'prompt_dict': {'promptA': '', 'promptB': '', 'PreInstruct': '<s>[INST] ', 'PreInput': None, 'PreResponse': '[/INST]', 'terminate_response': ['[INST]', '</s>'], 'chat_sep': ' ', 'chat_turn_sep': ' </s>', 'humanstr': '[INST]', 'botstr': '[/INST]', 'generates_leading_space': False, 'system_prompt': None}, 'load_8bit': False, 'load_4bit': False, 'low_bit_mode': 1, 'load_half': True, 'load_gptq': '', 'load_exllama': False, 'use_safetensors': False, 'revision': None, 'use_gpu_id': True, 'gpu_id': 0, 'compile_model': True, 'use_cache': None, 'llamacpp_dict': {'n_gpu_layers': 100, 'use_mlock': True, 'n_batch': 1024, 'n_gqa': 0, 'model_path_llama': 'llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}, 'model_path_llama': 'llama-2-7b-chat.ggmlv3.q8_0.bin', 'model_name_gptj': 'ggml-gpt4all-j-v1.3-groovy.bin', 'model_name_gpt4all_llama': 'ggml-wizardLM-7B.q4_2.bin', 'model_name_exllama_if_no_config': 'TheBloke/Nous-Hermes-Llama2-GPTQ'}
load INSTRUCTOR_Transformer
max_seq_length 512
Running on local URL: http://0.0.0.0:7860
To create a public link, set `share=True` in `launch()`.
Started Gradio Server and/or GUI: server_name: 0.0.0.0 port: None
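For what it's worth, the llamacpp_dict values shown in the Model line above map more or less directly onto llama-cpp-python's Llama() constructor, which h2ogpt (via the langchain LlamaCpp wrapper visible in the traceback) fills in for you. A rough sketch, not the h2ogpt code itself; the prompt is only an example:

```python
# Sketch: roughly how the llamacpp_dict settings above translate to llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="llama-2-7b-chat.ggmlv3.q8_0.bin",
    n_ctx=2048,        # --max_seq_len 2048
    n_gpu_layers=100,  # layers to offload when a GPU build is available
    n_batch=1024,
    use_mlock=True,    # source of the "failed to mlock" warning when RLIMIT_MEMLOCK is low
)
out = llm("What is a falcon?", max_tokens=32)
print(out["choices"][0]["text"])
```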
I downloaded the model
ggml-model-gpt4all-falcon-q4_0.bin
from the GPT4All page (https://gpt4all.io/index.html). How can I use that model? Which options for generate.py
and which values do I have to use for these options? If that model is not usable, is there another falcon model that I can download from https://huggingface.co/TheBloke ?
I want to use a falcon model and do not want to use a model from the Llama family. I also want to download the model manually, for offline use.
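As a side note on the original ggml-model-gpt4all-falcon-q4_0.bin file: since it comes from gpt4all.io rather than from a llama.cpp conversion, one way to check it offline, independently of h2ogpt, is the gpt4all Python package, which ships its own backend for those .bin files. A minimal sketch, assuming `pip install gpt4all` and that the file has already been downloaded manually into the current directory; whether h2ogpt's llama.cpp path can use it is a separate question:

```python
# Sketch: try the gpt4all.io falcon file with the gpt4all Python package, fully offline.
from gpt4all import GPT4All

model = GPT4All(
    model_name="ggml-model-gpt4all-falcon-q4_0.bin",
    model_path=".",        # directory containing the manually downloaded file
    allow_download=False,  # stay offline, never contact gpt4all.io
)
print(model.generate("What is a falcon?", max_tokens=64))
```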