h2oai / h2ogpt

Private chat with local GPT with documents, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
http://h2o.ai
Apache License 2.0

Add support for weaviate #145

Closed. hsm207 closed this issue 1 year ago.

hsm207 commented 1 year ago

I see issue #134 mentions:

Consider https://github.com/weaviate/weaviate if want to store both vectors and objects in db -- not necessarily wanted in general, but makes db stable against needing original file locations for links. https://github.com/imartinez/privateGPT/pull/208

I'd like to contribute a PR for this

pseudotensor commented 1 year ago

@hsm207 Absolutely, please do!

hsm207 commented 1 year ago

@pseudotensor great! Any pointers to get started, e.g. some docs to read on how to add a new db, interfaces to implement, etc.?

pseudotensor commented 1 year ago

I already have faiss and chroma. The code is very well isolated. Look at functions:

1) get_db
2) add_to_db

and look at the variable db_type; that is the string to change when calling from the CLI. Adding a new value also means updating a few documentation spots, which just means searching for "chroma" or "faiss" and seeing where they are mentioned in the readme or code comments.

And of course you can switch the db in the test code in test_langchain_units.py.
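
For orientation, here is a minimal sketch of what adding a weaviate branch next to the existing ones could look like, assuming get_db is shaped roughly as in gpt_langchain.py (the real signature has more arguments, get_embedding is assumed to be the existing helper, and the Weaviate wrapper kwargs are assumptions about the langchain API):

def get_db(sources, use_openai_embedding=False, db_type='faiss',
           persist_directory='db_dir'):
    # Sketch only: helper names and signatures are assumptions.
    embedding = get_embedding(use_openai_embedding)
    if db_type == 'faiss':
        from langchain.vectorstores import FAISS
        db = FAISS.from_documents(sources, embedding)
    elif db_type == 'chroma':
        from langchain.vectorstores import Chroma
        db = Chroma.from_documents(documents=sources, embedding=embedding,
                                   persist_directory=persist_directory)
        db.persist()
    elif db_type == 'weaviate':
        # New branch: index into a running Weaviate instance; URL is illustrative.
        import weaviate
        from langchain.vectorstores import Weaviate
        client = weaviate.Client('http://localhost:8080')
        db = Weaviate.from_documents(documents=sources, embedding=embedding,
                                     client=client)
    else:
        raise RuntimeError("No such db_type=%s" % db_type)
    return db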

hsm207 commented 1 year ago

@pseudotensor I see in the test_langchain_units.py file that there are only 2 tests with chroma or faiss in the name:

  1. test_qa_daidocs_db_chunk_hf_faiss
  2. test_qa_daidocs_db_chunk_hf_chroma

So, is creating test_qa_daidocs_db_chunk_hf_weaviate really enough to verify the correctness of the new vector db?

hsm207 commented 1 year ago

@pseudotensor are the tests in tests/test_langchain_units.py meant to be run on a machine with a GPU?

I ran pytest tests/test_langchain_units.py and all the test_qa tests, e.g. test_qa_daidocs_db_chunk_hf_chroma, failed with this reason:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/workspaces/h2ogpt/utils.py", line 753, in _traced_func
    return func(*args, **kwargs)
  File "/workspaces/h2ogpt/tests/test_langchain_units.py", line 163, in test_qa_daidocs_db_chunk_hf_chroma
    check_ret(ret)
  File "/workspaces/h2ogpt/tests/test_langchain_units.py", line 68, in check_ret
    for ret1 in ret:
  File "/workspaces/h2ogpt/gpt_langchain.py", line 935, in _run_qa_db
    llm, model_name, streamer, prompt_type_out = get_llm(use_openai_model=use_openai_model, model_name=model_name,
  File "/workspaces/h2ogpt/gpt_langchain.py", line 171, in get_llm
    model = AutoModelForCausalLM.from_pretrained(model_name,
  File "/home/vscode/.local/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 471, in from_pretrained
    return model_class.from_pretrained(
  File "/home/vscode/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2740, in from_pretrained
    raise ValueError(
ValueError: Some modules are dispatched on the CPU or the disk. Make sure you have enough GPU RAM to fit the quantized model. If you want to dispatch the model on the CPU or the disk while keeping these modules in 32-bit, you need to set load_in_8bit_fp32_cpu_offload=True and pass a custom device_map to from_pretrained. Check https://huggingface.co/docs/transformers/main/en/main_classes/quantization#offload-between-cpu-and-gpu for more details.
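
For reference, the error above is raised during 8-bit quantized loading when the model does not fit in GPU RAM. On a CPU-only machine, one hedged workaround is to load the model unquantized with a CPU device_map; the model name below is only illustrative, and the actual flags h2ogpt passes in get_llm may differ:

import torch
from transformers import AutoModelForCausalLM

# Illustrative only: load unquantized on CPU to avoid the 8-bit dispatch error.
model = AutoModelForCausalLM.from_pretrained(
    'h2oai/h2ogpt-oig-oasst1-512-6_9b',  # hypothetical choice of model
    device_map={'': 'cpu'},              # keep every module on CPU
    torch_dtype=torch.float32,           # no 8-bit quantization
)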

pseudotensor commented 1 year ago

Yes, that would work, to just add another test.

The _run_qa_db by default makes a "model" if model is None, so you can choose any model you'd like. This is a unit test, not a full integration test; it just happens to make a model if you didn't pass one.

So you can do:

@wrap_test_forked
def test_qa_wiki_db_chunk_hf_weaviate():

    from gpt4all_llm import get_model_tokenizer_gpt4all
    model_name = 'llama'
    model, tokenizer, device = get_model_tokenizer_gpt4all(model_name)

    from gpt_langchain import _run_qa_db
    query = "What are the main differences between Linux and Windows?"
    # chunk_size is chars for each of k=4 chunks
    ret = _run_qa_db(query=query, use_openai_model=False, use_openai_embedding=False, text_limit=None, chunk=True,
                     chunk_size=128 * 1,  # characters, and if k=4, then 4*4*128 = 2048 chars ~ 512 tokens
                     langchain_mode='wiki',
                     db_type='weaviate',
                     prompt_type='wizard2',
                     model_name=model_name, model=model, tokenizer=tokenizer,
                     )
    check_ret(ret)

pseudotensor commented 1 year ago

Here's a draft PR: https://github.com/h2oai/h2ogpt/pull/215. Right now the test runs but uses chroma.

hsm207 commented 1 year ago

@pseudotensor thanks for the draft PR. The test can be run on a CPU too, right? If not, let me know what kind of GPU you develop this project on.

pseudotensor commented 1 year ago

@hsm207 correct. It still requires that one has downloaded the WizardLM model as described in the readme.md. And by using "wiki" mode it hits the Wikipedia API, which requires internet. We can add more code to test offline, but this should be fine for now.

(h2ollm) jon@pseudotensor:~/h2ogpt$ CUDA_VISIBLE_DEVICES= pytest -s -v tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate
============================================================================================================== test session starts ==============================================================================================================
platform linux -- Python 3.10.11, pytest-7.2.2, pluggy-1.0.0 -- /home/jon/miniconda3/envs/h2ollm/bin/python
cachedir: .pytest_cache
rootdir: /home/jon/h2ogpt
plugins: anyio-3.6.2, xdist-3.2.1
collected 1 item                                                                                                                                                                                                                                

tests/test_langchain_units.py::test_qa_wiki_db_chunk_hf_weaviate llama.cpp: loading model from WizardLM-7B-uncensored.ggmlv3.q8_0.bin
llama_model_load_internal: format     = ggjt v3 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 1792
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 7 (mostly Q8_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =    0.07 MB
llama_model_load_internal: mem required  = 8620.72 MB (+ 1026.00 MB per state)
.
llama_init_from_file: kv self size  =  896.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
('\n\nThe main differences between Linux and Windows are:\n\n1. Licensing: Linux is licensed under the GPL (General Public License), which allows for free use and modification of the software. Windows is licensed under a variety of proprietary software licenses, which restrict its use and modification.\n\n2. Kernel: Linux uses a monolithic kernel, which is the core component of the operating system that manages system resources. Windows uses a microkernel, which is a smaller, more modular component that manages system resources.\n\n3. File system: Linux uses a hierarchical file system, while Windows uses a flat file system. This means that Linux organizes files and directories in a tree-like structure, while Windows organizes them in a linear fashion.\n\n4. User interface: Linux has a variety of user interfaces (UIs) that can be used, including GNOME, KDE, and Xfce. Windows has a consistent UI across all versions, but it is not customizable like Linux UIs.\n\n5. Security: Linux is generally considered more secure than Windows because it is open-source and has a large community of developers who work to identify and fix security\n\nSources [Score | Link]:<p><ul><li>0.69 | <a href="https://en.wikipedia.org/wiki/Linux" target="_blank"  rel="noopener noreferrer">https://en.wikipedia.org/wiki/Linux</a></li></ul></p>End Sources<p>', '\nSources [Score | Link]:<p><ul><li>0.69 | <a href="https://en.wikipedia.org/wiki/Linux" target="_blank"  rel="noopener noreferrer">https://en.wikipedia.org/wiki/Linux</a></li></ul></p>End Sources<p>')
PASSED

=============================================================================================================== warnings summary ================================================================================================================
tests/test_langchain_units.py:542
  /home/jon/h2ogpt/tests/test_langchain_units.py:542: DeprecationWarning: invalid escape sequence '\m'
    rtf_content = """

../miniconda3/envs/h2ollm/lib/python3.10/site-packages/pkg_resources/__init__.py:121
  /home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/pkg_resources/__init__.py:121: DeprecationWarning: pkg_resources is deprecated as an API
    warnings.warn("pkg_resources is deprecated as an API", DeprecationWarning)

../miniconda3/envs/h2ollm/lib/python3.10/site-packages/pkg_resources/__init__.py:2870
  /home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('google')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

../miniconda3/envs/h2ollm/lib/python3.10/site-packages/pkg_resources/__init__.py:2870
  /home/jon/miniconda3/envs/h2ollm/lib/python3.10/site-packages/pkg_resources/__init__.py:2870: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('mpl_toolkits')`.
  Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
    declare_namespace(pkg)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=================================================================================================== 1 passed, 4 warnings in 83.33s (0:01:23) ====================================================================================================
(h2ollm) jon@pseudotensor:~/h2ogpt$ 

hsm207 commented 1 year ago

@pseudotensor why is db_type in make_db_main() hard-coded to chroma?

It is also hard-coded here.

What do you think of making it a parameter of the make_db_main function instead, with chroma as the default?
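
Something along these lines, as a hedged sketch (the real make_db_main has more arguments, and gather_sources is a hypothetical stand-in for however it collects documents):

def make_db_main(use_openai_embedding=False,
                 persist_directory='db_dir',
                 db_type='chroma',  # was hard-coded; now a parameter with the old default
                 **kwargs):
    # Hypothetical helper standing in for the existing source-collection logic.
    sources = gather_sources(**kwargs)
    return get_db(sources,
                  use_openai_embedding=use_openai_embedding,
                  db_type=db_type,
                  persist_directory=persist_directory)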

pseudotensor commented 1 year ago

@hsm207 Yes, good point. It was hard-coded (vs. FAISS) because only chroma naturally persists the database, so if the persist directory exists, it could only have been chroma before.
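
In code terms that assumption reads roughly like this (a sketch with hypothetical names, not the repo's actual function): an existing persist directory can only have been written by chroma, because FAISS was held in memory.

import os

def get_existing_db(persist_directory, embedding):
    # Sketch: only chroma writes an on-disk index, so finding the persist
    # directory implies a chroma db; FAISS must be rebuilt from sources.
    if os.path.isdir(persist_directory):
        from langchain.vectorstores import Chroma
        return Chroma(persist_directory=persist_directory,
                      embedding_function=embedding)
    return None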