I'm also curious about the implementation details.
@Eliyas Sorry for the delay.
1a. The simplest way to implement efficient document ingestion is to cache the embedding for a given input. That is relatively easy.
1b. Permission-based access is automatic if you use personal collections with authenticated access for each user.
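For illustration, a minimal sketch of what such embedding caching looks like (this is not h2oGPT's internal code; `embed_fn` stands in for whatever embedding model you use):

```python
import hashlib

# Sketch: hash the input text and reuse the stored vector on repeat
# ingestion, so duplicated documents never hit the embedding model twice.
_cache: dict[str, list[float]] = {}

def embed_cached(text: str, embed_fn) -> list[float]:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in _cache:
        _cache[key] = embed_fn(text)  # expensive model call happens only once
    return _cache[key]
```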
2. h2oGPT automatically adds metadata to each document, and it's searchable with AND or OR logic, but there is no way to add extra custom metadata through h2oGPT itself. However, if the parsed document already contains metadata (e.g. a PDF) and you set metadata_in_context='all' or pass a list of keys to metadata_in_context, then we will use it. So the problem reduces to the PDF carrying the correct metadata and you choosing which keys to use.
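For example, one way to bake custom metadata into the PDF itself before ingestion, sketched with pypdf (the `/Client` and `/Role` keys here are illustrative, not anything h2oGPT requires):

```python
from pypdf import PdfReader, PdfWriter

# Sketch: write custom metadata into the PDF before ingestion, so the
# parser can surface it later. Custom pypdf keys must start with "/".
reader = PdfReader("report.pdf")
writer = PdfWriter()
writer.append(reader)
writer.add_metadata(
    {
        "/Author": "Acme Corp",
        "/Client": "acme",  # illustrative custom keys
        "/Role": "admin",
    }
)
with open("report_tagged.pdf", "wb") as f:
    writer.write(f)
```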
3. No need to restart. If the collection is created outside the server with make_db, then any user can add that collection by name and see it. For specific users, follow this: https://github.com/h2oai/h2ogpt/blob/main/docs/README_LangChain.md#personal-collections-with-make_db
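As a sketch, pre-creating one personal collection per user could be scripted like this (the flags mirror the make_db command quoted below; adjust user names and paths to your layout):

```python
import subprocess

# Sketch: create one personal collection per user ahead of time with
# make_db, so no server restart is needed when users are added.
for user in ["user1", "user2"]:
    subprocess.run(
        [
            "python", "src/make_db.py",
            f"--user_path=clients/{user}",
            f"--collection_name={user}",
            "--langchain_type=personal",
            f"--persist_directory=users/{user}/db_dir_{user}",
        ],
        check=True,
    )
```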
We have a multi-tenant application and are integrating H2OGPT to serve various clients. Each client may have hundreds of users.
1. Our goal is to store and ingest user-, client-, and permission-scoped documents efficiently within H2OGPT. Is there a way to restrict search to only the documents each user is permitted to see, without duplicating them? I have explored using collections in H2OGPT, where each user would have a separate collection, but duplicating documents across multiple collections is not ideal for us. Administrators of each client should have access to all documents, while other users should only access documents based on their permissions.
2. How do we add custom metadata to a document? I am thinking of filtering documents based on custom metadata such as roles, permissions, and client names. I saw that ChromaDB has a metadata filter available in its query, but I am not sure how to add custom metadata to a document or how to pass a metadata filter to the summary or query API in H2OGPT.
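For reference, this is roughly what metadata filtering looks like at the ChromaDB level, outside of h2oGPT (the collection and key names here are made up):

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("docs")

# Attach custom metadata (role, client) at ingestion time.
collection.add(
    documents=["quarterly report text ..."],
    metadatas=[{"role": "admin", "client": "acme"}],
    ids=["doc-1"],
)

# Restrict retrieval by metadata with a `where` clause.
results = collection.query(
    query_texts=["revenue figures"],
    where={"client": "acme"},
    n_results=3,
)
```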
3. Do we need to restart the AI server to create a new collection and DB?
To create a user collection, I am currently doing this:
python src/make_db.py --user_path=clients/user1 --collection_name=user1 --langchain_type=personal --hf_embedding_model=hkunlp/instructor-large --persist_directory=users/test/db_dir_user1
Then I start the server (the list and dict arguments need double quotes so the shell passes them through intact):
python generate.py --base_model=TheBloke/zephyr-7B-beta-GGUF --tokenizer_base_model=HuggingFaceH4/zephyr-7b-beta --prompt_type=zephyr --max_seq_len=4096 --langchain_mode='UserData' --langchain_modes="['UserData', 'img', 'live', 'user1']" --langchain_mode_paths="{'UserData':'user_path','img':'clients/img','live':'clients/live','user1':'clients/user1'}" --langchain_mode_types="{'UserData':'personal','img':'personal','live':'personal','user1':'personal'}" --save_dir=saveDir --verbose=True --system_prompt='auto'
Do we need to re-declare every user collection in langchain_modes, langchain_mode_paths, and langchain_mode_types each time we restart the server, or is this a one-time setup? If we have 100 users, do we need to include details for each user's collection in the command?
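In case it helps, a sketch of generating those flags programmatically rather than hand-writing them for 100 users (assumes one personal collection per user under clients/<user>; names are illustrative):

```python
# Hypothetical helper: build the per-user collection flags for generate.py
# instead of hand-writing them for every user.
users = [f"user{i}" for i in range(1, 101)]

langchain_modes = ["UserData"] + users
mode_paths = {"UserData": "user_path", **{u: f"clients/{u}" for u in users}}
mode_types = {m: "personal" for m in langchain_modes}

print(f'--langchain_modes="{langchain_modes}"')
print(f'--langchain_mode_paths="{mode_paths}"')
print(f'--langchain_mode_types="{mode_types}"')
```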