h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
http://h2o.ai
Apache License 2.0

can't add personal data db/collection to auth.json #1684

Closed: rxng closed this issue 5 days ago

rxng commented 2 weeks ago

According to the instructions, we can add a database built with make_db.py to auth.json, but they do not specify exactly how to do this.

To make a new one for the user, fill `user_path_jon` with documents (can be soft or hard linked to avoid dups across multiple users), then run:
```bash
python src/make_db.py --user_path=gptdocsdb/jon --collection_name=JonData --langchain_type=personal --hf_embedding_model=hkunlp/instructor-large --persist_directory=users/jon/db_dir_JonData
```

Then you'll have:

```text
(h2ogpt) jon@pseudotensor:~/h2ogpt$ ls -alrt users/jon/db_dir_JonData/
total 264
drwx------ 13 jon jon   4096 Apr 16 12:28 ../
drwx------  2 jon jon   4096 Apr 16 12:28 d7ccacb6-93fe-4380-9340-b7f5edffb655/
-rw-------  1 jon jon 249856 Apr 16 12:28 chroma.sqlite3
-rw-------  1 jon jon     41 Apr 16 12:28 embed_info
drwx------  3 jon jon   4096 Apr 16 12:28 ./
```

You can add that database to the user's entry in auth.json, if using an auth.json type file, and they will see it when they log in.
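A missing space between two flags is easy to overlook in a long one-line command. As a minimal sketch, the same invocation can be assembled from an argument list in Python, so each flag stays a separate argument. The helper name `build_make_db_cmd` is hypothetical; the flags and values are the ones quoted above.

```python
import shlex


def build_make_db_cmd(user_path, collection_name, persist_directory,
                      langchain_type="personal",
                      hf_embedding_model="hkunlp/instructor-large"):
    """Return the make_db.py invocation as a list of separate arguments.

    Hypothetical helper: it only assembles the flags quoted in this thread.
    """
    return [
        "python", "src/make_db.py",
        f"--user_path={user_path}",
        f"--collection_name={collection_name}",
        f"--langchain_type={langchain_type}",
        f"--hf_embedding_model={hf_embedding_model}",
        f"--persist_directory={persist_directory}",
    ]


cmd = build_make_db_cmd("gptdocsdb/jon", "JonData", "users/jon/db_dir_JonData")
# Print a copy-pasteable shell command; every flag is space-separated.
print(shlex.join(cmd))
```

The list can also be handed to `subprocess.run(cmd, check=True)` from the h2ogpt checkout instead of being printed.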


h2ogpt is being run like so, and everything works well except that it does not load the correct collection for the user:
`python generate.py --base_model=mistral-7b-instruct-v0.2.Q8_0.gguf --score_model=None --prompt_type=instruct --auth_access=closed --auth=auth.json --guest_name='' --auth_freeze`

I have tried the following, adding db parameters, but it does not work.

{
  "jon": {
    "password": "jon1306",
    "userid": "acb8fef1a77d122b5e12b261202ada7a",
    "selection_docs_state": {
      "langchain_modes": ["JonData", "LLM", "Disabled"],
      "langchain_mode_types": {"JonData": "personal"}
    },
    "dbs": "users/jon/db_dir_JonData",
    "load_db_if_exists": "users/jon/db_dir_JonData"
  }
}



How do we make it so that when a user logs in, their collection JonData is automatically added?
Or is there any way to simply specify a per-user user_path? That would be easiest.

pseudotensor commented 2 weeks ago

If you are trying this for a shared collection, did you try the CLI options?

https://github.com/h2oai/h2ogpt/blob/main/docs/README_LangChain.md#multiple-embeddings-and-sources

i.e.

python generate.py --model_lock="[{'base_model': 'llama', 'model_path_llama': 'Phi-3-mini-4k-instruct-q4.gguf', 'tokenizer_base_model': 'microsoft/Phi-3-mini-4k-instruct'}]" --use_auth_token=$HUGGING_FACE_HUB_TOKEN --langchain_modes="['UserData', 'MyData', 'UserData2']"

Would show all users those 2 by default.

Even if a user logs in that already had a db entry, they will be forced to see those CLI ones.

If the system is online, without restarting, there's currently no way to add to all users at once with e.g. some kind of global user added settings. Is that what you are trying to achieve?

pseudotensor commented 2 weeks ago

For personal collections, there are no CLI options for that; it's only in the db/json file. By default, a sqlite3 db is used in newer h2oGPT to address speed issues with json, so one would have to edit the db using operations like those in src/db_utils.py.

I'll think about how to handle this better, probably adding an option to add things via the admin page is best. Would that work for you?
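Before editing a sqlite3 auth db by hand, it can help to look at what it actually contains. The thread does not describe the schema, so this sketch makes no assumption about it and only lists tables and columns using the standard `sqlite3` module. The function name `describe_tables` and the demo table `demo_users` are hypothetical and are not h2oGPT's real table names.

```python
import sqlite3
from contextlib import closing


def describe_tables(conn):
    """Return {table_name: [column names]} for an open SQLite connection."""
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")]
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    return {
        table: [col[1] for col in conn.execute(f'PRAGMA table_info("{table}")')]
        for table in tables
    }


# Demo on a throwaway in-memory database with a hypothetical table.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE demo_users (username TEXT PRIMARY KEY, data TEXT)")
    schema = describe_tables(conn)
    print(schema)
```

To inspect a real file, open it with `sqlite3.connect(path)` on a backup copy first. Note that `sqlite3.connect` creates an empty file if the path does not exist.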

rxng commented 2 weeks ago

Thanks for your quick response! Maybe my explanation was confusing. What I was trying to achieve is that when a user logs in, their own collection is automatically loaded for them.

However, after trying every single parameter, I found a way to do it via the auth.json file, by adding the line "langchain_mode": "JonData", above the selection_docs_state entry, like so:

"langchain_mode": "JonData",
    "selection_docs_state": {

The only question I have is: if we then wanted to add more documents to the collection via make_db.py, would we have to restart the entire instance of h2ogpt for it to use the updated collection?

It would definitely be great if there was an admin page where these things could easily be managed :)
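The auth.json change described above (adding "langchain_mode" above "selection_docs_state" in the user's entry) can be sketched with Python's standard `json` module. The helper name `set_default_collection` is hypothetical, the sample entry is abridged from the one posted earlier with a placeholder password, and whether key order matters to h2oGPT is not stated in this thread; the sketch simply preserves the placement that was reported to work.

```python
import json


def set_default_collection(auth, username, collection):
    """Return a copy of auth with "langchain_mode" set for one user.

    The key is placed just before "selection_docs_state" when that key
    exists, otherwise it is appended at the end of the entry.
    """
    updated = {}
    for key, value in auth[username].items():
        if key == "langchain_mode":
            continue  # drop any old value; it is re-added below
        if key == "selection_docs_state":
            updated["langchain_mode"] = collection
        updated[key] = value
    updated.setdefault("langchain_mode", collection)
    return {**auth, username: updated}


# Abridged sample entry; the password is a placeholder.
auth = {
    "jon": {
        "password": "<password>",
        "userid": "acb8fef1a77d122b5e12b261202ada7a",
        "selection_docs_state": {
            "langchain_modes": ["JonData", "LLM", "Disabled"],
            "langchain_mode_types": {"JonData": "personal"},
        },
    }
}

new_auth = set_default_collection(auth, "jon", "JonData")
print(json.dumps(new_auth, indent=2))
```

For a real file, load it with `json.load`, apply the helper, and write the result back with `json.dump` while h2oGPT is not running, keeping a backup of the original.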

pseudotensor commented 5 days ago

[four images]

rxng commented 5 days ago

[four images]

that's so amazing @pseudotensor !!

pseudotensor commented 5 days ago

Note that if you have an auth file that is .json, just pass it to the CLI as .db instead, and we'll migrate it to the .db format that is required for this control:

https://github.com/h2oai/h2ogpt/blob/3498b03fcd814458cea7e319039c42df48b1231a/src/db_utils.py#L80-L101