h2oai / h2ogpt

Private chat with local GPT with document, images, video, etc. 100% private, Apache 2.0. Supports oLLaMa, Mixtral, llama.cpp, and more. Demo: https://gpt.h2o.ai/ https://codellama.h2o.ai/
http://h2o.ai
Apache License 2.0

Collection documents prioritization #896

Open slavag opened 9 months ago

slavag commented 9 months ago

Hi, let's assume I have a collection of embeddings. Is it possible to prioritize the most recent vectors (by some date), rather than picking at random, when they have the same distance?

Thanks

pseudotensor commented 9 months ago

As-is, we have no metadata filtering that the user controls. But PDFs etc. often have date info, and we keep that. We also save the date and time of ingestion. In principle one can add extra filters in the filter_kwargs-related code in gpt_langchain.py, but it's not done at the moment. It's planned at some point to let the query understand the metadata via query understanding, or to give more control over metadata filtering.
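
To illustrate the tie-breaking asked about above, here is a minimal sketch that re-sorts results after retrieval. It is not h2oGPT code and does not touch the filter_kwargs code in gpt_langchain.py. The `(metadata, distance)` pair shape, the `date` metadata key, and the function name `rerank_by_recency` are all assumptions made for the example.

```python
from datetime import datetime


def rerank_by_recency(results, date_key="date", ndigits=6):
    """Sort (metadata, distance) pairs by distance, newest first among ties.

    Hypothetical helper: `results` is assumed to be a list of
    (metadata_dict, distance) pairs, with an optional datetime stored
    under `date_key`. Distances that agree after rounding to `ndigits`
    decimal places are treated as a tie.
    """

    def sort_key(item):
        metadata, distance = item
        stamp = metadata.get(date_key)
        # Entries without a usable date sort last within a tie.
        recency = stamp.timestamp() if isinstance(stamp, datetime) else float("-inf")
        # Smaller distance first, then larger (newer) timestamp first.
        return (round(distance, ndigits), -recency)

    return sorted(results, key=sort_key)
```

Rounding is a crude way to define "same distance". A real implementation would need to pick a tolerance that suits the embedding model and distance metric.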

slavag commented 9 months ago

@pseudotensor It would be good to define how to set the timestamp during ingestion, since neither the document's date nor the ingestion time is always correct. For example, one approach is to put the timestamp in the file name, or to define where in the file to take it from. Thanks
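
The file-name approach could look roughly like the sketch below. The naming convention (a `YYYY-MM-DD` or `YYYYMMDD` date somewhere in the file name) and the helper `timestamp_from_filename` are assumptions made for illustration, not an existing h2oGPT feature.

```python
import re
from datetime import datetime
from pathlib import Path

# Assumed convention: a date such as 2023-06-01 or 20230601 appears in the
# file name and is not embedded in a longer run of digits.
_DATE_IN_NAME = re.compile(r"(?<!\d)(\d{4})-?(\d{2})-?(\d{2})(?!\d)")


def timestamp_from_filename(path):
    """Return the date embedded in the file name, or None if there is none."""
    match = _DATE_IN_NAME.search(Path(path).stem)
    if match is None:
        return None
    year, month, day = map(int, match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:
        # Digits matched the pattern but are not a real calendar date.
        return None
```

When this returns None, an ingestion step could fall back to the document's own date or to the ingestion time.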

pseudotensor commented 9 months ago

Ya, most of the time one can't trust the time in a PDF. But it's unclear how to scale that if one had 4000 PDFs to ingest. If it's just one random PDF, then it's probably not urgent to add a timestamp, since one is just focused on that PDF. So unsure how to proceed.

Other metadata might be more useful, but there's the same issue of how to trust it. And other kinds of docs will not have the same metadata, so it's hard to make a uniformly good experience.