Open slavag opened 9 months ago
As it stands, there is no user-controlled metadata filtering. PDFs and similar files often carry date info, and we keep it; we also save the date and time of ingestion. In principle one can add extra filters in the filter_kwargs-related code in gpt_langchain.py, but that's not done at the moment. The plan is to eventually let the query make use of the metadata via query understanding, or to give users more control over metadata filtering.
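To make the idea of a date filter concrete, here is a minimal standalone sketch. It does not use the actual filter_kwargs code in gpt_langchain.py; the `filter_by_date` helper, the dict layout, and the `ingest_time` key are all hypothetical names chosen for illustration.

```python
from datetime import datetime
from typing import Dict, List, Optional


def filter_by_date(
    docs: List[Dict],
    key: str = "ingest_time",  # hypothetical metadata key
    start: Optional[datetime] = None,
    end: Optional[datetime] = None,
) -> List[Dict]:
    """Keep docs whose metadata[key] lies within [start, end].

    Docs without the key are dropped, since their date is unknown.
    """
    kept = []
    for doc in docs:
        ts = doc.get("metadata", {}).get(key)
        if ts is None:
            continue
        if start is not None and ts < start:
            continue
        if end is not None and ts > end:
            continue
        kept.append(doc)
    return kept


docs = [
    {"text": "old report", "metadata": {"ingest_time": datetime(2022, 3, 1)}},
    {"text": "new report", "metadata": {"ingest_time": datetime(2023, 9, 1)}},
    {"text": "undated note", "metadata": {}},
]
recent = filter_by_date(docs, start=datetime(2023, 1, 1))
```

Dropping undated documents is a design choice here; one could just as well keep them and let the user decide.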
@pseudotensor It would be good to define how to set the timestamp during ingestion, since the document's date is not always correct, and neither is the ingestion time. For example, one approach is to put the timestamp in the file name, or to define where in the file to take it from. Thanks
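A rough sketch of the file-name approach, assuming a naming convention where the name contains a date as YYYY-MM-DD or YYYYMMDD. The convention and both helper names are assumptions for illustration, not anything h2oGPT does today.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Optional

# Assumed convention: a date like 2023-05-17 or 20230517 somewhere in the name,
# not embedded in a longer run of digits.
_DATE_RE = re.compile(r"(?<!\d)(\d{4})-?(\d{2})-?(\d{2})(?!\d)")


def timestamp_from_filename(path: str) -> Optional[datetime]:
    """Return the date embedded in the file name, or None if there is none."""
    match = _DATE_RE.search(Path(path).name)
    if match is None:
        return None
    year, month, day = (int(g) for g in match.groups())
    try:
        return datetime(year, month, day)
    except ValueError:
        # Digits matched the pattern but are not a real calendar date.
        return None


def document_timestamp(path: str, ingest_time: datetime) -> datetime:
    """Prefer the file-name date; fall back to the ingestion time."""
    ts = timestamp_from_filename(path)
    return ts if ts is not None else ingest_time
```

Because the date comes from the name alone, this applies the same way to one file or to a large batch, as long as the files follow the convention.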
Yeah, most of the time you can't trust the time in a PDF. But it's unclear how to scale that if one had 4000 PDFs to ingest. If it's just one random PDF, then it's probably not urgent to add a timestamp, since the focus is on that one PDF. So I'm unsure how to proceed.
Other metadata might be more useful, but there's the same issue of how to trust it. And other kinds of docs won't have the same metadata, so it's hard to make a uniformly good experience.
Hi, let's assume I have a collection of embeddings. Is it possible for the most recent vectors (by some date) to be prioritized, rather than picked at random, when they have the same distance?
Thanks
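One way to get this without any support from the vector store is to re-rank the hits after retrieval. The sketch below assumes each hit is a (doc id, distance, timestamp) tuple; that layout and the `rank_with_recency` name are hypothetical. Since exact float ties are rare, distances are rounded so that near-equal ones count as tied.

```python
from datetime import datetime
from typing import List, Tuple

Hit = Tuple[str, float, datetime]  # (doc id, distance, timestamp)


def rank_with_recency(hits: List[Hit], ndigits: int = 3) -> List[Hit]:
    """Order by distance ascending; among tied distances, newest first."""
    # Python's sort is stable, so sorting by recency first and by rounded
    # distance second keeps the newest-first order inside each tie group.
    by_recency = sorted(hits, key=lambda h: h[2], reverse=True)
    return sorted(by_recency, key=lambda h: round(h[1], ndigits))


hits = [
    ("a", 0.2001, datetime(2022, 1, 1)),
    ("b", 0.2004, datetime(2023, 1, 1)),
    ("c", 0.1000, datetime(2020, 1, 1)),
]
ranked = rank_with_recency(hits)
```

Here "c" stays first because it is clearly closer, while "b" moves ahead of "a" because the two distances round to the same value and "b" is newer. The `ndigits` tolerance controls how aggressively recency overrides small distance differences.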