vectara / vectara-ingest

An open source framework to crawl data sources and ingest into Vectara
https://vectara.com
Apache License 2.0
120 stars 48 forks source link

Table processing option #60

Closed ofermend closed 7 months ago

ofermend commented 8 months ago

This PR adds an optional step for processing tables in PDF documents before ingestion. This is a relatively common approach in LangChain (e.g. https://github.com/langchain-ai/langchain/blob/master/cookbook/Semi_Structured_RAG.ipynb) or LlamaIndex:

  1. Extract table (using unstructured.io) contents
  2. Ask OpenAI to summarize the contents of that table as text, and ingest that text into the corpus, in addition to the other content we already ingest.

With this PR this is an option that can be turned on, and if so required an addition of OpenAI key Note that this process is relatively slow (at least on my Mac M2) and so may or may not be useful at very large scale.