tensorlakeai / indexify

A realtime indexing and structured extraction engine for unstructured data, for building Generative AI applications
https://docs.getindexify.ai
Apache License 2.0

Add Data Transformers to Data Repository #107

Open diptanu opened 11 months ago

diptanu commented 11 months ago

Content is extracted when a developer binds an extractor to a data repository. As new content lands, the extractors are applied to it and the derived information is written to indexes.

Extractors are responsible for chunking content, for example splitting the text of a document before it is embedded. Certain extractors, like NER and embedding extractors, could share the same chunked content, since the context length of their underlying models is limited. Currently these extractors duplicate the text splitting work.

The solution would be to introduce a high-level transformer concept that applies algorithms to content and stores the intermediate representation. Examples include splitting text into smaller chunks, extracting log mel features from audio files (as most speech models use log mel features), applying filters to images, etc. The intermediate/processed content will live in buffers, a logical storage abstraction that triggers the extractors when data lands in them.

So it will look something like: Content -> Transformers -> Buffer -> Extractors -> Index (continuously)
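
A rough sketch of that flow, purely for illustration. Every name here (`split_text`, `Buffer`, `bind`, `write`, `ingest`, and the two toy extractors) is hypothetical and not part of the Indexify API; the "embedding" and "NER" steps are trivial stand-ins for real models. The point is only that the text is split once by a transformer and that both extractors are triggered when a chunk lands in the buffer:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def split_text(text: str, chunk_size: int = 5) -> List[str]:
    """Hypothetical transformer: split text into chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]


@dataclass
class Buffer:
    """Hypothetical buffer: stores transformed content and triggers bound extractors."""
    extractors: List[Callable[[str], None]] = field(default_factory=list)
    items: List[str] = field(default_factory=list)

    def bind(self, extractor: Callable[[str], None]) -> None:
        self.extractors.append(extractor)

    def write(self, item: str) -> None:
        # Data landing in the buffer triggers every bound extractor.
        self.items.append(item)
        for extractor in self.extractors:
            extractor(item)


# Toy in-memory "indexes", keyed by extractor name.
index: Dict[str, list] = {"embedding": [], "ner": []}


def embedding_extractor(chunk: str) -> None:
    # Stand-in for an embedding model: stores the chunk with its character length.
    index["embedding"].append((chunk, len(chunk)))


def ner_extractor(chunk: str) -> None:
    # Stand-in for NER: treats capitalized words as entities.
    index["ner"].append([w for w in chunk.split() if w[:1].isupper()])


def ingest(content: str, transformer: Callable[[str], List[str]], buffer: Buffer) -> None:
    # Content -> Transformer -> Buffer (-> Extractors -> Index via the buffer trigger).
    for chunk in transformer(content):
        buffer.write(chunk)


buffer = Buffer()
buffer.bind(embedding_extractor)
buffer.bind(ner_extractor)
ingest(
    "Indexify extracts structured data from documents that Alice and Bob upload daily",
    split_text,
    buffer,
)
```

With this shape the chunking runs once per piece of content regardless of how many extractors are bound to the buffer, which is the duplication the proposal aims to remove.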

yenicelik commented 10 months ago

could buffers be a persistent queue like Kafka or Redis, i.e. serialized through protobuf? or were you thinking of something more structured?