run-llama / llama-hub

A library of data loaders for LLMs made by the community -- to be used with LlamaIndex and/or LangChain
https://llamahub.ai/
MIT License
3.44k stars 731 forks

[Feature Request] Also extract `ImageDocument` nodes from PDFs #326

Open jon-chuang opened 1 year ago

jon-chuang commented 1 year ago

Start with pdf_miner - it can already extract images.

There should be some discussion about the plan for extracting multimedia - the surrounding context as well as metadata could make a difference in making that piece of media useful. I will do some research on existing literature.
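To make "surrounding context as well as metadata" concrete, here is a minimal sketch of a record that could travel with each extracted image. `ExtractedImage` and all of its fields are hypothetical names for illustration only; they are not existing LlamaIndex or llama-hub classes.

```python
from dataclasses import dataclass, field


@dataclass
class ExtractedImage:
    """Hypothetical record for one image pulled from a PDF (not a library class)."""

    image_bytes: bytes
    page_number: int
    text_before: str = ""  # text that precedes the image on the page
    text_after: str = ""   # text that follows the image on the page
    metadata: dict = field(default_factory=dict)  # e.g. caption, bounding box

    def context(self, max_chars: int = 200) -> str:
        """Return the text nearest the image: tail of what precedes it, head of what follows."""
        before = self.text_before[-max_chars:] if max_chars > 0 else ""
        after = self.text_after[:max_chars] if max_chars > 0 else ""
        return f"{before}\n{after}".strip()


img = ExtractedImage(
    image_bytes=b"",
    page_number=3,
    text_before="Figure 2 shows quarterly revenue.",
    text_after="Revenue grew in every quarter.",
    metadata={"caption": "Figure 2"},
)
print(img.context())
```

Keeping the context as plain text next to the image bytes means whichever model later processes the image can be given the same neighbouring text.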

Starting with these simple connector lego pieces is good, but I think we need to focus on some key workflows (e.g. research paper, business report, chatbot from company website) and optimize those with some good defaults, based on a reading of the literature, others' applications, and our own experiments.

I am particularly interested in an extraction workflow that leverages an agent. For instance, how should one decide whether to include the output from a natural image captioning model vs. a tabular model in the context? How can one avoid running multiple models on a single image, by instead guessing the image type from the surrounding context and selecting the most relevant model?
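As a rough illustration of guessing the image type from the surrounding text, here is a minimal keyword-based sketch. The model names, the hint words, and `choose_model` are all assumptions made up for this example; an agent or a trained classifier could replace the keyword count.

```python
import re

# Hypothetical hint words per model; these names are placeholders, not real models.
MODEL_HINTS = {
    "table_model": ("table", "row", "column", "tabulated"),
    "chart_model": ("chart", "plot", "graph", "axis"),
}
DEFAULT_MODEL = "caption_model"  # fallback: a natural image captioning model


def choose_model(context: str) -> str:
    """Pick one model for an image by counting hint words in its surrounding text."""
    words = re.findall(r"[a-z]+", context.lower())
    scores = {
        name: sum(words.count(hint) for hint in hints)
        for name, hints in MODEL_HINTS.items()
    }
    # On a tie, max() keeps the first entry in MODEL_HINTS order.
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else DEFAULT_MODEL


print(choose_model("Table 3 lists each row and column of results"))
print(choose_model("A photo of the campus"))
```

Only the selected model is then run on the image, so each image costs one model call rather than one call per candidate model.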

jon-chuang commented 1 year ago

Anyway, if this project goes well, perhaps it warrants a rigorous investigation (and an accompanying research paper/blog) into the degree to which certain techniques increase recall on a multimedia knowledge base.

jon-chuang commented 1 year ago

As a first approximation, automatic dispatch may be too hard. A UI where a human can easily batch-review the media and specify the model dispatch based on its content would be desirable.
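The data flow behind such a review step could look roughly like the sketch below: group images by a guessed type, let the reviewer pick one model per group, then expand that back to a per-image assignment. `group_by_guess`, `apply_review`, and the type and model names are hypothetical, and the UI itself is out of scope here.

```python
from collections import defaultdict


def group_by_guess(items):
    """Group (image_id, guessed_type) pairs so a reviewer can inspect one batch per type."""
    groups = defaultdict(list)
    for image_id, guessed_type in items:
        groups[guessed_type].append(image_id)
    return dict(groups)


def apply_review(groups, decisions, default_model="caption_model"):
    """Expand the reviewer's per-group choice into an image_id -> model mapping."""
    return {
        image_id: decisions.get(guessed_type, default_model)
        for guessed_type, image_ids in groups.items()
        for image_id in image_ids
    }


items = [("img1", "table"), ("img2", "photo"), ("img3", "table")]
groups = group_by_guess(items)
# The reviewer only decides for the groups they want to override.
assignment = apply_review(groups, {"table": "table_model"})
print(assignment)
```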

Generally, the paradigm is - get something working nearly instantly, then have the tools to inspect, analyse and refine.