run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Question]: Ingesting Powerpoints with graphs and Images #16693

Open · martinb-ai opened this issue 1 month ago

martinb-ai commented 1 month ago

Question

Hi there!

I am curious how to handle PowerPoints that contain images and graphs in addition to text. What is the most efficient way to embed this information?

I know we can use a multimodal model to analyze the images and generate descriptions of what it sees, but is there a better, more efficient way to do this in llama-index?

Or is it better to convert the .pptx into a PDF and ingest it that way? Are images handled better in PDFs?

Thanks!

dosubot[bot] commented 1 month ago

Hey @martinb-ai! 👋 I'm here to help you with any bugs, questions, or contributions you have while you wait for a human maintainer. Let's tackle this together!

I found a similar unsolved discussion regarding parsing multimedia content of PowerPoint files [1]. Additionally, LlamaIndex supports integration with multimodal models for processing images and graphs through its integration with various multimodal language models, including those from Hugging Face, Ollama, and OpenAI. This integration enables advanced multimodal capabilities such as image captioning and visual question answering [2].
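
For example, here is a minimal sketch of captioning a single slide image with one of those multimodal integrations. This assumes the OpenAI integration is installed (`pip install llama-index-multi-modal-llms-openai`), an `OPENAI_API_KEY` is set in the environment, and `slide_image.png` is a placeholder for an image exported from a slide:

```python
# Minimal sketch: describe one slide image with a multimodal LLM.
# "slide_image.png" is a placeholder path; the model name assumes a
# vision-capable OpenAI model is available to your API key.
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

mm_llm = OpenAIMultiModal(model="gpt-4o")

# SimpleDirectoryReader loads image files as image documents
image_docs = SimpleDirectoryReader(input_files=["slide_image.png"]).load_data()

response = mm_llm.complete(
    prompt="Describe the chart or image on this slide.",
    image_documents=image_docs,
)
print(response.text)
```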

For handling PowerPoints with images and graphs, you can use the PptxReader class in LlamaIndex, which extracts text and captions images from .pptx files. It uses a pre-trained model to generate captions for images, which involves temporarily saving the image and processing it with a vision encoder-decoder model. However, for other types of multimedia content like graphs, additional handling and models would be required, which are not currently implemented in the provided code [3].
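
A rough usage sketch of that reader, assuming `llama-index-readers-file` plus its optional dependencies (python-pptx, torch, transformers, Pillow) are installed; `deck.pptx` is a placeholder path:

```python
# Minimal sketch: extract slide text plus generated image captions
# from a .pptx with PptxReader.
from pathlib import Path

from llama_index.readers.file import PptxReader

reader = PptxReader()  # loads the image-captioning model on init
documents = reader.load_data(file=Path("deck.pptx"))

# extracted slide text, with generated captions for embedded images
for doc in documents:
    print(doc.text[:200])
```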

If you are looking for a more efficient way, you might consider creating separate indexes for each type of document or content, as each modality requires a specific embedding model. This approach ensures optimal performance when searching over a knowledge base with documents requiring different kinds of embeddings, including multimodal ones [4].
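
A hedged sketch of that one-index-per-modality setup, assuming the text and image content have already been extracted into `./slides_text` and `./slides_images` (hypothetical folders), and that a CLIP embedding is available for the image side (`pip install llama-index-embeddings-clip`):

```python
# Sketch: separate indexes per modality, each with its own embedding.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.indices import MultiModalVectorStoreIndex

text_docs = SimpleDirectoryReader("./slides_text").load_data()
image_docs = SimpleDirectoryReader("./slides_images").load_data()

# text uses the configured text embedding model; the multimodal index
# embeds images (e.g. with CLIP) alongside any text
text_index = VectorStoreIndex.from_documents(text_docs)
image_index = MultiModalVectorStoreIndex.from_documents(image_docs)
```

At query time you can then route a question to the index whose embedding space matches the content you are searching over.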

To continue talking to Dosu, mention @dosu.

logan-markewich commented 1 month ago

Have you seen this example? https://github.com/run-llama/llama_parse/blob/main/examples/multimodal/multimodal_rag_slide_deck.ipynb

martinb-ai commented 1 month ago

@logan-markewich Hi there! I read through it, but I am curious: is there a way to do this without having to use LlamaParse?