run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Analyze an image with GPT4 Vision? #12978

Open WAS-PlaiLabs opened 2 months ago

WAS-PlaiLabs commented 2 months ago


Question

I seem to remember being able to use SimpleDirectoryReader to describe images as tags just fine, until now. Now I see this whole, quite elaborate MultiModal thing just to analyze an image, when GPT-4 Vision already accepts an image buffer to look at. What happened?

Similarly, I was able to use GPT-4 Vision with videos and audio through SimpleDirectoryReader, and it was quite an easy process, but now I see very drawn-out multimodal setups and dependencies.

dosubot[bot] commented 2 months ago

The shift from SimpleDirectoryReader to a more elaborate MultiModal setup for image analysis with GPT-4 Vision is aimed at leveraging the capabilities of Large Multi-modal Models (LMMs) like GPT-4V. Allowing images and text to be passed in jointly makes responses more precise and contextual: the MultiModal approach enables more nuanced interaction with visual data, incorporates detailed prompts, and guides the model through the analysis more systematically.

This move towards MultiModal setups is designed to harness the full potential of GPT-4 Vision, so that image analysis is not only direct and efficient but also enriched with contextual understanding. It supports deeper integration of text and image analysis, which matters for applications that need detailed analysis of visual data alongside textual information: context from one modality can significantly influence the interpretation of the other, reducing hallucinations and leading to more accurate, less error-prone responses.

For further details and examples of how to implement and benefit from the MultiModal setup, refer to the sources below.

#### Details

**Sources**

- [docs/docs/examples/multi_modal/gpt4v_multi_modal_retrieval.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/multi_modal/gpt4v_multi_modal_retrieval.ipynb)
- [docs/docs/examples/multi_modal/gpt4v_experiments_cot.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/multi_modal/gpt4v_experiments_cot.ipynb)
- [docs/docs/module_guides/models/multi_modal.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/module_guides/models/multi_modal.md)
- [llama-index-core/llama_index/core/readers/file/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/readers/file/base.py)
- [llama-index-legacy/llama_index/legacy/readers/file/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/readers/file/base.py)
- [docs/docs/module_guides/loading/simpledirectoryreader.md](https://github.com/run-llama/llama_index/blob/main/docs/docs/module_guides/loading/simpledirectoryreader.md)
- [docs/docs/examples/data_connectors/simple_directory_reader.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/data_connectors/simple_directory_reader.ipynb)
**About Dosu**

This response is meant to be useful and save you time. It is **not meant to be a precise solution**, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

logan-markewich commented 2 months ago

You can still give an image to gpt4

https://docs.llamaindex.ai/en/stable/module_guides/models/multi_modal/

Here's a fun one https://docs.llamaindex.ai/en/stable/examples/multi_modal/multi_modal_pydantic/?h=multimodal
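
For reference, a minimal sketch of that direct path, assuming the current split-package layout (`llama-index-multi-modal-llms-openai` installed); the image folder path and prompt are placeholders, not from this thread:

```python
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Load image files as ImageDocuments (the path is a placeholder).
image_documents = SimpleDirectoryReader("./images").load_data()

# GPT-4V; reads OPENAI_API_KEY from the environment by default.
mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)

# Describe the images directly; no index or query engine required.
response = mm_llm.complete(
    prompt="Describe these images as a short list of tags.",
    image_documents=image_documents,
)
print(response)
```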

WAS-PlaiLabs commented 2 months ago

> You can still give an image to gpt4
>
> https://docs.llamaindex.ai/en/stable/module_guides/models/multi_modal/
>
> Here's a fun one https://docs.llamaindex.ai/en/stable/examples/multi_modal/multi_modal_pydantic/?h=multimodal

The issue I'm facing is that this breaks the entire paradigm of our node structure: we can't use service_context with multimodal models because of an error about a missing callback manager (or something similar), so we can't even use MultiModal with our existing setups and have to rewrite the entire base. service_context is a foundational part of how we get the right model context to the right nodes within our system.

It seems important to go through deprecation periods where there is legacy support and enough time to implement changes, instead of the whole system breaking from one update to the next. There is no indication on the GitHub page that this is an unstable alpha project when installing through pip.

logan-markewich commented 2 months ago

The callback thing is a bug; it should be fixed.

Service context is deprecated but still fully supported for the coming months.

WAS-PlaiLabs commented 2 months ago

What is the replacement for passing a specific model to vector indexes and everything else, if service_context is deprecated? We use multiple models at once and need a clear separation of flows.

logan-markewich commented 2 months ago

You pass models in where they are used:

`VectorStoreIndex(..., embed_model=embed_model)`

`index.as_query_engine(..., llm=llm)`

Every API that uses an embedding model or LLM accepts it directly. You can also change the global defaults. This is detailed here: https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/service_context_migration/?h=settings

https://docs.llamaindex.ai/en/stable/module_guides/supporting_modules/settings/?h=settings
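
A short sketch of the Settings-based defaults those pages describe (the model names here are illustrative assumptions):

```python
from llama_index.core import Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

# Global defaults, used by any component that isn't given a model explicitly.
Settings.llm = OpenAI(model="gpt-4")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

# A model passed directly to a component still takes precedence, e.g.
# index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo")).
```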

logan-markewich commented 2 months ago

It's very explicit now which objects and classes use which models.

For example, you don't need to initialize an LLM to build a vector index.
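
A minimal sketch of that separation (the data path and model names are illustrative assumptions): the index is built with only an embedding model, and each flow attaches its own LLM at query time.

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

documents = SimpleDirectoryReader("./data").load_data()

# Building the index only needs an embedding model; no LLM is involved here.
index = VectorStoreIndex.from_documents(
    documents, embed_model=OpenAIEmbedding(model="text-embedding-3-small")
)

# Separate flows over the same index, each with its own LLM.
fast_engine = index.as_query_engine(llm=OpenAI(model="gpt-3.5-turbo"))
strong_engine = index.as_query_engine(llm=OpenAI(model="gpt-4"))

print(fast_engine.query("Summarize the documents in one sentence."))
print(strong_engine.query("Analyze the documents in detail."))
```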

WAS-PlaiLabs commented 2 months ago

Another question, wrapping back to the callback error: when will that fix be released in the pip packages? I seem to be up to date on all packages but still get the callback error with the OpenAI multimodal model.

`ValueError: "OpenAIMultiModal" object has no field "callback_manager"`

I remember I also got an error when I tried skipping service_context and just passing documents for part of this process; I'll need to investigate what was wrong there. I believe all the querying with the API is using an LLM for the most part, or using the model's own methods like .chat(), .chat_engine(), or .complete().

logan-markewich commented 2 months ago

It's not fixed. I know what the issue is; I just need to make time to fix it (or you or someone else could make the PR).

But it's also the weekend and I'm mostly AFK for now.

WAS-PlaiLabs commented 2 months ago

Did this patch slip in somewhere? Sorry, I've been trying to monitor the commits but may have missed it.

WAS-PlaiLabs commented 1 month ago

I still get the same error about no callback manager field when I use MultiModal with a query engine: `"OpenAIMultiModal" object has no field "callback_manager"`. I see there is a parameter for a callback manager on OpenAIMultiModal; does that need to be explicitly set with a callback manager?