run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Feature Request]: Implement a MultiModalChatEngine #14673

Closed: gich2009 closed this issue 5 days ago

gich2009 commented 2 weeks ago

Feature Description

Basically, a SimpleChatEngine equivalent for multimodal models, so that they can also accept memory and chat_history parameters.
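For concreteness, here is a rough sketch of what I have in mind. The MultiModalChatEngine name, constructor, and chat() signature below are all hypothetical (nothing like this exists yet); they just mirror SimpleChatEngine and route images through the multimodal LLM's .complete() call:

from typing import List, Optional

from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.core.schema import ImageDocument


class MultiModalChatEngine:
    """Hypothetical: a SimpleChatEngine-style wrapper around a multimodal LLM."""

    def __init__(self, mm_llm, memory: Optional[ChatMemoryBuffer] = None):
        self._mm_llm = mm_llm
        self._memory = memory or ChatMemoryBuffer.from_defaults()

    def chat(self, message: str, image_documents: Optional[List[ImageDocument]] = None):
        # Record the user turn, replay the history as one prompt string,
        # and pass any images through to the underlying .complete() call.
        self._memory.put(ChatMessage(role="user", content=message))
        prompt = "\n".join(str(m) for m in self._memory.get_all())
        response = self._mm_llm.complete(
            prompt=prompt, image_documents=image_documents or []
        )
        # Store the assistant turn so the next call sees it.
        self._memory.put(ChatMessage(role="assistant", content=str(response)))
        return response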

Reason

I don't know if there is already an abstraction for this, but I can't seem to find a good way to make memory play well with the multimodal models.

Value of Feature

For consistency. llama-index generally lets you pass chat_messages, chat_history, or messages into its abstractions, but I can't find a way to do the same for the multimodal classes.

gich2009 commented 2 weeks ago

Just noticed that the OpenAIMultiModal and GeminiMultiModal interfaces have .chat() methods implemented. AnthropicMultiModal has not implemented these yet, but I'm sure they'll land eventually. The issue is that the .chat() methods do not take image_documents as a parameter. Conversely, the .complete() methods accept image_documents but do not take messages. How do I maintain a stateful conversation with the multimodal LLM while passing in image_documents?
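To make the asymmetry concrete (a quick sketch; exact signatures may differ slightly across versions):

from llama_index.core import SimpleDirectoryReader
from llama_index.core.llms import ChatMessage
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

mm_llm = OpenAIMultiModal(model="gpt-4o-mini")
image_documents = SimpleDirectoryReader("./image").load_data()

# .chat() takes a message history but has no image_documents parameter:
mm_llm.chat(messages=[ChatMessage(role="user", content="Describe the image.")])

# .complete() takes image_documents but only a flat string prompt, no history:
mm_llm.complete(prompt="Describe the image.", image_documents=image_documents)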

gich2009 commented 2 weeks ago

Actually, it would be super to allow passing in a memory object, since llama-index offers so many different memory classes.

gich2009 commented 5 days ago

This may not be immediately necessary to implement. Here is a workaround for anyone else who is interested:

from llama_index.core import SimpleDirectoryReader
from llama_index.core.llms import ChatMessage
from llama_index.core.memory import ChatMemoryBuffer
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

openai_mm_llm = OpenAIMultiModal(
    model="gpt-4o-mini",
    api_key=SECRET_KEY,  # assumed to be defined elsewhere
    max_new_tokens=4000,
    image_detail="auto",
    temperature=0,
    timeout=100,
)

image_documents = SimpleDirectoryReader("./image").load_data()
print(image_documents)

if __name__ == "__main__":
    import time

    current_mm_llm = openai_mm_llm
    start_time = time.perf_counter()

    # Keep the conversation state in a memory buffer.
    memory = ChatMemoryBuffer.from_defaults()
    prompt = "please explain what the image contains."
    message = ChatMessage(role="user", content=prompt)
    memory.put(message)

    # Flatten the whole chat history into a single prompt string,
    # since .complete() takes a prompt rather than a list of messages.
    prompt = "\n".join(str(m) for m in memory.get_all())

    response = current_mm_llm.complete(
        prompt=prompt,
        image_documents=image_documents,
    )
    print(response)

    end_time = time.perf_counter()
    print(f"Time taken {end_time - start_time}")