Closed by pranav-kural 1 month ago
On testing, observed that when defining an open-ended chat endpoint, the query never gets sent to the LLM due to an issue in the prompt: the prompt template being used didn't include the {{query}} construct.
Also noticed warnings about prompts with a certain name (e.g., openEndedSystemPrompt) being overwritten.
Fix for the above issue added in pre-release: #59
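To illustrate the bug, here is a minimal sketch of Handlebars-style variable substitution. The `renderTemplate` helper is hypothetical (not the library's actual API); it only shows why a template without the {{query}} construct silently drops the user's query:

```typescript
// Hypothetical minimal template renderer illustrating the bug:
// if the template lacks a {{query}} placeholder, the user's query
// never makes it into the text sent to the LLM.
function renderTemplate(template: string, vars: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in vars ? vars[name] : match
  );
}

// Broken prompt: no {{query}} construct, so the query is dropped.
const brokenPrompt = "You are a helpful assistant. Answer concisely.";
// Fixed prompt: includes the {{query}} placeholder.
const fixedPrompt =
  "You are a helpful assistant. Answer concisely.\n\nUser query: {{query}}";

const query = "What is the capital of France?";
console.log(renderTemplate(brokenPrompt, { query })); // query absent from output
console.log(renderTemplate(fixedPrompt, { query })); // query present in output
```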
Description
Add support for Multimodal I/O: Support multimodal input and output.
Impact (Why is this feature important?)
Will allow users to include media (images, videos, etc.) in their input and receive output that contains media, not just text responses.
Select Components this Feature will Impact
Proposal (Optional)
Will require changes to multiple components:

- `ChatAgent` class: should have a method that can handle multimodal input and generate multimodal output.
- `defineChatEndpoint`: use the new method created in the `ChatAgent` class to support multimodality.
- Could roll out support for multimodal input/output only for chat endpoints not using chat history and RAG, or simply specify in the documentation that non-text information will not work with chat history and RAG.
Alternatives (Optional)
Can still use chat endpoints to generate multimedia content; the output will likely contain a URL to the generated content.
Cannot currently provide multimedia content to models that support multimodal input.
Resources (Optional)