nomic-ai / gpt4all

GPT4All: Run Local LLMs on Any Device. Open-source and available for commercial use.
https://nomic.ai/gpt4all
MIT License

Support multimodal models such as LLaVA for image input #1568

Open · cebtenzzre opened this issue 11 months ago

cebtenzzre commented 11 months ago

Feature request

We can make use of the upstream work at https://github.com/ggerganov/llama.cpp/pull/3436 to support image input to LLMs.
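As a rough sketch (not gpt4all code), this is how the `clip.h` / `llava.h` helpers from llama.cpp's llava example could be driven to feed an image into the model's context. The helper names and signatures here (`clip_model_load`, `llava_image_embed_make_with_filename`, `llava_eval_image_embed`) follow the upstream llava example as it later stabilized, so treat them as assumptions rather than exactly what #3436 first landed:

```cpp
// Sketch: load a LLaVA language model plus its CLIP projector, embed one
// image, and evaluate the embedding into the llama context. Text prompt
// tokenization/decoding would follow after n_past, as in llama.cpp's main example.
#include "clip.h"
#include "llava.h"
#include "llama.h"

#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 4) {
        fprintf(stderr, "usage: %s <model.gguf> <mmproj.gguf> <image>\n", argv[0]);
        return 1;
    }
    const char * model_path  = argv[1]; // base LLaVA language model
    const char * mmproj_path = argv[2]; // CLIP vision encoder + projector
    const char * image_path  = argv[3];

    llama_backend_init(false);

    // Load the language model and create a context.
    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(model_path, mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.n_ctx = 2048;
    llama_context * ctx = llama_new_context_with_model(model, cparams);

    // Load the CLIP encoder/projector and turn the image into embeddings.
    clip_ctx * ctx_clip = clip_model_load(mmproj_path, /*verbosity=*/1);
    llava_image_embed * embed =
        llava_image_embed_make_with_filename(ctx_clip, /*n_threads=*/4, image_path);

    // Evaluate the image embeddings so they become part of the LLM's context.
    int n_past = 0;
    llava_eval_image_embed(ctx, embed, /*n_batch=*/512, &n_past);
    printf("image occupies %d positions; n_past = %d\n", embed->n_image_pos, n_past);

    llava_image_embed_free(embed);
    clip_free(ctx_clip);
    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

The same flow (embed the image once, then generate text conditioned on it) is what the GUI and bindings would ultimately need to expose.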

@AndriyMulyar What was the name of the model that you wanted to consider as an alternative to LLaVA?

Motivation

Real-time image recognition on resource-constrained hardware would be very useful in applications such as robotics. This feature would open the door to broader use cases for GPT4All than simple text completion.

Your contribution

I may submit a pull request implementing this functionality.

AndriyMulyar commented 11 months ago

Fuyu 8B is interesting because it's decoder-only.

I think a LLaVA-style approach is a fine choice, though, for an initial multimodal implementation.

manyoso commented 11 months ago

This will require extensive changes to the GUI as well. It has been agreed that the GUI changes will come first, to provide a UI for the multimodal support that already exists upstream.

eiko4 commented 11 months ago

+1

PedzacyKapec commented 10 months ago

+1