huggingface / candle

Minimalist ML framework for Rust
Apache License 2.0

Support multimodal LLMs? #1947

Open guoqingbao opened 3 months ago

guoqingbao commented 3 months ago

Do you have any plans to support multimodal LLMs, such as MiniGPT-4/MiniGPT-v2 (https://github.com/Vision-CAIR/MiniGPT-4/) and LLaVA (https://github.com/haotian-liu/LLaVA/)? Supporting these popular multimodal LLMs would be a significant enhancement to Candle. I believe they are built upon Vision Transformer (ViT) encoders (CLIP, BLIP, etc.) and foundational language models such as LLaMA and Mistral, which are already supported in Candle.
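The architecture described above, in which a vision encoder's patch embeddings are mapped by a small projection into the language model's token-embedding space and prepended to the text tokens, can be sketched roughly as follows. This is a minimal illustration in plain Rust, not candle's API; all dimensions, names, and weights are hypothetical.

```rust
// Conceptual sketch of LLaVA-style modality fusion: a (frozen) vision
// tower yields patch embeddings, a trainable linear projection maps them
// into the LLM's hidden dimension, and the projected "visual tokens" are
// prepended to the text token embeddings before the LLM runs.
// All names and dimensions here are illustrative, not candle's API.

/// Linear projection W * x (no bias) from vision dim to LLM hidden dim.
fn project(patch: &[f32], w: &[Vec<f32>]) -> Vec<f32> {
    w.iter()
        .map(|row| row.iter().zip(patch).map(|(a, b)| a * b).sum())
        .collect()
}

fn main() {
    let vision_dim = 4; // e.g. 1024 for a CLIP ViT-L tower
    let hidden_dim = 3; // e.g. 4096 for LLaMA-7B

    // One image patch embedding from the vision encoder (stubbed).
    let patch = vec![1.0_f32; vision_dim];
    // Trainable projection matrix (hidden_dim x vision_dim), here all 0.5.
    let w = vec![vec![0.5_f32; vision_dim]; hidden_dim];

    let visual_token = project(&patch, &w);
    assert_eq!(visual_token, vec![2.0, 2.0, 2.0]);

    // Text token embeddings from the LLM's embedding table (stubbed).
    let text_tokens = vec![vec![0.0_f32; hidden_dim]; 2];

    // The multimodal input sequence: visual tokens first, then text.
    let mut sequence = vec![visual_token];
    sequence.extend(text_tokens);
    println!("sequence length: {}", sequence.len()); // 1 visual + 2 text
}
```

In the real models the projection is the only new trainable piece (LLaVA uses an MLP, MiniGPT-4 a single linear layer), which is why reusing Candle's existing CLIP and LLaMA implementations would cover most of the work.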

LaurentMazare commented 3 months ago

A step in this direction is @santiagomed adding moondream to candle-transformers, see this readme.

guoqingbao commented 2 months ago

> A step in this direction is @santiagomed adding moondream to candle-transformers, see this readme.

Fantastic! Thanks for the new model!

ghost commented 2 months ago

I'll give LLaVA a shot. Would be great to have more multimodal models in here.

EDIT: Been busy but still want to work on this. Will pop into Discord to chat with folks about how to approach this.

chenwanqq commented 4 weeks ago

I have implemented LLaVA at candle-llava. Will contribute to this project soon.

LaurentMazare commented 4 weeks ago

> I have implemented LLaVA at candle-llava. Will contribute to this project soon.

Sounds great, looking forward to having this included!

louis030195 commented 3 weeks ago

bump for https://huggingface.co/microsoft/Phi-3-vision-128k-instruct :D