kimjammer / Neuro

A recreation of Neuro-Sama originally created in 7 days.
MIT License
198 stars, 26 forks

Multimodality Feature #4

Closed: kimjammer closed this 5 months ago

kimjammer commented 6 months ago

While the basic structure for multimodality integration is already in the code, I cannot find a suitable model to run it with. Most models either have a projector with too low a resolution (~400px) or an underlying LLM that is too weak to be usable. The only open-source multimodal LLM that seems to combine a high enough resolution, enough intelligence, and good enough OCR is OpenGVLab/InternVL, but that model is far too large to run on any hardware I have access to.

If new models come out that meet the above requirements, please let me know about them in this issue. Thanks!

kimjammer commented 5 months ago

New multimodal LLMs are coming out 🎉. I'm currently investigating Phi-3-Vision and MiniCPM-Llama3-V-2_5. If you have any thoughts, let me know.