While the basic structure for multimodality integration is there in the code, I can't find a suitable model to run it with. For most models, either the projector is too low-resolution (~400 px) or the underlying LLM is too weak to be usable. The only open-source multimodal LLM that has a high enough resolution, is smart enough, and is good enough at OCR seems to be OpenGVLab/InternVL, but that model is far too large to run on anything I have access to.
If new models come out that meet the above requirements, please let me know about them in this issue. Thanks!