Hey @Weiyun1025! Thank you for opening this issue. I took a brief look at the model repo https://huggingface.co/OpenGVLab/InternVL2-40B/tree/main, and it seems to me that supporting this model should be fairly straightforward (similar to what we did with Phi-3-vision).
Are you planning to make a pull request for this? If so, feel free to take a look at the other vision language model implementations in vLLM, and let us know if you run into any issues. We're happy to help you get this model supported.
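In the meantime, here is a minimal sketch of how an out-of-tree implementation could be wired into vLLM while a proper PR is in progress. The module path `my_internvl2` and the class `InternVLChatModel` are assumptions you would fill in yourself by following the existing vision-language models under `vllm/model_executor/models/`; the architecture name passed to the registry should match the `architectures` field in the model's `config.json`.

```python
# Sketch: registering a locally written InternVL2 implementation with vLLM.
from vllm import ModelRegistry

# Hypothetical local implementation, modeled on vLLM's existing VLM classes.
from my_internvl2 import InternVLChatModel

# Map the architecture name declared in the HF config.json to the vLLM class,
# so vLLM can instantiate it when loading OpenGVLab/InternVL2-* checkpoints.
ModelRegistry.register_model("InternVLChatModel", InternVLChatModel)
```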
If you cannot make a pull request, I will try to see if I have some bandwidth to make a PR on this. Feel free to check out #4194 for the full roadmap around multi-modality.
Thanks!
🚀 The feature, motivation and pitch
InternVL2 is currently the most powerful open-source Multimodal Large Language Model (MLLM). The InternVL2 family includes models ranging from a 2B model, suitable for edge devices, to a 108B model, which is significantly more powerful. Built on larger-scale language models, InternVL2-Pro demonstrates outstanding multimodal understanding, matching the performance of commercial closed-source models across various benchmarks.
Given the significant potential of InternVL2, we believe that integrating it with vLLM would greatly benefit both the vLLM community and users of this model. We kindly request your assistance in enabling the deployment of InternVL2 using the vLLM framework.
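To make the request concrete, below is a rough, hypothetical sketch of what offline inference with InternVL2 through vLLM could look like once support lands. The prompt template and the `<image>` placeholder token are assumptions; the final interface will depend on how the model is actually integrated.

```python
# Hypothetical usage once InternVL2 is supported in vLLM; treat this only as
# the rough shape of a multimodal generate() call, not the final API contract.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="OpenGVLab/InternVL2-40B", trust_remote_code=True)

image = Image.open("example.jpg")
outputs = llm.generate(
    {
        "prompt": "<image>\nDescribe this image.",  # placeholder token is an assumption
        "multi_modal_data": {"image": image},
    },
    SamplingParams(max_tokens=128),
)
print(outputs[0].outputs[0].text)
```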
We look forward to your positive response and are eager to collaborate on this exciting endeavor.
Alternatives
No response
Additional context
Blog: https://internvl.github.io/blog/2024-07-02-InternVL-2.0/
Model Family: https://huggingface.co/collections/OpenGVLab/internvl-20-667d3961ab5eb12c7ed1463e