You can refer to this document, and to the single-sample inference section in llava's best practice: https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.md#%E5%BE%AE%E8%B0%83%E5%90%8E%E6%A8%A1%E5%9E%8B

Additionally, `model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)` needs to be added after loading the base model.
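For reference, the single-sample flow from that doc looks roughly like this. This is a minimal sketch: `ckpt_dir`, the chosen `model_type`, and the query format are placeholders/assumptions you would adapt to your own fine-tuned checkpoint.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_default_template_type,
    get_model_tokenizer, get_template, inference,
)
from swift.tuners import Swift

# Placeholders: point these at your own base model and fine-tuned checkpoint
ckpt_dir = 'output/qwen-vl-chat/vx-xxx/checkpoint-xxx'
model_type = ModelType.qwen_vl_chat

template_type = get_default_template_type(model_type)
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})

# The extra line mentioned above: load the LoRA weights on top for inference
model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)
template = get_template(template_type, tokenizer)

# qwen-vl takes images inline via <img>...</img> tags; other models differ,
# so check the linked best-practice docs for the exact query format
query = 'Picture 1:<img>https://example.com/cat.png</img>\nWhat is in the image?'
response, history = inference(model, template, query)
print(response)
```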
Thanks for the response! This single-sample inference has a long startup time, correct? How would I serve this behind a web UI or a service handling hundreds of queries, with total latency under 5 seconds?
Thanks for your work and the repo!
As I understand it, inference for multimodal LLMs (e.g. llava, qwen-vl) can only be run in batch via the scripts provided here: https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/llava最佳实践.md#微调后推理. Is that right?
Any suggestions on how I can serve these models for live inference (e.g. exposing the service via a port)?
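One generic way to expose the single-sample inference above over a port is a small HTTP wrapper that loads the model once at startup, so each request only pays the inference cost. The sketch below uses FastAPI/uvicorn and a `/generate` route, which are my own choices and not part of swift; `model_setup` is a hypothetical module containing the loading snippet from earlier in this thread.

```python
# Minimal HTTP wrapper around the single-sample inference shown above.
# FastAPI/uvicorn and the /generate route are assumptions, not part of swift.
from fastapi import FastAPI
from pydantic import BaseModel

# `model`, `template`, and `inference` come from the loading snippet above,
# placed in a module named model_setup (hypothetical name)
from model_setup import model, template, inference

app = FastAPI()

class GenerateRequest(BaseModel):
    query: str

@app.post('/generate')
def generate(req: GenerateRequest):
    # Model stays resident in memory; only inference runs per request
    response, _ = inference(model, template, req.query)
    return {'response': response}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
```

Note that a single process like this handles requests serially; for hundreds of concurrent queries you would still need request batching or multiple workers behind a load balancer.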