modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 300+ LLMs or 50+ MLLMs. (Qwen2, GLM4v, Internlm2.5, Yi, Llama3.1, Llava-Video, Internvl2, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/
Apache License 2.0

Live inference of fine-tuned multimodal LLMs? #615

Open babla9 opened 3 months ago

babla9 commented 3 months ago

Thanks for your work and the repo!

As I understand it, inference for multimodal LLMs (e.g. LLaVA, Qwen-VL) can only be run in batch mode via the provided scripts here: https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/llava最佳实践.md#微调后推理 — is that correct?

Any suggestions on how I can serve these models for live inference (eg exposing the service via port)?

Jintao-Huang commented 3 months ago

You can refer to this document, as well as the single-sample inference section in LLaVA's best practice: https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.md#%E5%BE%AE%E8%B0%83%E5%90%8E%E6%A8%A1%E5%9E%8B

Jintao-Huang commented 3 months ago

Additionally, model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True) needs to be added.
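
For reference, a minimal sketch of single-sample inference after fine-tuning, based on the linked fine-tuning document. The helper names (get_model_tokenizer, get_default_template_type, get_template, inference), the chosen model_type, and the ckpt_dir path are assumptions taken from that doc as illustration; adapt them to your own checkpoint.

```python
from swift.llm import (
    ModelType, get_model_tokenizer, get_default_template_type,
    get_template, inference,
)
from swift.tuners import Swift

# Hypothetical example values: replace with your model type and checkpoint directory.
model_type = ModelType.qwen_vl_chat
ckpt_dir = 'output/qwen-vl-chat/vx-xxx/checkpoint-xxx'

# Load the base model and tokenizer.
template_type = get_default_template_type(model_type)
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})

# Attach the fine-tuned (LoRA/PEFT) weights, as noted above.
model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)

# Run a single query through the chat template.
template = get_template(template_type, tokenizer)
query = 'Describe this image.'
response, history = inference(model, template, query)
print(response)
```

Loading the base weights plus the adapter this way is a one-time cost per process; subsequent inference calls reuse the loaded model.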

babla9 commented 3 months ago

Thanks for the response! This single-sample inference has a long startup time, correct? How would I serve this behind a web UI or a service handling hundreds of queries, where total latency stays under 5 seconds?