You can refer to this document, and to the single-sample inference section in llava's best practice: https://github.com/modelscope/swift/blob/main/docs/source/LLM/LLM%E5%BE%AE%E8%B0%83%E6%96%87%E6%A1%A3.md#%E5%BE%AE%E8%B0%83%E5%90%8E%E6%A8%A1%E5%9E%8B

Additionally, `model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)` needs to be added after loading the base model.
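For reference, the single-sample flow from that doc looks roughly like this. This is a minimal sketch: `ckpt_dir`, the chosen `model_type`, and the query format are placeholders/assumptions you would adapt to your own fine-tuned checkpoint.

```python
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

from swift.llm import (
    ModelType, get_default_template_type,
    get_model_tokenizer, get_template, inference,
)
from swift.tuners import Swift

# Placeholders: point these at your own base model and fine-tuned checkpoint
ckpt_dir = 'output/qwen-vl-chat/vx-xxx/checkpoint-xxx'
model_type = ModelType.qwen_vl_chat

template_type = get_default_template_type(model_type)
model, tokenizer = get_model_tokenizer(model_type, model_kwargs={'device_map': 'auto'})

# The extra line mentioned above: load the LoRA weights on top for inference
model = Swift.from_pretrained(model, ckpt_dir, inference_mode=True)
template = get_template(template_type, tokenizer)

# qwen-vl takes images inline via <img>...</img> tags; other models differ,
# so check the linked best-practice docs for the exact query format
query = 'Picture 1:<img>https://example.com/cat.png</img>\nWhat is in the image?'
response, history = inference(model, template, query)
print(response)
```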
Thanks for the response! This single-sample inference has a long startup time, correct? How would I serve this behind a web UI or a service handling hundreds of queries, with total latency under 5 seconds?
Thanks for your work and the repo!
As I understand it, inference for multimodal LLMs (e.g. llava, qwen-vl) can only be run in batch via the scripts provided here: https://github.com/modelscope/swift/blob/main/docs/source/Multi-Modal/llava最佳实践.md#微调后推理. Is that right?
Any suggestions on how I can serve these models for live inference (e.g. exposing the service via a port)?
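One generic way to expose the single-sample inference above over a port is a small HTTP wrapper that loads the model once at startup, so each request only pays the inference cost. The sketch below uses FastAPI/uvicorn and a `/generate` route, which are my own choices and not part of swift; `model_setup` is a hypothetical module containing the loading snippet from earlier in this thread.

```python
# Minimal HTTP wrapper around the single-sample inference shown above.
# FastAPI/uvicorn and the /generate route are assumptions, not part of swift.
from fastapi import FastAPI
from pydantic import BaseModel

# `model`, `template`, and `inference` come from the loading snippet above,
# placed in a module named model_setup (hypothetical name)
from model_setup import model, template, inference

app = FastAPI()

class GenerateRequest(BaseModel):
    query: str

@app.post('/generate')
def generate(req: GenerateRequest):
    # Model stays resident in memory; only inference runs per request
    response, _ = inference(model, template, req.query)
    return {'response': response}

# Run with: uvicorn server:app --host 0.0.0.0 --port 8000
```

Note that a single process like this handles requests serially; for hundreds of concurrent queries you would still need request batching or multiple workers behind a load balancer.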