keeganmccallum opened this issue 1 year ago
Please help us by contributing PRs. We do not have the bandwidth to work on this now.
The approach in that comment doesn't seem to be fully baked into vLLM yet. Having written the multi-model worker, I would say FastChat should wait until vLLM has native support for loading PeftModels and then add a multi_vllm_model_worker that leverages that support.
To make this happen faster, the best step is probably to encourage vLLM to implement support for PEFT models (ideally in a way that is more flexible than how PEFT itself does it).
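In the meantime, the weight-sharing idea can be spiked outside vLLM with the peft library directly: load the base model once and attach several adapters on top of it. A minimal sketch of that pattern (the base model ID and adapter repo names below are placeholders, not anything FastChat ships):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the shared base weights once.
base_id = "meta-llama/Llama-2-7b-hf"  # placeholder base model
base = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(base_id)

# Attach a first adapter; only the small LoRA weights are added on top of the base.
model = PeftModel.from_pretrained(base, "your-org/llama2-lora-sql", adapter_name="sql")
# Further adapters reuse the same base weights already in memory.
model.load_adapter("your-org/llama2-lora-chat", adapter_name="chat")

# Route each incoming request to the adapter it asks for before generating.
model.set_adapter("chat")
inputs = tokenizer("Hello!", return_tensors="pt").to(base.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

That is roughly the shape a PEFT weight-sharing worker could take per request; the missing piece is getting the same sharing behind vLLM's serving path.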
Now that vLLM supports Llama 2 properly, it would be awesome to have a PEFT weight-sharing worker like the one for regular model serving. The comment at the end of this issue is probably enough to spike one:
https://github.com/vllm-project/vllm/issues/182
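For reference, newer vLLM releases expose multi-LoRA serving (enable_lora plus per-request LoRARequest objects). A minimal sketch of how a multi_vllm_model_worker could lean on that, assuming that API; the model name and adapter paths below are placeholders:

```python
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# One engine holds the shared base weights; adapters are selected per request.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_lora=True)  # placeholder base model
params = SamplingParams(temperature=0.7, max_tokens=64)

# Each adapter gets a name, a unique integer id, and a local path (placeholders).
sql_adapter = LoRARequest("sql", 1, "/adapters/llama2-lora-sql")
chat_adapter = LoRARequest("chat", 2, "/adapters/llama2-lora-chat")

# The worker would map the requested model name to the matching LoRARequest.
outputs = llm.generate(
    ["Write a SQL query that lists all users."],
    params,
    lora_request=sql_adapter,
)
print(outputs[0].outputs[0].text)
```

The per-request lora_request argument is essentially the hook such a worker needs: one vLLM engine, many registered adapters, and the worker only routing requested model names to adapters.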