microsoft / MInference

To speed up inference for long-context LLMs, MInference computes attention with approximate, dynamic sparsity, which reduces pre-filling latency by up to 10x on an A100 while maintaining accuracy.
https://aka.ms/MInference
MIT License

[Question]: It seems that MInference does not currently support tensor parallelism under vLLM, since in a multi-GPU environment the head_id is incorrect compared to the single-GPU case #62

Open zh2333 opened 1 month ago

zh2333 commented 1 month ago

Describe the issue

Multi-GPU setup.
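The mismatch described in the title likely comes from how tensor parallelism shards attention heads: each TP rank holds only a slice of the heads, so a per-head sparse-pattern table indexed by global head ids returns the wrong entries when queried with rank-local ids. Below is a minimal sketch of the remapping involved, assuming contiguous head sharding across ranks (a common TP layout); the names are illustrative, not MInference's or vLLM's actual API:

```python
# Sketch (not MInference's actual code): why per-head configs break under
# tensor parallelism. Each TP rank holds a slice of the attention heads,
# so local head ids must be mapped back to global ids before indexing a
# per-head sparse-pattern table.

def global_head_id(tp_rank: int, local_head_id: int,
                   num_heads: int, tp_size: int) -> int:
    """Map a rank-local head index to its global index, assuming heads
    are sharded contiguously across ranks (hypothetical layout)."""
    heads_per_rank = num_heads // tp_size
    return tp_rank * heads_per_rank + local_head_id

# Example: 32 heads, TP=4 -> rank 2 holds global heads 16..23.
# Indexing pattern_table with the *local* id on rank 2 would wrongly
# read the entries for global heads 0..7 instead of 16..23.
pattern_table = {h: f"sparse_pattern_for_head_{h}" for h in range(32)}
tp_rank, tp_size, num_heads = 2, 4, 32
for local_id in range(num_heads // tp_size):
    gid = global_head_id(tp_rank, local_id, num_heads, tp_size)
    assert pattern_table[gid] == f"sparse_pattern_for_head_{gid}"
```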
iofu728 commented 1 month ago

Hi @zh2333,

Thanks for your support.

Currently, the vLLM version does not support TP, but we expect to add this feature by the middle of next month. I'm closing issue #63 as a duplicate.

zh2333 commented 1 month ago

> Hi @zh2333,
>
> Thanks for your support.
>
> Currently, the vLLM version does not support TP, but we expect to add this feature by the middle of next month. I'm closing issue #63 as a duplicate.

Thank you very much for your reply. Looking forward to TP support in MInference!