intel/llm-on-ray

Pretrain, finetune and serve LLMs on Intel platforms with Ray
Apache License 2.0

vllm ns merged 209 (commit 7d49516) #272

Closed · jiafuzha closed this 3 months ago

jiafuzha commented 3 months ago

This PR replaces the closed PR https://github.com/intel/llm-on-ray/pull/264, which was based on an old branch, and merges some enhancements from the neural-speed (NS) main branch.

- Reshaped neural-speed into a fully functional inference engine for vLLM.
- Integrated the vLLM NS extension into llm-on-ray and optimized deployment with Ray.
- Optimized neural-speed in several places, including compute-graph construction, multi-NUMA-node deployment (see the placement sketch after this list), and enabling the flash-attention kernel for Llama-3-8b.
- Updated and fixed the benchmark scripts for the IDC test and the OpenAI-mode test: support for multiple messages with different roles, removal of empty chunks, and corrected first-token and next-token latency measurement in OpenAI mode (see the latency sketch after this list).
- Only Llama-2-7b-chat-hf and Llama-3-8b-instruct are supported for now, but support can be extended quickly to other models.
- Addressed some review comments from the previously closed PR.
- Achieved a 2x performance improvement over plain vLLM on CPU.
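On the multi-NUMA-node point: a minimal sketch (not the PR's actual code) of one-replica-per-NUMA-node placement with Ray. It assumes each host advertises a custom resource per NUMA node and that each actor pins its own process to that node's CPUs; the resource names and CPU ranges are illustrative assumptions.

```python
import os
import ray

# For a single-machine demo, advertise one token per NUMA node here; on a
# real cluster this would instead be set per host, e.g.
#   ray start --head --resources='{"numa_0": 1, "numa_1": 1}'
ray.init(resources={"numa_0": 1, "numa_1": 1})


@ray.remote(num_cpus=0, resources={"numa_0": 1})
class NumaBoundEngine:
    """One inference replica pinned to a single NUMA node (Linux only)."""

    def __init__(self, cpu_ids):
        # Restrict this replica's process to the CPUs of one NUMA node so
        # compute and memory accesses stay local.
        os.sched_setaffinity(0, cpu_ids)

    def affinity(self):
        return sorted(os.sched_getaffinity(0))


# Hypothetical CPU ids for NUMA node 0; query the real layout with `numactl -H`.
engine = NumaBoundEngine.remote(cpu_ids=range(0, 8))
print(ray.get(engine.affinity.remote()))
```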
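On the benchmark fixes: a minimal sketch of the corrected OpenAI-mode measurement described above, where first-token latency is the time to the first non-empty content chunk and next-token latency is the mean gap between subsequent chunks, with empty chunks skipped. The endpoint URL and model id are placeholders, not values from this repo.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Multiple messages with different roles, as exercised by the updated scripts.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain NUMA-aware deployment in one sentence."},
]

start = time.perf_counter()
first_token_latency = None
gaps, last = [], None

stream = client.chat.completions.create(
    model="Llama-2-7b-chat-hf",  # placeholder model id
    messages=messages,
    stream=True,
)
for chunk in stream:
    if not chunk.choices:
        continue
    content = chunk.choices[0].delta.content
    if not content:  # skip empty chunks so they don't skew the latencies
        continue
    now = time.perf_counter()
    if first_token_latency is None:
        first_token_latency = now - start  # time to first real token
    elif last is not None:
        gaps.append(now - last)  # inter-token gap
    last = now

print(f"first token latency: {first_token_latency:.3f}s")
if gaps:
    print(f"mean next token latency: {sum(gaps) / len(gaps):.3f}s")
```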