Reshaped neural-speed into a fully functional inference engine for vLLM.
Integrated the vLLM Neural Speed (NS) extension into llm-on-ray and optimized the deployment with Ray; a rough sketch of the deployment shape follows.
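For illustration only (not the actual llm-on-ray code), the sketch below shows one way a Neural Speed backed engine could be exposed as a Ray Serve deployment; `my_ns_extension` and `NSEngine` are hypothetical placeholders for the vLLM-NS extension's engine class.

```python
# Illustrative sketch: serving a hypothetical Neural Speed backed engine
# through Ray Serve. Only the Ray Serve API calls are real; the engine
# import and model id are placeholders.
from ray import serve


@serve.deployment(num_replicas=2, ray_actor_options={"num_cpus": 32})
class NSInference:
    def __init__(self, model_id: str):
        # Hypothetical: load the model through the Neural Speed backend.
        from my_ns_extension import NSEngine  # placeholder import
        self.engine = NSEngine(model_id)

    async def __call__(self, request) -> dict:
        prompt = (await request.json())["prompt"]
        return {"text": self.engine.generate(prompt)}


app = NSInference.bind("meta-llama/Llama-2-7b-chat-hf")
# serve.run(app, route_prefix="/v1")  # started on an existing Ray cluster
```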
Optimized neural-speed in several places, including compute-graph construction, multi-NUMA-node deployment (see the sketch below), and enabling the flash-attention kernel for Llama-3-8B.
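The multi-NUMA-node idea is to pin one inference worker per NUMA node so that weights and the KV cache stay in node-local memory. A minimal launcher sketch with `numactl` is shown below; the worker entry point and model id are placeholders, not code from this PR.

```python
# Hypothetical launcher: one worker per NUMA node, pinned with numactl so
# each worker allocates memory only from its local node.
import subprocess

NUM_NUMA_NODES = 2  # e.g. a two-socket CPU server

procs = []
for node in range(NUM_NUMA_NODES):
    cmd = [
        "numactl",
        f"--cpunodebind={node}",  # run only on cores of this NUMA node
        f"--membind={node}",      # allocate memory only from this node
        "python", "-m", "my_ns_worker",  # hypothetical worker entry point
        "--model", "meta-llama/Llama-2-7b-chat-hf",
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```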
Updated and fixed the benchmark scripts for the IDC test and the OpenAI-mode test, including support for multiple messages with different roles, removal of empty chunks, and corrected first-token and next-token latency measurement in OpenAI mode (see the sketch below).
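For reference, here is a minimal sketch of how first-token and next-token latency can be measured in OpenAI (streaming) mode while skipping empty chunks; the endpoint URL and model name are placeholders, and this is not the benchmark script itself.

```python
# Sketch: measure first-token and average next-token latency over a
# streaming chat completion, ignoring empty chunks.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

messages = [  # multiple messages with different roles
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain NUMA in one sentence."},
]

start = time.perf_counter()
first_token_latency = None
token_times = []

stream = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf", messages=messages, stream=True
)
for chunk in stream:
    if not chunk.choices or not chunk.choices[0].delta.content:
        continue  # skip empty chunks so they do not distort the latency numbers
    now = time.perf_counter()
    if first_token_latency is None:
        first_token_latency = now - start  # time to the first generated token
    token_times.append(now)

# Average next-token latency: gap between consecutive non-empty chunks.
gaps = [b - a for a, b in zip(token_times, token_times[1:])]
next_token_latency = sum(gaps) / len(gaps) if gaps else float("nan")
print(f"first token: {first_token_latency:.3f}s, "
      f"next token: {next_token_latency * 1000:.1f}ms")
```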
Only Llama-2-7b-chat-hf and Llama-3-8b-instruct are supported for now, but support can be extended to other models quickly.
Addressed review comments from the last closed PR.
Achieved a 2x performance improvement compared to plain vLLM on CPU.
This PR replaces the closed PR https://github.com/intel/llm-on-ray/pull/264, which was based on an old branch, and merges some enhancements from the NS main branch.