vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Support for sparsity? #1574

Open BDHU opened 10 months ago

BDHU commented 10 months ago

Is it possible to do semi-structured sparsity for lower inference latency? Thanks!

simon-mo commented 10 months ago

Can you elaborate? Is there any pre-trained models you had in mind?

WoosukKwon commented 10 months ago

@BDHU @simon-mo I think this means NVIDIA's 2:4 sparse Tensor Core, which is known to increase the matmul speed by up to 2x, with potential degradation in accuracy. While the speedup is huge, I'm not sure how popular this is.
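For concreteness, the 2:4 pattern can be tried out with PyTorch's prototype semi-structured sparsity support. A minimal sketch, assuming a recent PyTorch build and an Ampere-or-newer GPU (sizes are illustrative only, this is not vLLM code):

```python
# Minimal 2:4 semi-structured sparsity sketch using PyTorch's prototype
# torch.sparse support; purely an illustration of the pruning pattern.
import torch
from torch.sparse import to_sparse_semi_structured

dense = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")

# Enforce the 2:4 pattern: in every group of 4 weights along a row,
# keep the 2 largest-magnitude values and zero out the other 2.
groups = dense.view(-1, 4)
keep = groups.abs().topk(2, dim=-1).indices
mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
pruned = (groups * mask).view_as(dense)

# Compress to the semi-structured format so matmuls can hit sparse Tensor Cores.
sparse_w = to_sparse_semi_structured(pruned)

x = torch.randn(4096, 4096, dtype=torch.float16, device="cuda")
y_sparse = sparse_w @ x   # 2:4 sparse kernel
y_dense = pruned @ x      # dense reference
print((y_sparse - y_dense).abs().max())  # small numerical difference expected
```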

zhaoyang-star commented 8 months ago

DEJAVU, a method that uses a low-cost algorithm to predict contextual sparsity on the fly given the inputs to each layer, along with an asynchronous and hardware-aware implementation that speeds up LLM inference. I am not sure how much work it would take to port it to vLLM.
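To give a rough idea of the mechanism (a hypothetical toy, not DEJAVU's actual code): a small low-rank scorer looks at a layer's input and picks which MLP neurons to compute, and the MLP then only touches the selected slices of its weights.

```python
# Toy contextual-sparsity predictor: a low-rank scorer picks which MLP
# neurons to keep for a given input; purely illustrative, not DEJAVU code.
import torch
import torch.nn as nn

class NeuronPredictor(nn.Module):
    def __init__(self, hidden: int, intermediate: int, rank: int = 128):
        super().__init__()
        # Low-rank scorer keeps the prediction overhead small.
        self.scorer = nn.Sequential(
            nn.Linear(hidden, rank), nn.ReLU(), nn.Linear(rank, intermediate)
        )

    def forward(self, x: torch.Tensor, keep_frac: float = 0.05) -> torch.Tensor:
        scores = self.scorer(x)                      # [batch, intermediate]
        k = max(1, int(scores.size(-1) * keep_frac))
        return scores.topk(k, dim=-1).indices        # neuron indices to keep

def sparse_mlp(x, w_up, w_down, idx):
    # Compute only the predicted neurons: gather the matching weight slices.
    hidden_states = torch.relu(x @ w_up[:, idx])     # [batch, k]
    return hidden_states @ w_down[idx, :]            # back to [batch, hidden]

hidden, inter = 1024, 4096
x = torch.randn(1, hidden)
idx = NeuronPredictor(hidden, inter)(x)[0]
out = sparse_mlp(x, torch.randn(hidden, inter), torch.randn(inter, hidden), idx)
print(out.shape)  # torch.Size([1, 1024])
```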

shiqingzhangCSU commented 6 months ago

https://github.com/neuralmagic/nm-vllm/tree/main implemented sparse inference.

simon-mo commented 6 months ago

@robertgshaw2-neuralmagic ^

guojunzzc commented 3 months ago

Is there any plan to implement DEJAVU in vLLM? It seems to speed up TTFT greatly. @simon-mo @WoosukKwon @zhaoyang-star

BDHU commented 2 months ago

My understanding is that DEJAVU is intended for the single-user case, which might not be what vLLM is targeting.

guojunzzc commented 2 months ago

DEJAVU is a general framework for all models: it adds layer/head selectors to speed up prefill, similar in spirit to how a speculative decoding framework plugs into many models.

BDHU commented 2 months ago

I'm not sure what it has to do with speculative decoding, since they are completely different techniques, other than both speeding up inference. DEJAVU's contextual sparsity won't have much benefit when the number of requests becomes large. That's why it's more suitable for frameworks targeting the single-user case where resources are constrained.

guojunzzc commented 2 months ago

I may not have said it clearly: it has no relation to speculative decoding; it is only similar at the framework level, in that both could support multiple models.

On the other hand, I am not sure whether it benefits large numbers of requests or not, since its gain comes from skipping a large percentage of layers without losing much model quality, speeding up inference several times, which would make it one of the best latency options.

BDHU commented 2 months ago

It doesn't skip "layers" (if it's the decoder layers you mean). It essentially trains a predictor that predicts which parts of the weights a single request needs. While it may speed up inference, its benefit quickly diminishes as the number of requests grows, because, intuitively, different requests may use different parts of the weights. The larger the number of requests, the more likely it is that they collectively touch different parts of the weights, resulting in little to no computation being skipped. I think PowerInfer and Apple's "LLM in a Flash" have implementations similar to DEJAVU, but both target different use cases than vLLM. It would be nice to have it in vLLM, but I doubt that will be possible, at least in the near future.
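To make the batching argument concrete, here is a toy calculation (it assumes independent random activation sets, whereas real activations are correlated across requests, so this is only illustrative):

```python
# Toy illustration of why per-request contextual sparsity fades with batching:
# each request "needs" ~5% of MLP neurons, but a batch must compute the union.
import random

NUM_NEURONS = 4096        # hypothetical MLP width
ACTIVE_FRACTION = 0.05    # roughly "1/20 of neurons important" per request

def active_set(rng: random.Random) -> set[int]:
    k = int(NUM_NEURONS * ACTIVE_FRACTION)
    return set(rng.sample(range(NUM_NEURONS), k))

rng = random.Random(0)
for batch_size in (1, 4, 16, 64, 256):
    union: set[int] = set()
    for _ in range(batch_size):
        union |= active_set(rng)
    print(f"batch={batch_size:4d}  fraction of neurons touched: "
          f"{len(union) / NUM_NEURONS:6.1%}")
```

With independent sets the touched fraction follows 1 - (1 - 0.05)^batch_size, climbing from 5% at batch size 1 to over 95% by batch size 64, at which point most of the skipped computation comes back.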

guojunzzc commented 2 months ago

Thank you very much for the explanation, which gives a lot of detail.

The DEJAVU paper says that roughly "1/5 attention heads and 1/20 MLP layers" are important for a given prompt, so the predictor chooses which ones matter for each prompt; that seems to have little to do with the number of requests.

Also, it would be more important to provide models together with their predictors, which is beyond vLLM's scope.

Anyway, hopefully vLLM will consider it in the future.

yzlnew commented 2 months ago

There's already a gptq_marlin24 choice in the quantization config; it's a WIP, I guess?
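If someone wants to try it, a hedged usage sketch (the exact quantization option string and a compatible 2:4-sparse GPTQ checkpoint depend on the installed vLLM version, so treat both names below as placeholders to verify against the docs):

```python
# Hedged sketch only: assumes a vLLM build exposing a Marlin 2:4 sparse GPTQ
# kernel; the quantization string and model path are placeholders to verify.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/2of4-sparse-gptq-checkpoint",  # hypothetical checkpoint
    quantization="gptq_marlin_24",                # assumed option name
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```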