huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Support PagedAttention #478

Closed · Atry closed this issue 1 year ago

Atry commented 1 year ago

Feature request

vLLM is fast, with efficient management of attention key and value memory via PagedAttention, and serves higher throughput than TGI.

Motivation

Adopting PagedAttention would increase throughput and reduce VRAM usage, especially when using beam search.

Your contribution

I don't have plans to work on it right now. This is a backlog item for me.

OlivierDehaene commented 1 year ago

We will run some benchmarks and see if it makes sense to add it to TGI.

ArnaudHureaux commented 1 year ago

Is it true that it's 2 times faster than TGI?

Narsil commented 1 year ago

Not in latency (Depends on the benchmark/hardware, but it is basically on par).

PagedAttention seems to be nicer with respect to VRAM usage, meaning it's better when you're low on VRAM. This mostly affects throughput in those regimes, not latency.

It's definitely worth looking into for us to help in that regime, but it's not going to be night and day.

stereoplegic commented 1 year ago

This plus QLoRA (esp if it can be combined with higher-context attention fixes e.g. Landmark, FlashAttention, ALiBi) would be huge for all sorts of things, but esp. CoT/ToT reasoning, retrieval/tool augmented responses, and synthetic dataset generation/augmentation (can we finally stop relying on OpenAI for this please?).

ArnaudHureaux commented 1 year ago

Not in latency (Depends on the benchmark/hardware, but it is basically on par).

PagedAttention seems to be nicer with respect to VRAM usage, meaning it's better when you're low on VRAM. This mostly affects throughput in those regimes, not latency.

It's definitely worth looking into for us to help in that regime, but it's not going to be night and day.

Ok, so in other words, it's reducing the size of the pipe but not changing the pipe's flow rate, and it will reduce the cost of the necessary infrastructure but not reduce the time per word? :)

ArnaudHureaux commented 1 year ago

This plus QLoRA (esp if it can be combined with higher-context attention fixes e.g. Landmark, FlashAttention, ALiBi) would be huge for all sorts of things, but esp. CoT/ToT reasoning, retrieval/tool augmented responses, and synthetic dataset generation/augmentation (can we finally stop relying on OpenAI for this please?).

How are you doing data augmentation with OpenAI?

Thanks in advance for your answers.

andreapiso commented 1 year ago

PagedAttention seems to be nicer with respect to VRAM usage, meaning it's better when you're low on VRAM. This mostly affects throughput in those regimes, not latency.

Correct. I think a lot of the benchmarking and hype around vLLM leaves this part aside: if your model just barely fits on the GPU, you are not going to see improvements. I have tried vLLM with Starcoder on an A100, and in many cases it actually performs worse than vanilla HF.

In this scenario, do you think it makes sense to shard a model that can fit on a single GPU over 2 GPUs, paying the sharding latency price to chase this magic 24x improvement, or to just stick to deploying 2 load-balanced replicas and get a 2x improvement?

Narsil commented 1 year ago

In this scenario, do you think it makes sense to shard a model that can fit on a single GPU over 2 GPUs, paying the sharding latency price to chase this magic 24x improvement, or to just stick to deploying 2 load-balanced replicas and get a 2x improvement?

My personal opinion on anything performance related: just measure and keep measuring. I don't have an all-out formula.

At HF, sharding is usually a latency benefit on LLMs, and having more free VRAM enables more batching, meaning more users on the same hardware, which is $$ efficient. This holds when the model is memory bound (which is usually the case when using LLMs on big GPUs), but not if you're compute bound. text-generation-benchmark helps us figure out the memory/compute boundary (look for inflection points in your curves in the decode part).

For readers:

Memory bound -> more batching increases throughput roughly linearly and latency stays basically the same.
Compute bound -> more batching increases latency roughly linearly and throughput stays basically the same.
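
To make the "look for inflection points" advice concrete, here is a minimal sketch (not TGI code; the latency numbers are made-up placeholders) of how you could post-process a batch-size sweep from text-generation-benchmark or your own load test to see where you leave the memory-bound regime:

```python
# Minimal sketch (not TGI code): classify the decoding regime from a batch-size
# sweep. The latencies below are made-up placeholders; substitute the per-token
# decode latencies you measure with text-generation-benchmark or your own load test.

# batch size -> mean per-token decode latency in ms (hypothetical measurements)
decode_latency_ms = {1: 12.1, 2: 12.3, 4: 12.6, 8: 13.4, 16: 18.9, 32: 35.0}

prev_batch, prev_latency = None, None
for batch, latency in sorted(decode_latency_ms.items()):
    throughput = batch * 1000.0 / latency  # tokens/s across the whole batch
    if prev_batch is not None:
        # Memory bound: latency barely moves when the batch grows.
        # Compute bound: latency grows roughly in step with the batch size.
        regime = "memory bound" if latency / prev_latency < 1.25 else "compute bound"
        print(f"batch {batch:>3}: {latency:6.1f} ms/token, "
              f"{throughput:8.1f} tok/s -> looks {regime}")
    prev_batch, prev_latency = batch, latency
```

The 1.25x threshold is arbitrary; the point is simply that per-token decode latency stays roughly flat while you are memory bound and starts tracking batch size once you become compute bound.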

OlivierDehaene commented 1 year ago

I have tried vLLM with Starcoder on an A100, and in many cases it actually performs worse than vanilla HF.

Have you tried running starcoder with TGI? You should see a large improvement, as we have re-written the whole modeling code to optimise it. Also, it should be noted that vLLM does not properly implement multi-query attention yet, so its kv cache actually takes 48x more memory than it should for starcoder.
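
For intuition, the 48x figure lines up with StarCoder's attention-head count: with multi-query attention there is a single key/value head per layer, so falling back to a full multi-head cache multiplies the KV cache by the number of query heads. A rough back-of-the-envelope sketch (the config values are approximate StarCoder numbers I'm assuming, not something stated in this thread):

```python
# Back-of-the-envelope KV-cache size for a StarCoder-like config (assumed values:
# 40 layers, 48 query heads, head_dim 128, fp16). Not TGI or vLLM code.
layers, num_heads, head_dim, bytes_per_value = 40, 48, 128, 2

def kv_cache_bytes_per_token(kv_heads: int) -> int:
    # 2 tensors per layer (K and V), one slice per KV head, fp16 values.
    return 2 * layers * kv_heads * head_dim * bytes_per_value

mqa = kv_cache_bytes_per_token(kv_heads=1)          # proper multi-query attention
mha = kv_cache_bytes_per_token(kv_heads=num_heads)  # full multi-head fallback

print(f"MQA: {mqa / 1024:.0f} KiB per token")       # ~20 KiB
print(f"MHA: {mha / 1024 ** 2:.2f} MiB per token")  # ~0.94 MiB
print(f"ratio: {mha // mqa}x")                      # 48x
```

Under these assumptions that is roughly 1 MiB of KV cache per token instead of ~20 KiB, i.e. on the order of 7.5 GiB per 8k-token sequence instead of ~160 MiB, which is why the missing MQA support matters so much here.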

andreapiso commented 1 year ago

Yes, tgi is what we are using today and it's the best we have so far :) We were trying vLLM after some articles reported very appealing numbers (23-24x over huggingface?), but we have not seen those gains in our case.

And yes, we did see that the PR uses MHA instead of MQA, but did not realise it would be that inefficient.

I guess the next thing I should do is to try and test whether 2 load balanced replicas of tgi with starcoder, with one GPU each, are better or worse than one instance sharding the model over 2 GPUs.

We are trying to scale our internal starcoder/starchat-based coding assistant to 2000 developers with as few GPUs as possible, so tools like tgi give us immense value. Thanks a lot for all the good work.
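
A quick way to run the comparison andreapiso describes is to fire a fixed number of concurrent requests at each deployment and compare latency percentiles and requests per second. This is a rough sketch, not an official benchmark script: the URL, prompt and request counts are placeholders, while the /generate route and its inputs/parameters JSON body are TGI's documented REST API.

```python
# Rough load-test sketch: send `TOTAL` requests with `CONCURRENCY` in flight and
# report latency percentiles and request throughput. The endpoint URL, prompt and
# counts are placeholders; adjust them to your deployment.
import time
from concurrent.futures import ThreadPoolExecutor

import requests

ENDPOINT = "http://localhost:8080/generate"  # single sharded instance or load balancer
PAYLOAD = {
    "inputs": "def fibonacci(n):",
    "parameters": {"max_new_tokens": 128},
}
CONCURRENCY, TOTAL = 16, 128

def one_request(_: int) -> float:
    start = time.perf_counter()
    resp = requests.post(ENDPOINT, json=PAYLOAD, timeout=120)
    resp.raise_for_status()
    return time.perf_counter() - start

wall_start = time.perf_counter()
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    latencies = sorted(pool.map(one_request, range(TOTAL)))
wall = time.perf_counter() - wall_start

print(f"p50 latency: {latencies[len(latencies) // 2]:.2f}s")
print(f"p95 latency: {latencies[int(len(latencies) * 0.95)]:.2f}s")
print(f"requests/s:  {TOTAL / wall:.2f}")
```

Run it once against the single sharded instance and once against the load balancer in front of the two single-GPU replicas, at the concurrency you actually expect, and let the numbers decide.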

aliswel-mt commented 1 year ago

@andreapiso May I ask what latency you are getting with starcoder on tgi? I'm seeing high latency (around 4s) for 128 max_new_tokens on an A100-40G.