huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

Question about batch and quantization #800

Closed Tiiiger closed 1 year ago

Tiiiger commented 1 year ago

Hi @OlivierDehaene and @Narsil ,

Thank you for maintaining this great repo. Quick question: technically, the optimized kernels in GPTQ and BNB 4-bit only support batch size 1. I am a bit confused about how the dynamic batching in this repo is implemented under the hood. Are we using batch size 1 for these kernels? Otherwise, my impression is that the quantized model will run slower.

As mentioned in https://github.com/TimDettmers/bitsandbytes/releases, "This means you can take advantage the new kernels by separating a multi-batch 4-bit query into multiple requests with batch size 1". My question is whether we are doing this in this repo.

I have some very preliminary results showing that with more than one concurrent request, the latency becomes much higher (60 ms per token at batch size 1 vs. 160 ms per token at batch size 2) when using bnb-nf4. This is testing a 33B LLaMA model on a 4xL4 machine with 4 shards. So I am thinking we are probably not using the optimized batch-size-1 kernel under the hood.

Let me know any additional information I can help provide.

Thank you!

Narsil commented 1 year ago

Hi @Tiiiger

This repo will indeed stack all requests into 1, if it's possible.

It's possible IF AND ONLY IF flash attention is enabled (either v1 or v2) and the model actually supports it (so no ALiBi models).

In that case we stack requests without padding and everything is batch size one.
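
To illustrate the idea, here is a rough sketch of the stacking (not the actual TGI code; the 4544 hidden size and the sequence lengths are made up):

import torch

# Rough sketch of stacking without padding (not the actual TGI code): three
# requests with different prompt lengths get concatenated along the token
# dimension instead of being padded to a common length.
hidden_size = 4544                       # example hidden size
seq_lens = [5, 2, 1]                     # per-request token counts
per_request = [torch.randn(n, hidden_size) for n in seq_lens]

stacked = torch.cat(per_request, dim=0)  # [sum(seq_lens), hidden_size]
cu_seqlens = torch.tensor([0, 5, 7, 8])  # cumulative boundaries, as used by
                                         # variable-length flash attention
print(stacked.shape)                     # torch.Size([8, 4544])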

However, during benchmarking we used flash-attention-enabled models and found that the performance did decrease even though the BS stayed at 1...

https://github.com/huggingface/text-generation-inference/pull/586#issuecomment-1632225681

Both of those benchmarks were run when the conditions should be met. Is this expected behavior? I'm happy to add logging proving the shapes of the tensors going in and out if you want.

Tiiiger commented 1 year ago

Hi @Narsil

Thank you so much for replying. Yes, I was testing with flash attention v2 with LLaMA (Vicuna 33B), so I think that condition is met.

From your results, it also seems that batch size 1 is much faster than batch size 2. What I am seeing is that when there is more than one concurrent request, the requests do not seem to be broken into two batch-size-1 requests. I am issuing concurrent requests through multiprocessing. Here is an illustrative minimal example of how I am benchmarking this:

import multiprocessing as mp
import time

from text_generation import Client

def hf_worker(_worker_id):  # imap_unordered passes an index we don't need
    # Each worker opens its own client and streams a single generation.
    client = Client("http://0.0.0.0:20888", timeout=60)
    prompt = "Explain to me what is deep learning."

    output = ""
    time_start = time.time()
    num_tokens = 0
    for response in client.generate_stream(prompt, max_new_tokens=512):
        if not response.token.special:
            output += response.token.text
            num_tokens += 1
    time_elapsed = time.time() - time_start

    return time_elapsed

concurrent_request = 1
with mp.Pool(processes=concurrent_request) as pool:
    # list() forces the lazy imap_unordered iterator so every worker actually runs.
    time_stats = list(pool.imap_unordered(hf_worker, range(concurrent_request)))

By changing concurrent_request between 1 and 2, the results are quite different: 60 ms per token for batch size 1 vs. 160 ms per token for batch size 2.

I'm happy to add logging proving the shapes of the tensors going in and out if you want.

This would be super helpful for checking what's going on in my environment. It could also be that I didn't set something up correctly. Thank you so much!

Tiiiger commented 1 year ago

Also, to follow up: were the benchmarking results you linked done with bnb 0.40.0?

Just want to make sure. Thanks!

Narsil commented 1 year ago

0.40.0 or 0.40.1; it was rather early, but I don't remember specifically.

Tiiiger commented 1 year ago

0.40.0 or 0.40.1; it was rather early, but I don't remember specifically.

Got it! 0.41.0 behaves a bit differently for me (i.e., faster on A100). Just mentioning it in case you might be interested in benchmarking this again!

Narsil commented 1 year ago

I confirm the numbers on 0.41.1.

This is what the model is seeing: torch.Size([8, 4544]) (8 here is a sequence length, not a batch size). Maybe there's some confusion about what batch size actually means; to me, batch size would imply a [BATCH_SIZE, SEQ_LEN, HIDDEN_STATES] shape.

The model here always sees [BATCH_SIZE * SEQ_LEN, HIDDEN_STATES] (well, [SUM(SEQ_LEN), HIDDEN_STATES] to be precise).

Tiiiger commented 1 year ago

Yes, I think I have the same understanding of batch size as you.

https://github.com/TimDettmers/bitsandbytes/blob/main/bitsandbytes/autograd/_functions.py#L566 Looking at this, the 4-bit CUDA kernel is only used when the input is [1, HIDDEN_STATES]. So it makes sense that we see a slowdown.
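
For reference, a minimal paraphrase of that kind of shape-based dispatch (the function names and bodies are placeholder stand-ins, not the real bitsandbytes kernels or API):

import torch

def fast_gemv_path(A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # Stand-in for the optimized batch-size-1 (gemv-style) 4-bit inference kernel.
    return A @ W.t()

def general_path(A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # Stand-in for the slower general path (e.g. dequantize then matmul).
    return A @ W.t()

def matmul_4bit_dispatch(A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    # The condition paraphrased from the discussion above: the fast kernel is
    # only taken when the input collapses to a single row, i.e. [1, HIDDEN_STATES].
    if A.numel() == A.shape[-1]:
        return fast_gemv_path(A, W)
    return general_path(A, W)

hidden = 4544
W = torch.randn(64, hidden)
print(matmul_4bit_dispatch(torch.randn(1, hidden), W).shape)  # fast path: [1, 64]
print(matmul_4bit_dispatch(torch.randn(8, hidden), W).shape)  # general path: [8, 64]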

In your example, why would the model see Size([8, 4544])? In the decode step, shouldn't the model just operate on the last token and use the KV store for the previous tokens?

Narsil commented 1 year ago

Yes, but I'm benchmarking, and it's stacking 8 requests together (1 + 1 + 1 ...).

This is normal when benchmarking; it's how the continuous batching works.
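
To make the shapes concrete, here is a small sketch of what the decode-step input looks like under continuous batching (illustrative only, not TGI internals):

import torch

# During decode, each of the 8 in-flight requests contributes exactly one new
# token per step; their single-token hidden states are stacked into one tensor.
hidden_size = 4544
num_requests = 8
last_token_hidden = [torch.randn(1, hidden_size) for _ in range(num_requests)]

decode_input = torch.cat(last_token_hidden, dim=0)
print(decode_input.shape)  # torch.Size([8, 4544]) -- one row per request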

Narsil commented 1 year ago

(text-generation-benchmark will run the decode step at various stacked sizes.)

Tiiiger commented 1 year ago

https://github.com/TimDettmers/bitsandbytes/releases

The inference kernels for batch size 1 are about 8x faster than 4-bit training kernel for QLoRA. This means you can take advantage the new kernels by separating a multi-batch 4-bit query into multiple requests with batch size 1.

Right, I guess my understanding from reading this is that we should not stack requests together when using bnb-nf4? We should send 8 tensors of shape [1, HIDDEN_STATES] instead of stacking them into [8, HIDDEN_STATES].

I guess there is probably a trade-off between using the fast batch-size-1 kernel and using a larger batch size (probably larger than batch size 2 or 4, because in my benchmarking those are slower than running batch size 1 twice). Is there a way to turn stacking requests together on/off?
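
For what it's worth, one way to sanity-check the kernel-launch-overhead side of that trade-off is a micro-benchmark on plain fp16 GEMMs (illustrative only; this does not use the bnb 4-bit kernels):

import time
import torch

# Illustrative micro-benchmark: one stacked [8, H] matmul vs. eight separate
# [1, H] matmuls, using plain fp16 GEMMs rather than the bnb 4-bit kernels.
hidden = 4544
W = torch.randn(hidden, hidden, device="cuda", dtype=torch.float16)
x = torch.randn(8, hidden, device="cuda", dtype=torch.float16)

def bench(fn, iters=200):
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

stacked = bench(lambda: x @ W)                             # one kernel call
split = bench(lambda: [x[i:i + 1] @ W for i in range(8)])  # eight kernel calls
print(f"stacked: {stacked * 1e3:.3f} ms, split: {split * 1e3:.3f} ms")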

Narsil commented 1 year ago

Running 1 kernel should almost always be faster than running 2, because of the kernel call overhead.

If that's not the case, I would consider it a bug/potential improvement in the kernel, and we can wait for better kernels in future versions.

In any case, we're not going to modify the modeling code for bnb (or any other quantization scheme). Keeping quantization, modeling, and TP orthogonal (almost 100%; only loading sometimes requires a bit of attention) is a very comfortable situation where we can add a new quantization technique without fear of breaking another one.

Apparently there's still some improvement coming in BNB.

Closing this for now.