lance0108 opened 1 month ago
Could you please provide your running script?
import torch
from vllm import LLM, SamplingParams

llm = LLM(
    model="<local_dir>/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
prompts = ["question1", "question2", ...]
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
)
outputs = llm.generate(prompts, sampling_params)
I can reproduce your issue, but I haven't deeply investigated whether it's reasonable. Also cc @mgoin @chenqianfzh
@lance0108 @jeejeelee I think you should assume bitsandbytes quantization only provides a performance benefit at batch_size ≈ 1 and with short prompts. In an offline batched scenario, it is quite reasonable for weight-only quantization to provide no benefit and even hurt your token throughput.
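To make that concrete, here is a minimal sketch (the model path, prompt, and batch sizes are illustrative, not from the reports in this thread) for measuring generated-token throughput at a few batch sizes with the bnb-quantized model:

import time
import torch
from vllm import LLM, SamplingParams

llm = LLM(
    model="<local_dir>/Llama-3.2-1B-Instruct",  # same flags as the script above
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
sampling_params = SamplingParams(temperature=0, max_tokens=256)

prompt = "Explain the difference between a list and a tuple in Python."
for batch_size in (1, 8, 32):
    batch = [prompt] * batch_size
    llm.generate(batch, sampling_params)  # warm-up pass, excluded from timing
    start = time.perf_counter()
    outputs = llm.generate(batch, sampling_params)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch_size={batch_size}: {tokens / elapsed:.1f} generated tokens/s")

If weight-only quantization mainly helps memory-bound decoding at small batches, the tokens/s gap versus the unquantized model should shrink or reverse as batch_size grows.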
Thanks, @mgoin. Do we expect bnb quantization to benefit batch inference in the near future?
@lance0108 @mgoin @jeejeelee
I think this is a non-issue. The longer e2e latency observed in vLLM is because more preliminary steps, namely the profile_run and CUDA graph capturing, are involved in vLLM.
I tested "huggyllama/llama-7b" in my local environment (I did not use meta-llama/Llama-3.2-1B-Instruct because my application for access was not approved after waiting one hour). The e2e latency for vLLM is 35.5 s, versus 13.8 s for transformers.
However, let's take a look at the breakdown of the latencies:
In vLLM: (latency breakdown screenshot)
In transformers: (latency breakdown screenshot)
We can see that vLLM is slower mostly because of the time cost of CUDA graph capturing as well as the profile run, which happen only in the first run when running a vLLM server.
Actually, the inference time of vLLM in my tests is consistently less than transformers (1.5 s vs. 3.8 s). I would attribute this to the performance optimizations inherent in the vLLM design.
Therefore, if we set up a vLLM server to serve a bnb model, since CUDA graph capturing and profile_run happen only in the first run, we can warm up the server with one run, and then end users should enjoy lower latency than with transformers!
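As a rough illustration of the warm-up argument (the prompts and timing code are mine, not from the measurements above), one can separate the one-time setup cost from steady-state inference like this:

import time
import torch
from vllm import LLM, SamplingParams

setup_start = time.perf_counter()
# Engine construction includes model loading, profile_run, and CUDA graph capture.
llm = LLM(
    model="huggyllama/llama-7b",
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
sampling_params = SamplingParams(temperature=0, max_tokens=128)
llm.generate(["warm-up prompt"], sampling_params)  # absorb any remaining lazy initialization
setup_time = time.perf_counter() - setup_start

# Steady-state request, roughly what an end user sees after the server is warm.
start = time.perf_counter()
outputs = llm.generate(["What is PagedAttention?"], sampling_params)
print(f"setup + warm-up: {setup_time:.1f} s, inference: {time.perf_counter() - start:.1f} s")

If CUDA graph capture turns out to be the dominant setup cost, passing enforce_eager=True to the LLM constructor skips it entirely, trading some steady-state speed for a faster start.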
We also conducted some experiments to test BNB performance (without model loading and profile_run):
Proposal to improve performance
Improve bitsandbytes quantization inference speed
Report of performance regression
I'm testing llama-3.2-1b on a toy dataset. For offline inference using the LLM class, the original model from Hugging Face took 45 seconds, but the 4-bit model (both in-flight quantized and unsloth quantized) took 71 seconds.
I wonder whether I'm not serving the quantized models properly, or whether it's expected that bnb quantization leads to much slower inference because it is under-optimized at this point.
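For reference, a sketch of the two loading paths mentioned above (the unsloth repository name is illustrative; substitute the exact checkpoint you used). In both cases the same bitsandbytes flags apply, and vLLM picks up the quantization config stored in a pre-quantized checkpoint:

from vllm import LLM

# In-flight quantization: start from the full-precision checkpoint.
llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

# Pre-quantized checkpoint (e.g., an unsloth bnb-4bit export), run separately:
# llm = LLM(
#     model="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
#     quantization="bitsandbytes",
#     load_format="bitsandbytes",
# )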
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
A100-80GB, vllm==0.6.3.post1