vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Performance]: bitsandbytes quantization slow #9535

Open lance0108 opened 2 days ago

lance0108 commented 2 days ago

Proposal to improve performance

Improve bitsandbytes quantization inference speed

Report of performance regression

I'm testing llama-3.2-1b on a toy dataset. For offline inference using the LLM class, the original model from Huggingface took 45 seconds but the 4-bit model (both inflight quantized and unsloth quantized) took 71 seconds.

I wonder whether I'm not serving the quantized models properly, or whether it's expected that bnb quantization leads to much slower inference because it is under-optimized at this point.

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`
Can't run this on my work machine.

A100-80GB, vllm==0.6.3.post1

jeejeelee commented 1 day ago

Could you please provide your running script?

lance0108 commented 1 day ago

import torch
from vllm import LLM, SamplingParams

llm = LLM(
    model="<local_dir>/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
    quantization="bitsandbytes",   # in-flight bnb quantization of the bf16 checkpoint
    load_format="bitsandbytes",
)

prompts = ["question1", "question2", ...]
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
)
outputs = llm.generate(prompts, sampling_params)

jeejeelee commented 1 day ago

I can reproduce your issue, but I haven't deeply investigated whether it's reasonable. Also cc @mgoin @chenqianfzh

mgoin commented 1 day ago

@lance0108 @jeejeelee I think you should assume bitsandbytes quantization only provides a performance benefit at batch_size~=1 and with short prompts. In an offline batched scenario, it is quite reasonable for weight-only quantization to provide no benefit and even hurt your token throughput.
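
A rough way to check this for yourself (not an official benchmark, and the prompt set, batch sizes, and model path below are placeholders): time the same model at batch size 1 and at a larger batch, for both the bnb-quantized path and the unquantized baseline.

```python
# Rough timing sketch; model path, prompts, and batch sizes are placeholders.
# Running two LLM instances back to back in one process assumes the GPU has
# room for both; otherwise run each configuration in a separate process.
import time

from vllm import LLM, SamplingParams

prompts = ["Explain KV caching in one sentence."] * 64
params = SamplingParams(temperature=0, max_tokens=128)


def timed_generate(llm, batch):
    start = time.perf_counter()
    llm.generate(batch, params)
    return time.perf_counter() - start


for quant in (None, "bitsandbytes"):
    llm = LLM(
        model="<local_dir>/Llama-3.2-1B-Instruct",
        quantization=quant,
        load_format="bitsandbytes" if quant else "auto",
    )
    llm.generate(prompts[:1], params)       # warm-up: profile_run / CUDA graphs
    bs1 = timed_generate(llm, prompts[:1])  # batch size 1
    bs64 = timed_generate(llm, prompts)     # batch size 64
    print(quant or "baseline", f"bs=1: {bs1:.2f}s", f"bs=64: {bs64:.2f}s")
    del llm
```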

lance0108 commented 1 day ago

Thanks, @mgoin. Do we expect bnb quantization to benefit batch inference in the near future?

mgoin commented 1 day ago

@lance0108 There is no plan at the moment; optimizing for pure throughput with weight-only quantization is a losing battle. The best kernels we have for this are Marlin and Machete, which work with GPTQ/AWQ-quantized checkpoints.
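
For reference, a minimal sketch of what serving one of those checkpoint formats could look like; the model path is a placeholder, and as far as I understand vLLM normally reads the quantization method from the checkpoint's config, so forcing the `quantization` argument is usually unnecessary.

```python
# Hedged sketch: serve a GPTQ- or AWQ-quantized checkpoint instead of bnb so the
# Marlin/Machete kernels mentioned above can be used. Model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(model="<gptq_or_awq_checkpoint_dir_or_hub_id>")
outputs = llm.generate(
    ["question1", "question2"],
    SamplingParams(temperature=0, max_tokens=256),
)
for out in outputs:
    print(out.outputs[0].text)
```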

chenqianfzh commented 1 day ago

@lance0108 @mgoin @jeejeelee

I think this is a non-issue. The longer e2e latency is observed in vLLM because extra preliminary steps, namely the profile_run and CUDA graph capture, are involved.

I tested "huggyllama/llama-7b" in my local environment (I did not use meta-llama/Llama-3.2-1B-Instruct, as my application for access had not been approved after waiting an hour). The e2e latency for vLLM is 35.5s, versus 13.8s for transformers.

However, let's look at the breakdown of the latencies:

In vLLM:

  1. model loading: 10s
  2. profile_run: 1s
  3. CUDA graph capture: 23s
  4. inference: 1.5s

In transformers:

  1. model loading: 10s
  2. inference: 3.8s

We can see that vLLM appears slower mostly because of the time cost of CUDA graph capture and the profile run, which happen only once when bringing up a vLLM server.

Actually, in my tests the inference time of vLLM is consistently lower than that of transformers (1.5s vs 3.8s). I would attribute this to the performance optimizations inherent in the vLLM design.

Therefore, if we set up a vLLM server to serve a bnb model, since CUDA graph capture and profile_run happen only on the first run, we can warm the server up with an initial request, and end users should then enjoy lower latency than with transformers! A sketch of this warm-up idea is shown below.
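
A minimal sketch of that warm-up approach, assuming the same placeholder model path and prompts used earlier in the thread: pay the one-time costs with a throwaway request, then time only the steady-state generation that end users would see.

```python
# Warm-up sketch; model path and prompts are placeholders.
import time

from vllm import LLM, SamplingParams

llm = LLM(
    model="<local_dir>/Llama-3.2-1B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    # enforce_eager=True would skip CUDA graph capture entirely, trading a bit
    # of steady-state speed for a much faster first request.
)
params = SamplingParams(temperature=0, max_tokens=256)

llm.generate(["warm-up prompt"], params)  # pays profile_run / CUDA graph costs once

start = time.perf_counter()
outputs = llm.generate(["question1", "question2"], params)
print(f"steady-state inference: {time.perf_counter() - start:.2f}s")
```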

jeejeelee commented 20 hours ago

We also conducted some experiments to test BNB performance (excluding model loading and profile_run): [benchmark results image]