lance0108 opened 1 month ago
Could you please provide your running script?
import torch
from vllm import LLM, SamplingParams

llm = LLM(
    model="<local_dir>/Llama-3.2-1B-Instruct",
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
prompts = ["question1", "question2", ...]
sampling_params = SamplingParams(
    temperature=0,
    max_tokens=256,
)
outputs = llm.generate(prompts, sampling_params)
I can reproduce your issue, but I haven't deeply investigated whether it's reasonable. Also cc @mgoin @chenqianfzh
@lance0108 @jeejeelee I think you should assume bitsandbytes quantization only provides a performance benefit at batch_size ≈ 1 and with short prompts. In an offline batched scenario, it is quite reasonable for weight-only quantization to provide no benefit and even hurt your token throughput.
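To make that concrete, here is a minimal sketch (the model path, prompt, and batch sizes are illustrative, not from the reports in this thread) for measuring generated-token throughput at a few batch sizes with the bnb-quantized model:

import time
import torch
from vllm import LLM, SamplingParams

llm = LLM(
    model="<local_dir>/Llama-3.2-1B-Instruct",  # same flags as the script above
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
sampling_params = SamplingParams(temperature=0, max_tokens=256)

prompt = "Explain the difference between a list and a tuple in Python."
for batch_size in (1, 8, 32):
    batch = [prompt] * batch_size
    llm.generate(batch, sampling_params)  # warm-up pass, excluded from timing
    start = time.perf_counter()
    outputs = llm.generate(batch, sampling_params)
    elapsed = time.perf_counter() - start
    tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    print(f"batch_size={batch_size}: {tokens / elapsed:.1f} generated tokens/s")

If weight-only quantization mainly helps memory-bound decoding at small batches, the tokens/s gap versus the unquantized model should shrink or reverse as batch_size grows.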
Thanks, @mgoin. Do we expect bnb quantization to benefit batch inference in the near future?
@lance0108 @mgoin @jeejeelee
I think this is a non-issue. The longer e2e latency observed in vLLM is because more preliminary steps, namely the profile_run and CUDA graph capturing, are involved in vLLM.
I tested "huggyllama/llama-7b" in my local environment (I did not use meta-llama/Llama-3.2-1B-Instruct because my application for access was not approved after waiting one hour). The e2e latency for vLLM is 35.5 s, versus 13.8 s for transformers.
However, let's take a look at the breakdown of the latencies:
In vLLM: (latency breakdown screenshot)
In transformers: (latency breakdown screenshot)
We can see that vLLM is slower mostly because of the time cost of CUDA graph capturing as well as the profile run, which happen only in the first run when running a vLLM server.
Actually, the inference time of vLLM in my tests is consistently less than transformers (1.5 s vs. 3.8 s). I would attribute this to the performance optimizations inherent in the vLLM design.
Therefore, if we set up a vLLM server to serve a bnb model, since CUDA graph capturing and profile_run happen only in the first run, we can warm up the server with one run, and then end users should enjoy lower latency than with transformers!
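As a rough illustration of the warm-up argument (the prompts and timing code are mine, not from the measurements above), one can separate the one-time setup cost from steady-state inference like this:

import time
import torch
from vllm import LLM, SamplingParams

setup_start = time.perf_counter()
# Engine construction includes model loading, profile_run, and CUDA graph capture.
llm = LLM(
    model="huggyllama/llama-7b",
    dtype=torch.bfloat16,
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)
sampling_params = SamplingParams(temperature=0, max_tokens=128)
llm.generate(["warm-up prompt"], sampling_params)  # absorb any remaining lazy initialization
setup_time = time.perf_counter() - setup_start

# Steady-state request, roughly what an end user sees after the server is warm.
start = time.perf_counter()
outputs = llm.generate(["What is PagedAttention?"], sampling_params)
print(f"setup + warm-up: {setup_time:.1f} s, inference: {time.perf_counter() - start:.1f} s")

If CUDA graph capture turns out to be the dominant setup cost, passing enforce_eager=True to the LLM constructor skips it entirely, trading some steady-state speed for a faster start.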
We also conducted some experiments to test BNB performance (without model loading and profile_run):
Proposal to improve performance
Improve bitsandbytes quantization inference speed
Report of performance regression
I'm testing llama-3.2-1b on a toy dataset. For offline inference using the LLM class, the original model from Hugging Face took 45 seconds, but the 4-bit model (both in-flight quantized and unsloth quantized) took 71 seconds.
I wonder whether I'm not serving the quantized models properly, or whether it's expected that bnb quantization leads to much slower inference because it is under-optimized at this point.
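For reference, a sketch of the two loading paths mentioned above (the unsloth repository name is illustrative; substitute the exact checkpoint you used). In both cases the same bitsandbytes flags apply, and vLLM picks up the quantization config stored in a pre-quantized checkpoint:

from vllm import LLM

# In-flight quantization: start from the full-precision checkpoint.
llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
)

# Pre-quantized checkpoint (e.g., an unsloth bnb-4bit export), run separately:
# llm = LLM(
#     model="unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
#     quantization="bitsandbytes",
#     load_format="bitsandbytes",
# )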
Misc discussion on performance
No response
Your current environment (if you think it is necessary)
A100-80GB, vllm==0.6.3.post1