jy-yuan / KIVI

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
https://arxiv.org/abs/2402.02750
MIT License

Why is model inference slow when KIVI is applied to Mistral-7B-Instruct-v0.2? #15

Open lichongod opened 2 weeks ago

lichongod commented 2 weeks ago
[Screenshot: 2024-05-13 11:33:08]

As you can see, the top is the result with KIVI 2-bit applied, and the bottom is the 16-bit result. With KIVI, the token generation rate drops by about a quarter.

zirui-ray-liu commented 2 weeks ago

Thank you for your interest!

I noticed that the "tokens" are different under the two methods. Is this "tokens/time" measured on Qasper or under a specific input setting (e.g., 1024 prompt length & 512 output length)? And what is your hardware?

lichongod commented 2 weeks ago

Thank you for the response! It was executed with your evaluation script 'scripts/long_test.sh', measuring the total run time and the number of tokens generated, and the code was run on an A100 80GB.
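
(For reference, the measurement boils down to generated tokens divided by wall-clock time. A minimal sketch with plain transformers and an illustrative prompt, rather than the actual evaluation script, looks like this.)

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model and prompt; no KIVI patching here.
name = "mistralai/Mistral-7B-Instruct-v0.2"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda()

prompt = "Summarize the following article: ..."  # placeholder for a LongBench sample
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")

torch.cuda.synchronize()
start = time.time()
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
torch.cuda.synchronize()
elapsed = time.time() - start

new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
print(f"{new_tokens} tokens in {elapsed:.2f} s -> {new_tokens / elapsed:.2f} tokens/s")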

CUHKSZzxy commented 2 weeks ago

I also encountered a similar issue when testing the generation throughput on LongBench using the script "scripts/long_test.sh", with the following parameters:

gpuid=0
model='JackFram/llama-160m'
quant_method='kivi'
k_bits=2
v_bits=2
group_size=32
residual_length=128
e=0

CUDA_VISIBLE_DEVICES=$gpuid python pred_long_bench.py \
    --model_name_or_path $model \
    --cache_dir ./cached_models \
    --quant_method $quant_method \
    --k_bits $k_bits \
    --v_bits $v_bits \
    --group_size $group_size \
    --residual_length $residual_length \
    --e $e

I found that the throughput with KIVI is lower than the baseline (i.e., the original transformers LlamaForCausalLM).

I have observed in Figure 4 (memory usage and throughput) of the paper that, when the batch size is small, the baseline demonstrates higher throughput than KIVI. I suspect the slower inference speed on LongBench comes from the small input batch size, which is 1 in the LongBench test scripts as far as I understand. Therefore, I conducted further experiments using "mem_spd_test.py" with the following parameter settings:

prompt_lenth = 160
output_lengths= 338
num_repeats = 3
batch_sizes = 1

Here are the results.

Baseline (Hugging Face LlamaForCausalLM):

bs: 1, seqlen: 160+338
model: llama-160m
start time: 1715748731.4499853
used time: 3727.741797765096 ms
peak mem: 0.36052417755126953 GB

KIVI (same parameters as before):

bs: 1, seqlen: 160+338
model: llama-160m
start time: 1715748756.3054838
used time: 7293.251832326253 ms
peak mem: 0.35855579376220703 GB

It is obvious that KIVI runs slower than the baseline when the batch size is small.

For your information, I used Llama-160m for quick tests, but I believe the same results hold for larger models like Llama-7B. Experiments were performed on an NVIDIA Tesla P100-PCIE-16GB.
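
(For completeness, this is the kind of rough batch-size sweep I would use to check whether the gap closes once the KV cache grows. It uses plain transformers with synthetic prompts and is only a sketch, not the repo's mem_spd_test.py; the KIVI-patched model would be swapped in the same way that script loads it.)

import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "JackFram/llama-160m"  # small model for quick tests
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.float16).cuda().eval()

def tokens_per_second(model, batch_size, prompt_len=160, gen_len=338, repeats=3):
    # Synthetic prompts: random token ids of the requested length.
    input_ids = torch.randint(0, tokenizer.vocab_size, (batch_size, prompt_len), device="cuda")
    attention_mask = torch.ones_like(input_ids)
    best = float("inf")
    for _ in range(repeats):
        torch.cuda.synchronize()
        start = time.time()
        model.generate(input_ids, attention_mask=attention_mask,
                       max_new_tokens=gen_len, min_new_tokens=gen_len, do_sample=False)
        torch.cuda.synchronize()
        best = min(best, time.time() - start)
    return batch_size * gen_len / best

for bs in (1, 4, 16, 32):
    # Swap in the KIVI-patched model here to compare the two curves.
    print(f"bs={bs}: {tokens_per_second(model, bs):.1f} tokens/s")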

How can this be solved so that KIVI generation on LongBench is faster? Any help would be appreciated.

zirui-ray-liu commented 2 weeks ago

@CUHKSZzxy @lichongod

Thank you guys for the detailed benchmark. We also noticed this in our previous experiments. LongBench is tested under the batch size == 1 setting. To ensure the correctness of our results, we frequently use extra contiguous, transpose, reshape, and torch.concat ops, which incurs extra overhead. While this overhead can be mitigated by using larger batch sizes or longer sequences, it remains significant when the KV cache is small (batch size of 1). That said, the code is not fully optimized for settings with a small KV cache.

We have a beta version that addresses this problem (basically merging/removing these unnecessary extra ops), but we haven't tested it on real workloads yet. We will release it once we have.
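
As a rough illustration (this is not the actual KIVI code), a toy micro-benchmark of such layout ops shows that they act as a fixed tax on every decoding step, which is most noticeable when the rest of the step is cheap, e.g., at batch size 1:

import time
import torch

def layout_ops_ms(kv, new_block, iters=200):
    # cat + transpose + contiguous + reshape, mimicking the extra layout work
    # done when appending new keys/values and regrouping them for quantization.
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        x = torch.concat([kv, new_block], dim=2)   # append the new token's KV
        x = x.transpose(2, 3).contiguous()         # change memory layout
        x = x.reshape(x.shape[0], x.shape[1], -1)  # flatten into groups
    torch.cuda.synchronize()
    return (time.time() - start) / iters * 1e3

for bs in (1, 32):
    # (batch, heads, seq_len, head_dim); the sizes are illustrative only.
    kv = torch.randn(bs, 32, 1024, 128, device="cuda", dtype=torch.float16)
    new = torch.randn(bs, 32, 1, 128, device="cuda", dtype=torch.float16)
    ms = layout_ops_ms(kv, new)
    print(f"bs={bs}: {ms:.3f} ms per step ({ms / bs:.3f} ms per sequence)")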

CUHKSZzxy commented 2 weeks ago

Thanks for your prompt reply and looking forward to the optimized version!

lichongod commented 2 weeks ago

Thanks for your prompt reply and looking forward to the optimized version!