lichongod opened 2 weeks ago
Thank you for your interest!
I noticed that the "tokens" are different under the two methods. Is this "tokens/time" measured on Qasper or under a specific input setting (e.g., 1024 prompt length & 512 output length)? And what hardware did you use?
Thank you for the response! It was executed with your evaluation script `scripts/long_test.sh`, measuring the total run time and the number of tokens generated. The code was run on an A100 80GB.
I have also encountered similar issues. When I tested the generation throughput on LongBench using the script `scripts/long_test.sh`, with the following parameter details:
```shell
gpuid=0
model='JackFram/llama-160m'
quant_method='kivi'
k_bits=2
v_bits=2
group_size=32
residual_length=128
e=0

CUDA_VISIBLE_DEVICES=$gpuid python pred_long_bench.py \
    --model_name_or_path $model \
    --cache_dir ./cached_models \
    --quant_method $quant_method \
    --k_bits $k_bits \
    --v_bits $v_bits \
    --group_size $group_size \
    --residual_length $residual_length \
    --e $e
```
I found that the throughput with KIVI is lower than the baseline (i.e., the original transformers LlamaForCausalLM).

I have observed in Figure 4 (memory usage and throughput) of the paper that, when the batch size is small, the baseline demonstrates higher throughput than KIVI. I suspect that the slower inference speed on LongBench stems from the small input batch size, which is 1 in the LongBench test scripts according to my understanding. Therefore, I further conducted the following experiments using `mem_spd_test.py`, with the following parameter settings:
```python
prompt_lenth = 160
output_lengths = 338
num_repeats = 3
batch_sizes = 1
```
Here are the results:

Baseline (Hugging Face LlamaForCausalLM):

```
bs: 1, seqlen: 160+338
model: llama-160m
start time: 1715748731.4499853
used time: 3727.741797765096 ms
peak mem: 0.36052417755126953 GB
```
KIVI (same parameters as before):

```
bs: 1, seqlen: 160+338
model: llama-160m
start time: 1715748756.3054838
used time: 7293.251832326253 ms
peak mem: 0.35855579376220703 GB
```
It is obvious that KIVI runs slower than the baseline when the batch size is small.
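For concreteness, the reported "used time" values can be converted into per-token throughput (a back-of-envelope check I wrote for this comment, not part of the repo; it assumes all 338 output tokens were generated in each run):

```python
# Convert the "used time" values reported above into tokens/s.
output_tokens = 338

baseline_ms = 3727.74  # baseline (HF LlamaForCausalLM) used time
kivi_ms = 7293.25      # KIVI used time

baseline_tps = output_tokens / (baseline_ms / 1000.0)
kivi_tps = output_tokens / (kivi_ms / 1000.0)

print(f"baseline: {baseline_tps:.1f} tokens/s")   # ~90.7
print(f"kivi:     {kivi_tps:.1f} tokens/s")       # ~46.3
print(f"slowdown: {kivi_ms / baseline_ms:.2f}x")  # ~1.96x
```

So at batch size 1 on this hardware, KIVI runs at roughly half the baseline's generation speed.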
For your information, I used Llama-160m for quick tests, but I believe the same results hold for larger models like Llama-7B. Experiments were performed on an NVIDIA Tesla P100-PCIE-16GB.
How to solve this problem and speed up the KIVI generation on LongBench? Any help would be appreciated.
@CUHKSZzxy @lichongod
Thank you guys for the detailed benchmark. We also noticed this in our previous experiments. LongBench is tested under the batch size == 1 setting. To ensure the correctness of our results, we frequently use extra `contiguous`, `transpose`, `reshape`, and `torch.concat` ops, which incurs extra overhead. While this overhead can be mitigated by using larger batch sizes or longer sequences, it remains significant when the KV cache is small (batch size of 1). That said, the code is not fully optimized for settings with a small KV cache.

We have a beta version that solves this problem (basically merging/deleting these unnecessary extra ops), but we haven't tested it on a real workload yet. We will release it once we have tested it on a real workload.
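To illustrate the kind of overhead meant here, below is a NumPy sketch (a hypothetical analogue, not the repository's PyTorch code; the shapes and the `step_with_layout_fixes` helper are invented for the example) of a per-decoding-step cache update that pays for extra concat/transpose/contiguous-copy/reshape passes:

```python
import numpy as np

def step_with_layout_fixes(cache, new_kv):
    # append the new key/value slice along the sequence axis
    cache = np.concatenate([cache, new_kv], axis=2)
    # transpose then force a contiguous copy: an extra full pass over the cache
    fixed = np.ascontiguousarray(cache.transpose(0, 2, 1, 3))
    # transpose back and reshape: pure layout bookkeeping, no new information
    return fixed.transpose(0, 2, 1, 3).reshape(cache.shape)

batch, heads, head_dim = 1, 12, 64
cache = np.zeros((batch, heads, 0, head_dim), dtype=np.float16)
for _ in range(8):  # simulate 8 decoding steps
    new_kv = np.ones((batch, heads, 1, head_dim), dtype=np.float16)
    cache = step_with_layout_fixes(cache, new_kv)
print(cache.shape)  # (1, 12, 8, 64)
```

The cost of these layout-fixing passes is roughly fixed per decoding step, so it is amortized over larger batches or longer sequences but dominates when the KV cache is tiny (batch size 1).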
Thanks for your prompt reply, and I'm looking forward to the optimized version!
As you can see, the top is the result with KIVI 2-bit applied, and the bottom is the 16-bit result. With KIVI, token generation is reduced by a quarter.