Here is another question about the smooth code: why do these operations on `y_range`, especially the `view` and `expand`?
https://github.com/mit-han-lab/lmquant/blob/c5c10897da4957cba47dd798080f5b2e1321f474/lmquant/quant/calib/calibrator/smooth.py#L289-L292
Hi @mxjmtxrm, thanks for your interest in our QServe and LMQuant! About the configuration file, I am working on detailed documentation introducing all the parameters.
As for your question about qdtype:

- `sint8` indicates symmetric signed INT8; the quantization range is [-127, 127].
- `zint8` indicates asymmetric signed INT8 (i.e., with zero point); the quantization range is [-128, 127] with a signed INT8 zero point.
- `uint8` indicates unsigned INT8; the quantization range is [0, 255].
- `nint8` indicates asymmetric unsigned INT8 (i.e., with zero point); the quantization range is [0, 255] with an unsigned INT8 zero point.

`zint8` and `nint8` are mathematically the same (with an offset of 128); a small sketch of the difference follows.
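For concreteness, here is a minimal sketch of how these four ranges could map to quantize/dequantize logic. The helper below is illustrative only, not LMQuant's actual API:

```python
import torch

# Illustrative quantizer covering the four ranges above (a sketch, not LMQuant's code).
QRANGES = {
    "sint8": (-127, 127),  # symmetric signed, no zero point
    "zint8": (-128, 127),  # asymmetric signed, signed zero point
    "uint8": (0, 255),     # unsigned, no zero point
    "nint8": (0, 255),     # asymmetric unsigned, unsigned zero point
}

def quantize(x: torch.Tensor, qdtype: str):
    qmin, qmax = QRANGES[qdtype]
    if qdtype == "sint8":                     # symmetric: zero point is 0
        scale = x.abs().amax() / qmax
        zero = torch.zeros(())
    elif qdtype == "uint8":                   # no zero point; assumes x >= 0
        scale = x.amax() / qmax
        zero = torch.zeros(())
    else:                                     # zint8 / nint8: with zero point
        scale = (x.amax() - x.amin()) / (qmax - qmin)
        zero = (qmin - x.amin() / scale).round()
    q = (x / scale + zero).round().clamp(qmin, qmax)
    return q, scale, zero                     # dequantize: (q - zero) * scale

x = torch.randn(16)
q_z, _, z_z = quantize(x, "zint8")
q_n, _, z_n = quantize(x, "nint8")
assert torch.equal(q_n, q_z + 128) and torch.equal(z_n, z_z + 128)  # offset of 128
```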
As for your question on SmoothAttention: the RoPE in `transformers` is implemented such that channel `i` and channel `i+D/2` are paired (see here), so we need `scale[i] == scale[i+D/2]`. `scale` is determined by the per-channel activation range. In the code, `y_range` is a tensor with shape `(C,)` that can be viewed as `(num_heads, 2, head_size / 2)` since `C = num_heads * head_size`. To ensure `scale[i] == scale[i+D/2]`, we perform a reduction over `dim=1` on Line 291. After the reduction, `y_range` has shape `(num_heads, 1, head_size / 2)`. By expanding it to shape `(num_heads, 2, head_size / 2)`, we let `y_range[i] == y_range[i+D/2]`, which leads to `scale[i] == scale[i+D/2]`.
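In shorthand, the shape manipulation looks like the sketch below; the variable names and the use of `amax` as the reduction are assumptions, so see the linked code for the exact operations:

```python
import torch

num_heads, head_size = 32, 128
C = num_heads * head_size

y_range = torch.rand(C)  # per-channel activation range, shape (C,)

# RoPE pairs channel i with channel i + head_size/2 within each head, so view
# the ranges as (num_heads, 2, head_size // 2) to place each pair along dim=1.
y = y_range.view(num_heads, 2, head_size // 2)

# Reduce over dim=1 so both channels of each pair share one range ...
y = y.amax(dim=1, keepdim=True)                    # (num_heads, 1, head_size // 2)

# ... then broadcast back, giving y_range[i] == y_range[i + head_size // 2].
y_range = y.expand(num_heads, 2, head_size // 2).reshape(C)
```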
I got it. Thanks.
@synxlin
It seems that the config yaml isn't actually used by the run file here. I simply used the yaml file here and modified `enabled_smooth` from true to false, but when I debug the code, the `enabled_smooth` flag still seems to be true. Is there a minor bug?
Hi, @brisker. How did you set `enabled_smooth` to `false`? To disable smooth, you should set `enable_smooth` (not `enabled_smooth`) to `false` in the yaml file.
@synxlin
Hi, I set `enable_smooth` and the problem is solved.
Besides, after running the w8a8 per-tensor static KV quantization for llama2-13b using this line of code, I got a folder containing model.pt and scale.pt. But the weird thing is that I cannot find any quantization parameters corresponding to kv8 static per-tensor quantization. So after I convert this folder using this code and run the generated files with this code to benchmark the w8a8kv8 running speed, I am not even sure whether it is w8a8 with per-tensor static kv8 quantization or w8a8 with per-token dynamic kv8 quantization. It seems that the kv8 quantization parameters have neither been used by QServe nor been generated and saved by lmquant.
Hi, @brisker. You are right about this, since I did not save the per-tensor scale factor of the KV cache in `scale.pt`. We are still working on cleaning up our code for all cases (W8A8KV8-static, etc.).
@synxlin So QServe cannot support static KV quantization right now, right?
Hi @brisker, thank you for your interest in QoQ and QServe. Yes, the current version of QServe mainly supports dynamic quantization, and the scaling factors/zeros for KV quantization are stored in the KV page. We found that dynamic quantization for the KV cache achieves better accuracy-efficiency trade-offs.
@synxlin Thanks for your prompt reply.
Besides, when I was reading the QServe code, I found that you put rotary_pos_emb and attention into one CUDA function, and that first-token generation and subsequent-token generation do not even share the same CUDA function; they use this and this, respectively.
So my questions are:

1. Why do you couple rotary_pos_emb with attention? It seems that if QServe needs to support more models (for example, InternLM2 or Qwen) and more kinds of rotary_pos_emb appear, this coupling could make the code inflexible. In vLLM, here and here, the two are separated.
2. What is the difference between the two functions mentioned above (the first-token and non-first-token generation CUDA functions)?
3. Why add `q_rotary_emb` and `k_rotary_emb` to the attention module here? I have not seen this in vLLM.
Hi, @brisker. For your questions:

1. Fusing `rotary_pos_emb` into the attention kernel helps reduce the number of kernel calls and achieves better throughput. This is a widely adopted solution in current efficient LLM systems, e.g., TensorRT-LLM.
2. First-token generation is the `prefilling` stage, where the computations in attention are GEMMs, while non-first-token generation is the `decoding` stage, where the computations in attention are GEMVs (see the sketch after this list).
3. We add `q_rotary_emb` and `k_rotary_emb`, whose outputs are post-RoPE Q and K.
@synxlin
w4a8kv4-gs128 means per-channel-static w4, per-token-dynamic a8, per-head-dynamic kv4, right?
Also, in the latest version of your paper, is there a minor slip of the pen in the following part?
@brisker Thank you for helping us find this typo. We will fix it in the next version of our paper.
@synxlin
In your paper, you mention dynamic per-head kv_cache quantization with scales and zero_points. Given a kv_cache value of shape `[batchsize, attention_head_nums, token_num, hidden_dim/attention_head_nums]`, are your quantization scales and zero_points of shape `[batchsize, attention_head_num, token_num]`? (I ran your kv4 quantization code on the WikiText test set with LLaMA; the float kv_cache data has shape `[2048, 4096]`, and the scales have shape `[2048, 1, 32, 1]`.)
If this is true, you seem to quantize every `hidden_dim/attention_head_nums` tensor elements into 4 bits with the same quantization scale, and `hidden_dim/attention_head_nums` is usually 128 for LLaMA; is anything I said wrong?
In Table 3, the zero-shot accuracy: is it for Llama-1 or Llama-2?
In Table 3, are the kv4 quantization scales and zeros obtained just by simple PyTorch amin and amax operations, with no other tricks?
I am running into some firewall issues; how can I evaluate zero-shot accuracy with your code? It seems that the datasets are downloaded online (I have all the data offline on disk), but I cannot find where to use the `datasets.load_from_disk()` function.
@brisker, for your questions:

- Every `hidden_dim/attention_head_nums` elements share the same quantization scale and zero point (see the sketch after this list).
- We use `lm_eval` to evaluate zero-shot accuracy; you can check the code here. You can refer to the `lm_eval` documentation to see how to pass a local dataset to `simple_evaluate()`.
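A hedged sketch of what per-head dynamic 4-bit quantization with `amin`/`amax` could look like for the shapes discussed above; this is illustrative only, not LMQuant's exact code:

```python
import torch

tokens, hidden_dim, num_heads = 2048, 4096, 32
head_size = hidden_dim // num_heads                  # 128 for LLaMA

# View the flat (tokens, hidden_dim) KV activations per token and per head.
kv = torch.randn(tokens, hidden_dim).view(tokens, 1, num_heads, head_size)

# One (scale, zero_point) pair per token per head -> shape (2048, 1, 32, 1),
# i.e., every head_size = 128 elements share the same quantization parameters.
vmax = kv.amax(dim=-1, keepdim=True)
vmin = kv.amin(dim=-1, keepdim=True)
scale = (vmax - vmin) / 15                           # 4-bit asymmetric: 15 steps
zero = (-vmin / scale).round().clamp(0, 15)
q = (kv / scale + zero).round().clamp(0, 15)         # unsigned 4-bit codes

print(scale.shape)                                   # torch.Size([2048, 1, 32, 1])
```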
Thanks for your prompt reply!
1. I noticed that the w4a8 model files are half the size of the w8a8 ones, and w8a8 is half the size of fp16, yet the stored dtype in the w4a8kv4 models is still int8 while the number of elements is half that of w8a8 and fp16. So you just store N/2 int8 numbers to represent N int4 numbers (the original tensor has N elements), right? This is necessary because there is no int4 storage dtype for now (see the sketch after this list).
2. The `reshape(linear.out_features // 32, 2, 2, 8, linear.in_features // 32, 2, 4, 4)`, `.permute(0, 4, 3, 6, 1, 5, 2, 7)`, and `.permute(0, 1, 2, 3, 5, 6, 7, 4)` operations are a little confusing to me, even after I read your paper.
3. The `layer.qweight` parameter, including here, also confuses me, and this may be related to my first question.
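Regarding point 1, a minimal sketch of packing two int4 values into one int8 byte; the nibble order and QServe's actual bit layout may differ, so this only illustrates the N -> N/2 storage idea:

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack N unsigned 4-bit codes (values 0..15) into N/2 int8 bytes."""
    q = q.to(torch.uint8).view(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).view(torch.int8)  # low nibble first (assumed)

def unpack_int4(p: torch.Tensor) -> torch.Tensor:
    """Recover the N unsigned 4-bit codes from N/2 packed bytes."""
    p = p.view(torch.uint8)
    return torch.stack((p & 0xF, p >> 4), dim=-1).view(-1)

q = torch.randint(0, 16, (8,))
assert torch.equal(unpack_int4(pack_int4(q)), q.to(torch.uint8))
```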
@synxlin
Hi, thanks for the great work. Is there a tutorial on how to create a custom config file, like awq.yaml? I am wondering what the meaning of each parameter is, like s/n/z int.