Here is another question about the smooth code: why do these operations on `y_range`, especially the `view` and `expand`?
https://github.com/mit-han-lab/lmquant/blob/c5c10897da4957cba47dd798080f5b2e1321f474/lmquant/quant/calib/calibrator/smooth.py#L289-L292
Hi @mxjmtxrm, thanks for your interest in our QServe and LMQuant! About the configuration file, I am working on detailed documentation introducing all the parameters.
As for your question about qdtype:

- `sint8` indicates symmetric signed INT8; the quantization range is [-127, 127].
- `zint8` indicates asymmetric signed INT8 (i.e., with zero point); the quantization range is [-128, 127] with a signed INT8 zero point.
- `uint8` indicates unsigned INT8; the quantization range is [0, 255].
- `nint8` indicates asymmetric unsigned INT8 (i.e., with zero point); the quantization range is [0, 255] with an unsigned INT8 zero point.

`zint8` and `nint8` are mathematically the same (with an offset of 128); a small sketch of the difference follows.
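For concreteness, here is a minimal sketch of how these four ranges could map to quantize/dequantize logic. The helper below is illustrative only, not LMQuant's actual API:

```python
import torch

# Illustrative quantizer covering the four ranges above (a sketch, not LMQuant's code).
QRANGES = {
    "sint8": (-127, 127),  # symmetric signed, no zero point
    "zint8": (-128, 127),  # asymmetric signed, signed zero point
    "uint8": (0, 255),     # unsigned, no zero point
    "nint8": (0, 255),     # asymmetric unsigned, unsigned zero point
}

def quantize(x: torch.Tensor, qdtype: str):
    qmin, qmax = QRANGES[qdtype]
    if qdtype == "sint8":                     # symmetric: zero point is 0
        scale = x.abs().amax() / qmax
        zero = torch.zeros(())
    elif qdtype == "uint8":                   # no zero point; assumes x >= 0
        scale = x.amax() / qmax
        zero = torch.zeros(())
    else:                                     # zint8 / nint8: with zero point
        scale = (x.amax() - x.amin()) / (qmax - qmin)
        zero = (qmin - x.amin() / scale).round()
    q = (x / scale + zero).round().clamp(qmin, qmax)
    return q, scale, zero                     # dequantize: (q - zero) * scale

x = torch.randn(16)
q_z, _, z_z = quantize(x, "zint8")
q_n, _, z_n = quantize(x, "nint8")
assert torch.equal(q_n, q_z + 128) and torch.equal(z_n, z_z + 128)  # offset of 128
```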
As for your question on SmoothAttention: the RoPE in `transformers` is implemented such that channel `i` and channel `i+D/2` are paired (see here), so we need `scale[i] == scale[i+D/2]`. `scale` is determined by the per-channel activation range. In the code, `y_range` is a tensor with shape `(C,)` that can be viewed as `(num_heads, 2, head_size / 2)` since `C = num_heads * head_size`. To ensure `scale[i] == scale[i+D/2]`, we perform a reduction over `dim=1` on Line 291. After the reduction, `y_range` has shape `(num_heads, 1, head_size / 2)`. By expanding it to shape `(num_heads, 2, head_size / 2)`, we let `y_range[i] == y_range[i+D/2]`, which leads to `scale[i] == scale[i+D/2]`.
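In shorthand, the shape manipulation looks like the sketch below; the variable names and the use of `amax` as the reduction are assumptions, so see the linked code for the exact operations:

```python
import torch

num_heads, head_size = 32, 128
C = num_heads * head_size

y_range = torch.rand(C)  # per-channel activation range, shape (C,)

# RoPE pairs channel i with channel i + head_size/2 within each head, so view
# the ranges as (num_heads, 2, head_size // 2) to place each pair along dim=1.
y = y_range.view(num_heads, 2, head_size // 2)

# Reduce over dim=1 so both channels of each pair share one range ...
y = y.amax(dim=1, keepdim=True)                    # (num_heads, 1, head_size // 2)

# ... then broadcast back, giving y_range[i] == y_range[i + head_size // 2].
y_range = y.expand(num_heads, 2, head_size // 2).reshape(C)
```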
I got it. Thanks.
@synxlin
It seems that the config yaml isn't actually used by the run file here. I simply used the yaml file here and modified `enabled_smooth` from true to false, but when I debug the code, the `enabled_smooth` flag still seems to be true. Is there a minor bug?
Hi, @brisker. How did you set `enabled_smooth` to `false`? To disable smooth, you should set `enable_smooth` (not `enabled_smooth`) to `false` in the yaml file.
@synxlin
Hi, I set `enable_smooth` and the problem is solved.
Besides, after running the w8a8 per-tensor static KV quantization for llama2-13b using this line of code, I got a folder containing model.pt and scale.pt. But the weird thing is that I cannot find any quantization parameters corresponding to kv8 static per-tensor quantization. So after I convert this folder using this code and run the generated files with this code to benchmark the w8a8kv8 running speed, I am not even sure whether it is w8a8 with per-tensor static kv8 quantization or w8a8 with per-token dynamic kv8 quantization. It seems that the kv8 quantization parameters have neither been used by QServe nor been generated and saved by lmquant.
Hi, @brisker. You are right about this, since I did not save the per-tensor scale factor of the KV cache in `scale.pt`. We are still working on cleaning up our code for all cases (W8A8KV8-static, etc.).
@synxlin So QServe cannot support static KV quantization right now, right?
Hi @brisker, thank you for your interest in QoQ and QServe. Yes, the current version of QServe mainly supports dynamic quantization, and the scaling factors/zeros for KV quantization are stored in the KV page. We found that dynamic quantization for the KV cache achieves better accuracy-efficiency trade-offs.
@synxlin Thanks for your prompt reply.
Besides, when I was reading the QServe code, I found that you put rotary_pos_emb and attention into one CUDA function, and that first-token generation and subsequent-token generation do not even share the same CUDA function; they use this and this, respectively.
So my questions are:

1. Why do you couple rotary_pos_emb with attention? It seems that if QServe needs to support more models (for example, InternLM2 or Qwen) and more kinds of rotary_pos_emb appear, this coupling could make the code inflexible. In vLLM, here and here, the two are separated.
2. What is the difference between the two functions mentioned above (the first-token and non-first-token generation CUDA functions)?
3. Why add `q_rotary_emb` and `k_rotary_emb` to the attention module here? I have not seen this in vLLM.
Hi, @brisker. For your questions:

1. Fusing `rotary_pos_emb` into the attention kernel helps reduce the number of kernel calls and achieves better throughput. This is a widely adopted solution in current efficient LLM systems, e.g., TensorRT-LLM.
2. First-token generation is the `prefilling` stage, where the computations in attention are GEMMs, while non-first-token generation is the `decoding` stage, where the computations in attention are GEMVs (see the sketch after this list).
3. We add `q_rotary_emb` and `k_rotary_emb`, whose outputs are post-RoPE Q and K.
@synxlin
w4a8kv4-gs128 means per-channel-static w4, per-token-dynamic a8, per-head-dynamic kv4, right?
Also, in the latest version of your paper, is there a minor slip of the pen in the following part?
@brisker Thank you for helping us find this typo. We will fix it in the next version of our paper.
@synxlin
In your paper, you mention dynamic per-head kv_cache quantization with scales and zero_points. Given a kv_cache value of shape `[batchsize, attention_head_nums, token_num, hidden_dim/attention_head_nums]`, are your quantization scales and zero_points of shape `[batchsize, attention_head_num, token_num]`? (I ran your kv4 quantization code on the WikiText test set with LLaMA; the float kv_cache data has shape `[2048, 4096]`, and the scales have shape `[2048, 1, 32, 1]`.)
If this is true, you seem to quantize every `hidden_dim/attention_head_nums` tensor elements into 4 bits with the same quantization scale, and `hidden_dim/attention_head_nums` is usually 128 for LLaMA; is anything I said wrong?
In Table 3, the zero-shot accuracy: is it for Llama-1 or Llama-2?
In Table 3, are the kv4 quantization scales and zeros obtained just by simple PyTorch amin and amax operations, with no other tricks?
I am running into some firewall issues; how can I evaluate zero-shot accuracy with your code? It seems that the datasets are downloaded online (I have all the data offline on disk), but I cannot find where to use the `datasets.load_from_disk()` function.
@brisker, for your questions:

- Every `hidden_dim/attention_head_nums` elements share the same quantization scale and zero point (see the sketch after this list).
- We use `lm_eval` to evaluate zero-shot accuracy; you can check the code here. You can refer to the `lm_eval` documentation to see how to pass a local dataset to `simple_evaluate()`.
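A hedged sketch of what per-head dynamic 4-bit quantization with `amin`/`amax` could look like for the shapes discussed above; this is illustrative only, not LMQuant's exact code:

```python
import torch

tokens, hidden_dim, num_heads = 2048, 4096, 32
head_size = hidden_dim // num_heads                  # 128 for LLaMA

# View the flat (tokens, hidden_dim) KV activations per token and per head.
kv = torch.randn(tokens, hidden_dim).view(tokens, 1, num_heads, head_size)

# One (scale, zero_point) pair per token per head -> shape (2048, 1, 32, 1),
# i.e., every head_size = 128 elements share the same quantization parameters.
vmax = kv.amax(dim=-1, keepdim=True)
vmin = kv.amin(dim=-1, keepdim=True)
scale = (vmax - vmin) / 15                           # 4-bit asymmetric: 15 steps
zero = (-vmin / scale).round().clamp(0, 15)
q = (kv / scale + zero).round().clamp(0, 15)         # unsigned 4-bit codes

print(scale.shape)                                   # torch.Size([2048, 1, 32, 1])
```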
Thanks for your prompt reply!
1. I noticed that the w4a8 model files are half the size of the w8a8 ones, and w8a8 is half the size of fp16, yet the stored dtype in the w4a8kv4 models is still int8 while the number of elements is half that of w8a8 and fp16. So you just store N/2 int8 numbers to represent N int4 numbers (the original tensor has N elements), right? This is necessary because there is no int4 storage dtype for now (see the sketch after this list).
2. The `reshape(linear.out_features // 32, 2, 2, 8, linear.in_features // 32, 2, 4, 4)`, `.permute(0, 4, 3, 6, 1, 5, 2, 7)`, and `.permute(0, 1, 2, 3, 5, 6, 7, 4)` operations are a little confusing to me, even after I read your paper.
3. The `layer.qweight` parameter, including here, also confuses me, and this may be related to my first question.
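Regarding point 1, a minimal sketch of packing two int4 values into one int8 byte; the nibble order and QServe's actual bit layout may differ, so this only illustrates the N -> N/2 storage idea:

```python
import torch

def pack_int4(q: torch.Tensor) -> torch.Tensor:
    """Pack N unsigned 4-bit codes (values 0..15) into N/2 int8 bytes."""
    q = q.to(torch.uint8).view(-1, 2)
    return (q[:, 0] | (q[:, 1] << 4)).view(torch.int8)  # low nibble first (assumed)

def unpack_int4(p: torch.Tensor) -> torch.Tensor:
    """Recover the N unsigned 4-bit codes from N/2 packed bytes."""
    p = p.view(torch.uint8)
    return torch.stack((p & 0xF, p >> 4), dim=-1).view(-1)

q = torch.randint(0, 16, (8,))
assert torch.equal(unpack_int4(pack_int4(q)), q.to(torch.uint8))
```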
@synxlin
Hi, thanks for the great work. Is there a tutorial on how to create a custom config file, like awq.yaml? I am wondering what the meaning of each parameter is, like s/n/z int.