ViT-B add ptf reshape_tensor question

megvii-research / FQ-ViT

[IJCAI 2022] FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer

Apache License 2.0


ViT-B add ptf reshape_tensor question #19

Closed tpoisonooo closed 2 years ago

tpoisonooo commented 2 years ago

I am running ViT-B:

$  python3 test_quant.py  vit_base ./quantdata/ --quant --ptf --lis --quant-method minmax

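The output shape of the add opr is [1, 197, 768]. Under channel-wise semantics, shouldn't the min/max have shape [197]?

Why does reshape_tensor do a transpose, so that the final shape ends up as [768]?

Later on, get_reshape_range in quant deliberately uses (1, 1, -1), so this looks less like a bug and more like a deliberate design.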

    def reshape_tensor(self, v):
        if not isinstance(v, torch.Tensor):
            v = torch.tensor(v)
        v = v.detach()
        if self.module_type in ['conv_weight', 'linear_weight']:
            v = v.reshape(v.shape[0], -1)
        elif self.module_type == 'activation':
            if len(v.shape) == 4:
                v = v.permute(0, 2, 3, 1)
            v = v.reshape(-1, v.shape[-1])
            v = v.transpose(0, 1)    # why transpose here?
        else:
            raise NotImplementedError
        return v

linyang-zhh commented 2 years ago

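Yes! In a CNN, features are laid out as (B, C, H, W), meaning BatchSize, Channel, Height, Width. In a Transformer, features are laid out as (B, N, C), meaning BatchSize, TokenNum, Channel.

So for Transformer activations, channel-wise quantization is applied along the last dimension.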
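To make the shapes concrete, here is a minimal sketch that mirrors the `activation` branch of `reshape_tensor` quoted above. The helper name `flatten_to_channels` and the per-row min/max reduction are assumptions made for illustration, not code taken from the repository; only the permute/reshape/transpose steps come from the snippet in this thread.

```python
import torch

def flatten_to_channels(v: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: bring the channel axis first, flatten the rest."""
    if v.dim() == 4:                    # CNN layout (B, C, H, W)
        v = v.permute(0, 2, 3, 1)       # -> (B, H, W, C)
    v = v.reshape(-1, v.shape[-1])      # -> (M, C), M = all non-channel positions
    return v.transpose(0, 1)            # -> (C, M), one row per channel

# Transformer activation: (B, N, C) = (1, 197, 768)
tokens = torch.randn(1, 197, 768)
flat = flatten_to_channels(tokens)
min_val = flat.min(dim=1).values        # assumed reduction: one min per channel
max_val = flat.max(dim=1).values        # one max per channel
print(flat.shape, min_val.shape)

# A per-channel range broadcasts against (B, N, C) once reshaped to (1, 1, -1).
scale = (max_val - min_val) / 255.0
scaled = tokens / scale.reshape(1, 1, -1)

# CNN activation: (B, C, H, W) = (2, 16, 7, 7) also ends up with C rows.
feat = torch.randn(2, 16, 7, 7)
print(flatten_to_channels(feat).shape)
```

With the transpose, each row holds every value of one channel, so a reduction over `dim=1` yields 768 ranges for ViT-B rather than 197.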

tpoisonooo commented 2 years ago

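... I did not expect it to work this way ...

... So for the same add opr, whether you take 197 or 768 depends entirely on which kind of network it sits in, right? ...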

linyang-zhh commented 2 years ago

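Yes 😢, it is ambiguous. A fairly simple way to tell the two kinds of activation apart is to check whether tensor.dim() is 3 or 4.

I think the main reason for this design is that the Conv2d operator takes (B, C, H, W) input by default, while a Transformer is mostly Linear layers, which take (B, N, C) input.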
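The `dim()`-based rule can be sketched as follows. The helpers `channel_axis`, `broadcast_shape`, and `per_channel_max` are hypothetical names introduced here for illustration; the only parts taken from this thread are the two layouts and the (1, 1, -1) shape used for 3-D activations.

```python
import torch

def channel_axis(x: torch.Tensor) -> int:
    """Hypothetical helper: guess the channel axis from the tensor rank."""
    if x.dim() == 3:      # Transformer layout (B, N, C)
        return 2
    if x.dim() == 4:      # CNN layout (B, C, H, W)
        return 1
    raise ValueError(f"unsupported activation rank: {x.dim()}")

def broadcast_shape(x: torch.Tensor) -> tuple:
    """Shape that lets a per-channel vector broadcast against x."""
    shape = [1] * x.dim()
    shape[channel_axis(x)] = -1
    return tuple(shape)

def per_channel_max(x: torch.Tensor) -> torch.Tensor:
    """Reduce over every axis except the channel axis."""
    axis = channel_axis(x)
    reduce_dims = tuple(d for d in range(x.dim()) if d != axis)
    return x.amax(dim=reduce_dims)

tokens = torch.randn(1, 197, 768)
feat = torch.randn(2, 16, 7, 7)
print(broadcast_shape(tokens), per_channel_max(tokens).shape)
print(broadcast_shape(feat), per_channel_max(feat).shape)
```

The rank check is only a heuristic: a 3-D tensor that is not laid out as (B, N, C) would be misread, which is the ambiguity discussed above.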

tpoisonooo commented 2 years ago

One more question: in QAct, the quantizer output is quant() followed by dequant(), so it is actually fp32.

    def forward(self, x):
        if self.calibrate:
            self.quantizer.observer.update(x)
            if self.last_calibrate:
                # import ipdb;ipdb.set_trace()
                self.quantizer.update_quantization_params(x)
        if not self.quant:
            return x
        x = self.quantizer(x)   # this output is actually fp32
        return x
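As an illustration of why the output is fp32, here is a minimal fake-quantization sketch. The function `fake_quant` and the per-tensor uint8 min/max calibration are assumptions chosen for brevity; they are not the repository's quantizer, which the thread runs with `--ptf` and `--quant-method minmax`.

```python
import torch

def fake_quant(x, scale, zero_point, qmin=0, qmax=255):
    """Hypothetical quantize-then-dequantize step (simulated uint8)."""
    # Quantize: integer-valued levels, but still stored in a float tensor.
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    # Dequantize: map the levels back to real values.
    return (q - zero_point) * scale

x = torch.randn(1, 197, 768)
scale = (x.max() - x.min()) / 255.0          # per-tensor min/max calibration
zero_point = torch.round(-x.min() / scale)
y = fake_quant(x, scale, zero_point)

print(y.dtype)                   # the dtype is unchanged from the input
print(torch.unique(y).numel())   # at most 256 distinct values survive
```

The result has the precision of uint8 but the dtype of the input, so a real integer pipeline has to carry the integer levels and the scale separately.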

The sub-structure of one block looks like this:

 add0 ----- layernorm -- multiheadattention -----add1
     \____________________________________________/

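An fp32 output means extra requant and dequant steps during inference. Is there a way to align the scale and zero point (zp) of the two inputs of add1, so that add0 can output int8?

The <scale, zp> of add0.out would then be passed directly to layernorm and fused into the layernorm computation.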

tpoisonooo commented 2 years ago

The intent is to avoid the requant/dequant/quant steps caused by add.

linyang-zhh commented 2 years ago

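The current code logic should support this, because there is another qact after add1, and the scale of that qact ($s_{add1}$ in the formulas below) can be propagated back to the preceding operators.

In that case, the uint8 output of add0 can be fed to the layernorm branch, and that branch needs one requant just before add1; the residual branch that feeds add1 only needs one requant (rather than a dequant).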

FakeQuant formula: $$ s_{add1} \times \lceil \frac{add1}{s_{add1}} \rfloor = s_{attn} \times \lceil \frac{attn}{s_{attn}} \rfloor + s_{add0} \times \lceil \frac{add0}{s_{add0}} \rfloor $$

Dividing both sides by $s_{add1}$ gives: $$ \lceil \frac{add1}{s_{add1}} \rfloor = \frac{s_{attn}}{s_{add1}} \times \lceil \frac{attn}{s_{attn}} \rfloor + \frac{s_{add0}}{s_{add1}} \times \lceil \frac{add0}{s_{add0}} \rfloor $$
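A small numeric check of the two formulas, as a sketch under stated assumptions: the scales and sample values below are made up, zero points are taken as 0 to match the formulas, and the scale ratios are kept as Python floats, whereas a real integer kernel would typically use fixed-point multipliers.

```python
# Made-up scales for the three quantizers involved.
s_attn, s_add0, s_add1 = 0.02, 0.05, 0.06

# Made-up real-valued inputs of add1 (attention branch, residual branch).
attn_vals = [1.234, -0.517, 0.088]
add0_vals = [-2.41, 0.733, 1.905]

for attn, add0 in zip(attn_vals, add0_vals):
    # Integer levels of the two inputs, each under its own scale.
    q_attn = round(attn / s_attn)
    q_add0 = round(add0 / s_add0)

    # Fake-quant path: dequantize both inputs, add in floating point,
    # then quantize the sum with the scale of the qact after add1.
    fp_sum = s_attn * q_attn + s_add0 * q_add0
    q_ref = round(fp_sum / s_add1)

    # Requant path: rescale each integer input into the add1 scale
    # and add, without materializing the floating-point sum.
    q_add1 = round((s_attn / s_add1) * q_attn + (s_add0 / s_add1) * q_add0)

    print(q_attn, q_add0, q_ref, q_add1)
    assert q_add1 == q_ref
```

Both paths compute the same quantity before the final rounding, so they can differ only when floating-point error pushes a value across a rounding tie.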