log_int_softmax int64 issue

tpoisonooo commented 2 years ago

Hello, I am porting FQ-ViT to ncnn. The WIP branch is at https://github.com/tpoisonooo/ncnn/blob/75061d9d46654a4abf52969ea6bfe53698177db9/src/layer/multiheadattention.cpp#L181 and I am still debugging it...

When computing log_int_softmax, the following statement can exceed the int32 range:

exp_int, exp_scaling_factor = int_polynomial(r, scaling_factor)
exp_int = torch.clamp(torch.floor(exp_int * 2**(n - q)), min=0)

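For example: [x, r, exp_int] = [0, 0, 60129542144]

In that case I have no choice but to use int64_t (and computing the sum afterwards needs extra memory). Is there any way to stay within int32_t?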
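The reported value can be checked against the int32 range with a little arithmetic. The sketch below is illustrative only: it assumes n = 30 and q = 0 for the reported case (so the value before the shift would be 56), and the helper name `max_safe_shift` is hypothetical.

```python
INT32_MAX = 2**31 - 1            # 2147483647

reported = 60129542144           # exp_int value quoted above
assert reported == 56 * 2**30    # i.e. 56 shifted left by 30 bits
print(reported > INT32_MAX)      # True: does not fit in int32_t
print(reported.bit_length())     # 36 bits of magnitude are needed


def max_safe_shift(value: int) -> int:
    """Largest left shift that keeps a positive value within int32 (hypothetical helper)."""
    if value <= 0:
        raise ValueError("value must be positive")
    return 31 - value.bit_length()


print(max_safe_shift(56))        # 25: 56 << 25 fits, 56 << 26 does not
```

Under these assumptions, any n - q above 25 overflows int32_t for a pre-shift value of 56, so the shift amount would have to come down for a 32-bit accumulator to suffice.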

tpoisonooo commented 2 years ago

My guess is that n needs to be adjusted?

tpoisonooo commented 2 years ago

The WIP work is at https://github.com/Tencent/ncnn/pull/3940 ; building and testing now.

Longday0923 commented 2 years ago

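Thank you for the careful finding and the experiments, and sorry for the slow reply.

My implementation was taken directly from I-BERT's Int-Exp: [link].

When we studied their Int-Exp, we were also puzzled by this part of the implementation.

My guess is that the original design of n is meant to turn the right shift by q bits into a left shift and then emit a scale of 1/2**n, which is mathematically equivalent. The reason may be that a right shift was considered more likely to introduce truncation error.

We also ran experiments, and this design does have some probability of exceeding 32 bits. We did an ablation study on the value of n as well: on ViT/DeiT/Swin, n=20 should cost only about 0.1% accuracy. With that setting the result should no longer exceed 32 bits.

One thing to note: this implementation contains a clamping step that guarantees n is never smaller than q: his code [link], which corresponds to this place in our code ([link]). A smaller n means more of the input gets clamped; whether that has other effects is not yet clear.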
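The roles of the clamp, the left shift by n - q, and the 1/2**n scale can be sketched as follows. This is a minimal pure-Python sketch assuming an I-BERT-style structure; the polynomial coefficients, the constant `X0`, and the scaling factor 0.223 are assumptions chosen for illustration (0.223 happens to reproduce the value quoted earlier in this thread), not values taken from the actual model.

```python
import math

# Assumed constants: exp(r) ~= 0.358*r^2 + 0.970*r + 1.0 on (-ln 2, 0].
X0 = -0.6931                     # approx. -ln(2)
COEF = (0.35815147, 0.96963238, 1.0)


def int_exp(x_int: int, scaling_factor: float, n: int = 30):
    """Integer-only exp sketch for x_int <= 0; returns (exp_int, scale)."""
    x0_int = math.floor(X0 / scaling_factor)   # -ln 2 in integer units (negative)
    x_int = max(x_int, n * x0_int)             # clamp so that q <= n below
    q = x_int // x0_int                        # number of halvings, 0 <= q <= n
    r = x_int - x0_int * q                     # remainder in (x0_int, 0]

    # Polynomial a*(r^2 + (b/a)*r + c/a) with integer coefficients.
    b_int = math.floor(COEF[1] / COEF[0] / scaling_factor)
    c_int = math.floor(COEF[2] / COEF[0] / scaling_factor**2)
    poly = (r + b_int) * r + c_int
    poly_scale = COEF[0] * scaling_factor**2

    # Left shift by (n - q) instead of a right shift by q;
    # the factor 1/2**n is folded into the returned scale.
    exp_int = max(poly << (n - q), 0)
    return exp_int, poly_scale / 2**n


s = 0.223                        # hypothetical scaling factor
for n in (30, 20):
    e, scale = int_exp(0, s, n)
    print(n, e, e <= 2**31 - 1)  # n=30 -> 60129542144 (overflows), n=20 -> 58720256 (fits)
```

In this sketch the dequantized value `e * scale` is the same for both n at x = 0; what changes with a smaller n is how early very negative inputs hit the clamp.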

Longday0923 commented 2 years ago

Some experiments we ran earlier:

Model	Quantization Method	W/A/Attn Bits	Top1	Top5
DeiT-T	Full Precision	32/32/32	72.21	91.13
	Our w i-softmax n=30	8/8/4	71.03	90.45
	Our w i-softmax n=15	8/8/4	71.15	90.46
	Our w i-softmax n=40	8/8/4	70.95	90.48
	Our w i-softmax n=20	8/8/4	70.87	94.75
	Our w i-softmax n=10	8/8/4	70.96	90.34
DeiT-S	Full Precision	32/32/32	79.85	94.98
	Our w i-softmax n=30	8/8/4	78.40	94.27
	Our w i-softmax n=15	8/8/4	78.46	94.26
	Our w i-softmax n=40	8/8/4	78.440	94.36
	Our w i-softmax n=20	8/8/4	78.444	94.41
	Our w i-softmax n=10	8/8/4	78.85	94.47
DeiT-B	Full Precision	32/32/32	81.84	95.59
	Our w i-softmax n=30	8/8/4	81.01	95.13
	Our w i-softmax n=15	8/8/4	80.94	95.13
	Our w i-softmax n=40	8/8/4	80.92	95.17
	Our w i-softmax n=20	8/8/4	80.93	95.15
	Our w i-softmax n=10	8/8/4	80.90	95.23
ViT-B	Full Precision	32/32/32	84.54	97.32
	Our w i-softmax n=30	8/8/4	82.54	96.57
	Our w i-softmax n=15	8/8/4	82.482	96.50
	Our w i-softmax n=40	8/8/4	82.476	96.50
	Our w i-softmax n=20	8/8/4	82.69	96.58
	Our w i-softmax n=10	8/8/4	82.62	96.58
ViT-L	Full Precision	32/32/32	85.81	97.82
	Our w i-softmax n=30	8/8/4	84.90	97.41
	Our w i-softmax n=15	8/8/4	84.84	97.34
	Our w i-softmax n=40	8/8/4	84.79	97.36
	Our w i-softmax n=20	8/8/4	84.90	97.41
	Our w i-softmax n=10	8/8/4	84.75	97.38
Swin-T	Full Precision	32/32/32	81.35	95.53
	Our w i-softmax n=30	8/8/4	80.44	95.16
	Our w i-softmax n=15	8/8/4	80.39	95.16
	Our w i-softmax n=40	8/8/4	80.33	95.16
	Our w i-softmax n=20	8/8/4	80.42	95.21
	Our w i-softmax n=10	8/8/4	80.49	95.19
Swin-S	Full Precision	32/32/32	83.22	96.31
	Our w i-softmax n=30	8/8/4	82.63	96.13
	Our w i-softmax n=15	8/8/4	82.61	96.10
	Our w i-softmax n=40	8/8/4	82.64	96.10
	Our w i-softmax n=20	8/8/4	82.71	96.15
	Our w i-softmax n=10	8/8/4	82.71	96.14
Swin-B	Full Precision	32/32/32	83.59	96.46
	Our w i-softmax n=30	8/8/4	82.798	96.132
	Our w i-softmax n=15	8/8/4	82.802	96.134
	Our w i-softmax n=40	8/8/4	82.76	96.11
	Our w i-softmax n=20	8/8/4	82.77	96.15
	Our w i-softmax n=10	8/8/4	82.67	96.02

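We found that accuracy tends to be better when n < 30. This is somewhat counterintuitive, because a small n truncates the original features.

You could first try reducing n to fit the bit width and check how it affects accuracy.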

tpoisonooo commented 2 years ago

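Ah, I wrote a C++ version of LIS, and its unit tests match the Python version (I ran 10,000 test cases, so I think that counts as matching...). The C++ version is here: https://github.com/tpoisonooo/cpp-syntactic-sugar/tree/master/log-int-softmax

However, once the C++ LIS is put into the ViT mha, the error grows layer by layer. The error at mha0 is small at first, but by mha4 the values are clearly wrong. My model only quantizes mha; every other operator is pure floating point.

The version whose error grows is forward_int8 here, which uses the int8 << uint4 and find_first_one operations described in the paper: https://github.com/tpoisonooo/ncnn/blob/aa6e7918c655b33b469c3b7b9f19d86ac820057d/src/layer/multiheadattention.cpp#L369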
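To make the two operations named above concrete, the sketch below shows one way a `find_first_one` (position of the highest set bit) can yield a rounded log2, and how a 4-bit exponent turns a multiply by 2**-k into a shift. The function names, the rounding rule, and the choice of 15 as the shift base are assumptions for illustration; this is not the ncnn or FQ-ViT code.

```python
def find_first_one(v: int) -> int:
    """Index of the highest set bit of a positive integer, i.e. floor(log2(v))."""
    if v <= 0:
        raise ValueError("v must be positive")
    return v.bit_length() - 1


def log2_round(v: int) -> int:
    """Rounded log2 using only bit operations: rounds up when v >= 1.5 * 2**msb."""
    msb = find_first_one(v)
    # The bit just below the leading one decides whether to round up.
    if msb > 0 and (v >> (msb - 1)) & 1:
        return msb + 1
    return msb


def shift_mul(v_int8: int, k_uint4: int, n_bits: int = 15) -> int:
    """Compute v * 2**-k as v << (n_bits - k); the caller rescales by 2**-n_bits."""
    if not 0 <= k_uint4 <= n_bits:
        raise ValueError("k out of range")
    return v_int8 << (n_bits - k_uint4)


print(log2_round(5), log2_round(6))   # 2 3
print(shift_mul(-3, 2) / 2**15)       # -0.75, same as -3 * 2**-2
```

Comparing such helpers element by element against the Python reference at each layer is one way to locate where the values first diverge.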

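After several days I really could not fix it, so I switched to an int8 mha without LIS that quantizes only the GEMMs inside, and that works fine.... The working code is forward_int8_v2: https://github.com/tpoisonooo/ncnn/blob/aa6e7918c655b33b469c3b7b9f19d86ac820057d/src/layer/multiheadattention.cpp#L609

I am just not good enough; a whole week wasted~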

tpoisonooo commented 2 years ago

Well, I went back to forward_int8 and changed N to 10, and the values are much better!

Floating-point ground truth (GT)

softmax result: 65 0.98xxxx

Only mha quantized, N=10

softmax result: 65 0.820818

Only mha quantized, N=30 (the case I could not fix after several days)

softmax result: 646 0.152285

tpoisonooo commented 2 years ago

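First of all, I write C++/kernels; I do not train models.

Whether the quantization is correct is judged entirely by eyeballing the values... a screen full of numbers.

Here is a summary of my subjective impressions after nearly a week of staring at values; they are not necessarily right:

1) LIS feels unreliable, because the LIS outputs do not sum to 1.0. After 12 rounds of MHA the result is bound to be scaled.
2) The mha input first goes through the q/k/v affine layers, and the error diff<affine_v_int8, affine_v_fp32> seems somewhat larger than the corresponding error for q and k.

The reason is that a few values in affine_v_int8_dequant have flipped sign, which I did not observe for q and k. In a GEMM, a sign change is far more damaging than a change in magnitude.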
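Point 1 can be illustrated with a small float-domain stand-in: if each softmax probability is replaced by the nearest power of two 2**-k, with k stored in 4 bits, the row generally no longer sums to 1. This is a simplified model and not the integer LIS pipeline; the function name and the sample probabilities are hypothetical.

```python
import math


def log2_quantize(probs, bits=4):
    """Map each probability p to 2**-k with k = round(-log2 p), clipped to `bits` bits."""
    k_max = 2**bits - 1
    ks = [min(max(round(-math.log2(p)), 0), k_max) for p in probs]
    return ks, [2.0**-k for k in ks]


probs = [0.7, 0.2, 0.1]          # a valid softmax row, sums to 1.0
ks, deq = log2_quantize(probs)
print(ks)                        # [1, 2, 3]
print(sum(deq))                  # 0.875, not 1.0
```

Whether the missing mass matters in practice depends on how later layers normalize their inputs; the sketch only shows that the row sum is not preserved.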

Longday0923 commented 2 years ago

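Hello! I am one of the authors of FQ-ViT and am currently an intern in the quantization group at Megvii AIC; in the paper I was mainly responsible for the LIS part. Would it be convenient for you to connect on WeChat so we can discuss in detail? On July 1 I sent an email containing my WeChat ID to the address listed on your GitHub profile.

Thank you very much for your contribution on the deployment side. This is actually the part we felt was missing from our work. In theory we believed the advantages of "integer arithmetic" and "shift operations" could be exploited, but at the time we did not have the ability to implement and benchmark them on real hardware. I would like to talk more with you and learn from your experience.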

megvii-research / FQ-ViT

log_int_softmax int64 issue #21
