Adds tensorization of compute for AVX512 (depends on https://github.com/facebookexperimental/tvm/pull/7).
Reshape of input/output to handle input with > 2 dims.
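A minimal numpy sketch of that reshape (shapes here are illustrative, not taken from the PR):

```python
import numpy as np

x = np.ones((2, 3, 8))            # input with > 2 dims
w = np.ones((8, 16))
x2d = x.reshape(-1, x.shape[-1])  # collapse leading dims to a 2-D matmul
y = (x2d @ w).reshape(*x.shape[:-1], w.shape[1])  # restore the leading dims
```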
Weight reshape that depends on the first dim of the weight: if weight.shape[0] > 16 we expect it to be a multiple of 16 and pack the weight as NK16n4k; if weight.shape[0] < 16 we pack it as NK{weight.shape[0]}n4k.
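The packing above can be sketched with numpy (a sketch only, not the actual TVM schedule; `pack_weight` and the block-size handling are illustrative):

```python
import numpy as np

def pack_weight(weight, k_block=4):
    """Pack an (N, K) weight into NK16n4k layout (NK{N}n4k when N < 16)."""
    N, K = weight.shape
    n_block = 16 if N >= 16 else N
    assert N % n_block == 0 and K % k_block == 0
    # result shape: (N // n_block, K // k_block, n_block, k_block)
    return (weight.reshape(N // n_block, n_block, K // k_block, k_block)
                  .transpose(0, 2, 1, 3)
                  .copy())
```

So `packed[no, ko, ni, ki]` holds `weight[no * 16 + ni, ko * 4 + ki]`, which keeps each 16n4k tile contiguous for the AVX512 microkernel.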
If you expand the quantized matmul, i.e. the sum over k of (lhs[k] - lhs_zero_point) * (rhs[k] - rhs_zero_point), the last term you get is
lhs_zero_point*rhs_zero_point*k.
However, fbgemm does not have this addition. It is not yet clear why, or whether something else is missed; we will follow up on this. So this is one bug that is fixed.
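The expansion can be checked numerically (a sketch; the sizes and zero points below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 64
lhs = rng.integers(0, 255, k).astype(np.int32)     # e.g. uint8 activations
rhs = rng.integers(-127, 127, k).astype(np.int32)  # e.g. int8 weights
lhs_zero_point, rhs_zero_point = 3, 5

direct = np.sum((lhs - lhs_zero_point) * (rhs - rhs_zero_point))
expanded = (np.sum(lhs * rhs)
            - rhs_zero_point * np.sum(lhs)
            - lhs_zero_point * np.sum(rhs)
            + lhs_zero_point * rhs_zero_point * k)  # the term in question
assert direct == expanded
```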
The second bug is more subtle. In input data quantization fbgemm uses rounding, whereas we did a cast. Casting by default truncates, so whenever our quantized value differs from fbgemm's it is always smaller by 1. When the k dim is large these differences accumulate, and once the scale is applied the error grows further. The solution was to use tvm::round; however, this costs us 10% in perf. Debugging is under way.
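The truncation-vs-rounding difference can be shown with a small numpy sketch (the scale, zero point, and input values are illustrative; `quantize_cast`/`quantize_round` are not the PR's actual functions):

```python
import numpy as np

def quantize_cast(x, scale, zero_point):
    # plain cast truncates toward zero -- the old (buggy) behavior
    return (x / scale + zero_point).astype(np.int32)

def quantize_round(x, scale, zero_point):
    # round to nearest first, matching fbgemm (tvm::round on the TVM side)
    return np.round(x / scale + zero_point).astype(np.int32)

x = np.array([0.3, 0.8, 1.3])
trunc = quantize_cast(x, 0.5, 0)     # -> [0, 1, 2]
nearest = quantize_round(x, 0.5, 0)  # -> [1, 2, 3]
```

Each truncated value is one less than the rounded one, which is exactly the per-element error that accumulates over the k dim.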
The benchmark comparison is against the fbgemm implementation. Without the rounding we are 10% faster; with rounding we are on par. We need to figure out why we lose perf due to rounding.
Depends on this PR: https://github.com/facebookexperimental/tvm/pull/7