What does this PR do?
This adds a modified Marlin fp16/int4 kernel to the library and creates two new QTensor subclasses to use it:

- MarlinInt4PackedTensor,
- MarlinInt4WeightQBitsTensor.
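The packed tensor stores the int4 weights together with group-wise scales and zero-points. The snippet below is a minimal, self-contained sketch of that fp16/int4 group quantization scheme in plain PyTorch; the helper names and the group size of 128 are illustrative assumptions, not the actual MarlinInt4PackedTensor / MarlinInt4WeightQBitsTensor API.

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    # Hypothetical helper: each group of `group_size` input features
    # shares one scale and one zero-point (4-bit unsigned range [0, 15]).
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    wmin = g.amin(dim=-1, keepdim=True)
    wmax = g.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin) / 15
    zero = torch.round(-wmin / scale).clamp_(0, 15)  # per-group zero-point
    q = torch.round(g / scale + zero).clamp_(0, 15).to(torch.uint8)
    return q.reshape(w.shape), scale, zero

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor,
                    group_size: int = 128) -> torch.Tensor:
    # Reconstruct an approximation of the original weights.
    g = q.reshape(q.shape[0], -1, group_size).float()
    return ((g - zero) * scale).reshape(q.shape)

w = torch.randn(256, 512)  # (out_features, in_features)
q, scale, zero = quantize_int4(w)
err = (w - dequantize_int4(q, scale, zero)).abs().max()
print(f"max abs reconstruction error: {err:.4f} (largest group scale: {scale.max():.4f})")
```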
There are issues reading back the weights, scales and zero-points as soon as parallelization increases: output features beyond 128 are corrupted once a sufficient number of inputs are processed in parallel.
As a consequence, the AWQ kernel is still used, despite its lower performance as the number of tokens increases.
The code is nevertheless merged as is, and #332 has been created to investigate the issue.