What does this PR do?
This adds a modified Marlin fp16/int4 kernel to the library and creates two new QTensor subclasses to use it:

- MarlinInt4PackedTensor,
- MarlinInt4WeightQBitsTensor.
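The packed tensor stores the int4 weights together with group-wise scales and zero-points. The snippet below is a minimal, self-contained sketch of that fp16/int4 group quantization scheme in plain PyTorch; the helper names and the group size of 128 are illustrative assumptions, not the actual MarlinInt4PackedTensor / MarlinInt4WeightQBitsTensor API.

```python
import torch

def quantize_int4(w: torch.Tensor, group_size: int = 128):
    # Hypothetical helper: each group of `group_size` input features
    # shares one scale and one zero-point (4-bit unsigned range [0, 15]).
    out_features, in_features = w.shape
    g = w.reshape(out_features, in_features // group_size, group_size)
    wmin = g.amin(dim=-1, keepdim=True)
    wmax = g.amax(dim=-1, keepdim=True)
    scale = (wmax - wmin) / 15
    zero = torch.round(-wmin / scale).clamp_(0, 15)  # per-group zero-point
    q = torch.round(g / scale + zero).clamp_(0, 15).to(torch.uint8)
    return q.reshape(w.shape), scale, zero

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, zero: torch.Tensor,
                    group_size: int = 128) -> torch.Tensor:
    # Reconstruct an approximation of the original weights.
    g = q.reshape(q.shape[0], -1, group_size).float()
    return ((g - zero) * scale).reshape(q.shape)

w = torch.randn(256, 512)  # (out_features, in_features)
q, scale, zero = quantize_int4(w)
err = (w - dequantize_int4(q, scale, zero)).abs().max()
print(f"max abs reconstruction error: {err:.4f} (largest group scale: {scale.max():.4f})")
```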
There are issues reading back the weights, scales and zero-points as soon as parallelization increases: output features beyond 128 are corrupted once a sufficient number of inputs are processed in parallel.
As a consequence, the AWQ kernel is still used, despite its lower performance as the number of tokens increases.
The code is nevertheless merged as is, and #332 has been created to investigate the issue.