huggingface / optimum-quanto

A PyTorch quantization backend for Optimum
Apache License 2.0

Add marlin int4 kernel #333

Closed: dacorvo closed this 1 month ago

dacorvo commented 1 month ago

What does this PR do?

This PR adds a modified Marlin fp16/int4 kernel to the library and introduces two new QTensor subclasses that use it.
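For context, here is a minimal sketch of how these int4 weights would typically be exercised through quanto's public `quantize`/`freeze` API. Which low-level kernel actually serves the forward pass (Marlin, AWQ, ...) is an internal dispatch decision based on device, dtype, and shapes; the layer shapes below are illustrative assumptions, not taken from this PR:

```python
import torch
from optimum.quanto import quantize, freeze, qint4

# Hypothetical shapes; any fp16 Linear on a CUDA device would do.
model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, dtype=torch.float16),
).to("cuda")

# Quantize all Linear weights to int4 and materialize the packed buffers.
quantize(model, weights=qint4)
freeze(model)

x = torch.randn(8, 4096, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = model(x)  # forward pass served by whichever int4 kernel is dispatched
```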

There are issues reading back the weights, scales, and zero-points as soon as parallelization increases: output features beyond index 128 are corrupted once a sufficient number of inputs are processed in parallel (see the sketch below for one way to surface this).
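One way the corruption could be surfaced is to diff a quantized linear against its fp16 reference and look for the first output feature whose error blows up as the token dimension grows. This is a hypothetical reproduction sketch, not the test from this PR; the layer shapes and the threshold heuristic are assumptions:

```python
import torch
from optimum.quanto import quantize, freeze, qint4

torch.manual_seed(0)
ref = torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda")
qmodel = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, dtype=torch.float16, device="cuda"),
)
qmodel[0].load_state_dict(ref.state_dict())
quantize(qmodel, weights=qint4)
freeze(qmodel)

for tokens in (1, 16, 64, 256):
    x = torch.randn(tokens, 4096, dtype=torch.float16, device="cuda")
    with torch.no_grad():
        # Max absolute error per output feature across the token dimension.
        err = (qmodel(x) - ref(x)).abs().amax(dim=0)
    # int4 quantization always has some baseline error; an error well above
    # the baseline on features >= 128 would be consistent with the readback
    # corruption described above. The 10x factor is an arbitrary heuristic.
    bad = (err > 10 * err[:128].mean()).nonzero()
    first = bad[0].item() if bad.numel() else None
    print(f"tokens={tokens}: first suspicious output feature = {first}")
```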

As a consequence, the AWQ kernel remains in use, despite its lower performance as the number of tokens increases.
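To make that performance trade-off concrete, here is a rough CUDA-event timing sketch of the kind of measurement behind the claim, timing the currently dispatched kernel as the token dimension grows. The shapes, warmup, and iteration counts are assumptions:

```python
import torch
from optimum.quanto import quantize, freeze, qint4

model = torch.nn.Sequential(
    torch.nn.Linear(4096, 4096, dtype=torch.float16),
).to("cuda")
quantize(model, weights=qint4)
freeze(model)

def bench(fn, warmup=10, iters=50):
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

for tokens in (1, 32, 128, 512):
    x = torch.randn(tokens, 4096, dtype=torch.float16, device="cuda")
    with torch.no_grad():
        ms = bench(lambda: model(x))
    print(f"tokens={tokens:4d}: {ms:.3f} ms")
```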

The code is nevertheless merged as-is, and #332 has been opened to investigate these issues.