microsoft / TransformerCompression

Code for compression methods for transformers, accompanying our publications
MIT License

Refactor QuaRot [WIP] #137

Closed by nailimixaM 4 months ago

nailimixaM commented 4 months ago

This structure allows us to swap out modules like the QuarotFP16* modules (which emulate INTX quantization in FP16) for real quantized modules backed by CUDA kernels, such as Linear4bit.

It also gets rid of the monkeypatching entirely and is much clearer for model implementers, though for now it requires a fair bit of implementation work.
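The swap-out idea above can be sketched as two module classes sharing one interface, so a model builder picks either the FP16 emulation or a kernel-backed implementation without monkeypatching. This is a minimal dependency-free sketch; all names (`FakeQuantLinear4bit`, `KernelLinear4bit`, `quantize_sym`, `build_linear`) are illustrative, not the repository's actual API.

```python
def quantize_sym(w, bits=4):
    """Symmetric fake quantization: snap values to the INT grid, keep float storage."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 7 for 4-bit
    scale = max(abs(x) for x in w) / qmax or 1.0
    return [round(x / scale) * scale for x in w]

class FakeQuantLinear4bit:
    """Emulates INT4 arithmetic in floating point (the QuarotFP16-style module)."""
    def __init__(self, weights):
        self.weights = quantize_sym(weights)

    def forward(self, x):
        # Plain float dot product over fake-quantized weights.
        return sum(w * xi for w, xi in zip(self.weights, x))

class KernelLinear4bit:
    """Stand-in for a real CUDA-kernel-backed module with the same interface."""
    def __init__(self, weights):
        # A real implementation would pack weights into INT4 storage here.
        self.weights = quantize_sym(weights)

    def forward(self, x):
        # A real implementation would dispatch to a CUDA kernel here.
        return sum(w * xi for w, xi in zip(self.weights, x))

def build_linear(weights, use_kernel=False):
    """Model implementers choose the backend; the rest of the model is unchanged."""
    cls = KernelLinear4bit if use_kernel else FakeQuantLinear4bit
    return cls(weights)
```

Because both classes expose the same constructor and `forward`, swapping backends is a one-line change at model-build time rather than a patch applied to an already-constructed model.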

nailimixaM commented 4 months ago

@sashkboos let me know what you think of this, especially if I'm missing something: currently getting NaN ppls.