illuin-tech / colpali

The code used to train and run inference with the ColPali architecture.
https://huggingface.co/vidore
MIT License
1.03k stars 93 forks

Model quantization #21

Closed sky-2002 closed 2 months ago

sky-2002 commented 2 months ago

Hey @ManuelFay and team, great work, the ColPali model works very well on image retrieval. I wanted to know how to create a quantized version of this model. Any instructions or scripts?

ManuelFay commented 2 months ago

Hello! Great question! For the moment we haven't tested pure quantization of the model weights much. We/other people did however try inference optimization techniques such as token pooling and embedding binarization. Both work great, and combining them can cut the memory footprint by up to 96x at negligible cost! https://twitter.com/jobergum/status/1826682421498003722
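To make the binarization idea concrete, here is a minimal NumPy sketch (not the exact code behind the linked results): each float32 embedding dimension is reduced to its sign bit and packed into bytes, a 32x size reduction per vector, and similarity becomes a Hamming-style bit-agreement count. The array shapes and the scoring function are illustrative assumptions.

```python
import numpy as np

def binarize(emb: np.ndarray) -> np.ndarray:
    """Keep only the sign of each dimension and pack bits into uint8.

    A (n_tokens, dim) float32 matrix becomes (n_tokens, dim // 8) uint8,
    i.e. 32x smaller."""
    return np.packbits(emb > 0, axis=-1)

def hamming_sim(a: np.ndarray, b: np.ndarray) -> int:
    """Similarity = number of bits on which two packed vectors agree."""
    xor = np.bitwise_xor(a, b)
    return a.size * 8 - int(np.unpackbits(xor).sum())

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 128)).astype(np.float32)  # e.g. query token embeddings
q_bin = binarize(q)

print(q.nbytes // q_bin.nbytes)  # → 32
```

In a multi-vector setup like ColPali you would plug a bitwise score like this into the late-interaction (MaxSim) loop in place of the float dot product, typically rescoring the top candidates with the full-precision embeddings afterwards.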

Having said that, it's a Hugging Face model, so to quantize it you can basically just load it in a lower precision, or use the standard HF quantization tooling like you would for any other model. You should probably measure the performance drop if you do this though! If you run any tests, let us know!
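As a rough sketch of what "load it in lower precision" means with the standard Hugging Face arguments (the model id, import path, and class name here are assumptions; check the repo's own loading code for the exact ones):

```python
import torch
from transformers import BitsAndBytesConfig
from colpali_engine.models import ColPali  # assumed import path

# Option 1: half-precision weights.
model = ColPali.from_pretrained(
    "vidore/colpali",            # assumed checkpoint id
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Option 2: 4-bit quantization via bitsandbytes (requires `bitsandbytes`).
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = ColPali.from_pretrained(
    "vidore/colpali",
    quantization_config=quant_config,
    device_map="auto",
)
```

Either way, rerun a retrieval benchmark afterwards to quantify the accuracy drop before deploying the quantized model.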

Cheers, Manu