qdrant / fastembed

Fast, Accurate, Lightweight Python library to make State of the Art Embedding
https://qdrant.github.io/fastembed/
Apache License 2.0

Quantization Investigation #126

Status: Open. NirantK opened this issue 6 months ago.

NirantK commented 6 months ago

Consider this model from Xenova: its quantized version is about 120MB, compared with the 440-450MB model I get from Optimum's O3 optimization level.
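The size gap is roughly what weight precision alone would predict. Optimum's O1-O3 levels are graph optimizations (operator fusion, plus a GELU approximation at O3) and keep fp32 weights at 4 bytes per parameter, while an int8-quantized model stores most weights in 1 byte. The sketch below assumes a BERT-base-sized encoder of about 110M parameters; that count is an illustrative assumption, since the exact model is not named here.

```python
def approx_size_mb(num_params: int, bytes_per_param: int) -> float:
    """Rough on-disk size of the weights alone, ignoring graph and metadata overhead."""
    return num_params * bytes_per_param / 1e6


# Hypothetical parameter count for a BERT-base-sized encoder (assumption).
NUM_PARAMS = 110_000_000

fp32_mb = approx_size_mb(NUM_PARAMS, 4)  # O1-O3 keep fp32 weights
int8_mb = approx_size_mb(NUM_PARAMS, 1)  # int8-quantized weights

print(f"fp32: {fp32_mb:.0f} MB, int8: {int8_mb:.0f} MB")  # fp32: 440 MB, int8: 110 MB
```

A quantized file somewhat above the int8 estimate (such as 120MB) would be consistent with some tensors, for example biases or operators not selected for quantization, staying in fp32, but that is a guess and has not been checked here.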

Check whether the quantized model is as good as the 450MB O3 model, which matches the reference within an atol of 1e-3 (O2 matches within 1e-4), or whether something else is going on.
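A minimal sketch of that comparison, assuming you already have embeddings from the reference model and the candidate model for the same input batch as NumPy arrays (running each ONNX model is left out). `compare_embeddings` is a hypothetical helper, not a fastembed API. Besides the element-wise atol check, it reports the worst-case cosine similarity, which is arguably the more relevant number for embeddings that are compared by cosine similarity.

```python
import numpy as np


def compare_embeddings(reference, candidate, atol: float) -> dict:
    """Compare two (batch, dim) embedding arrays produced from the same inputs."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)

    # Largest element-wise deviation, and whether every element is within atol
    # (rtol=0 so that only the absolute tolerance matters).
    max_abs_diff = float(np.max(np.abs(reference - candidate)))
    within_atol = bool(np.allclose(candidate, reference, atol=atol, rtol=0.0))

    # Row-wise cosine similarity; the minimum is the worst-matching input.
    ref_n = reference / np.linalg.norm(reference, axis=-1, keepdims=True)
    cand_n = candidate / np.linalg.norm(candidate, axis=-1, keepdims=True)
    min_cosine = float(np.min(np.sum(ref_n * cand_n, axis=-1)))

    return {"max_abs_diff": max_abs_diff, "within_atol": within_atol, "min_cosine": min_cosine}


ref = np.ones((2, 4), dtype=np.float32)
cand = ref + np.float32(5e-4)
print(compare_embeddings(ref, cand, atol=1e-3))
```

Running the same check at several tolerances (1e-3, 1e-4) makes it easy to see which threshold a given model actually meets.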

NirantK commented 6 months ago

See the Optimum guide to static and dynamic quantization: https://huggingface.co/docs/optimum/onnxruntime/usage_guides/quantization
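As a rough illustration of the difference the guide describes, here is a plain NumPy sketch of int8 quantization for a single linear layer. It is not Optimum's or onnxruntime's implementation: it uses symmetric per-tensor int8 everywhere for simplicity, whereas onnxruntime's dynamic quantization, for instance, quantizes activations to uint8 with a zero point. In both modes the weights are quantized once ahead of time; the difference is when the activation scale is chosen.

```python
import numpy as np


def quantize_symmetric(x: np.ndarray):
    """Symmetric per-tensor int8 quantization: x is approximated by scale * q."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def int8_matmul(x_q, x_scale, w_q, w_scale):
    # Accumulate in int32 to avoid int8 overflow, then rescale back to float.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return acc.astype(np.float32) * np.float32(x_scale * w_scale)


def dynamic_linear(x, w_q, w_scale):
    # Dynamic: the activation scale is computed from this very input at run time.
    x_q, x_scale = quantize_symmetric(x)
    return int8_matmul(x_q, x_scale, w_q, w_scale)


def static_linear(x, w_q, w_scale, x_scale):
    # Static: the activation scale was fixed beforehand from calibration data,
    # so inputs outside the calibrated range are clipped.
    x_q = np.clip(np.round(x / x_scale), -127, 127).astype(np.int8)
    return int8_matmul(x_q, x_scale, w_q, w_scale)


rng = np.random.default_rng(0)
w = rng.standard_normal((16, 8)).astype(np.float32)
x = rng.standard_normal((4, 16)).astype(np.float32)
calibration = rng.standard_normal((64, 16)).astype(np.float32)

w_q, w_scale = quantize_symmetric(w)  # weights are quantized once, offline
_, calib_scale = quantize_symmetric(calibration)

ref = x @ w
print("int8 weights are smaller by a factor of", w.nbytes // w_q.nbytes)
print("dynamic max error:", np.max(np.abs(dynamic_linear(x, w_q, w_scale) - ref)))
print("static max error:", np.max(np.abs(static_linear(x, w_q, w_scale, calib_scale) - ref)))
```

Dynamic quantization needs no calibration data but pays for computing activation scales on every call; static quantization avoids that cost but depends on the calibration set covering the real input range.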

NirantK commented 6 months ago

From @Xenova, this script traverses the model graph and collects the operators to quantize: https://github.com/xenova/transformers.js/blob/main/scripts/convert.py