huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Are quantized models supported yet? #7

Open LLukas22 opened 1 year ago

LLukas22 commented 1 year ago

Feature request

Since I spotted bert_quant.rs in the candle backend, I was curious if it is currently possible to point the embedding server to a "*.gguf" file and load a quantized model.

Motivation

Quantized models are often much smaller and should perform better on a CPU-only server. I recently experimented a bit with quantized BERT-like models and achieved a 5x reduction in model size with only marginal hits to model performance. I guess it would also be a good fit for "serverless" deployments, as a 25 MB model should be simpler to distribute than a 200 MB one 🤔.

Your contribution

On second look, the quantized BERT model doesn't currently support batched inputs, which probably hurts throughput quite a bit. I guess that is caused by matmul_t not natively supporting batched inputs, which could be solved by reshaping the inputs.
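Roughly what I have in mind (a rough sketch assuming candle's `QMatMul` as the quantized matmul; `batched_qmatmul` is just an illustrative helper, not code from this repo):

```rust
use candle_core::quantized::QMatMul;
use candle_core::{Result, Tensor};

/// Hypothetical helper: run a 2-D quantized matmul over a batched input
/// by collapsing the batch and sequence dimensions into one, then
/// restoring the original layout afterwards.
fn batched_qmatmul(qmatmul: &QMatMul, xs: &Tensor) -> Result<Tensor> {
    // xs is expected to have shape (batch, seq_len, hidden).
    let (b, s, h) = xs.dims3()?;
    // Flatten to (batch * seq_len, hidden) so a 2-D-only matmul accepts it.
    let flat = xs.reshape((b * s, h))?;
    let out = qmatmul.forward(&flat)?;
    // Restore (batch, seq_len, out_features).
    let out_features = out.dim(1)?;
    out.reshape((b, s, out_features))
}
```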

OlivierDehaene commented 1 year ago

I was curious if it is currently possible to point the embedding server to a "*.gguf" file

No, that's not possible to do yet. It was removed from the first release because I want to do more tests.

and should perform better on a cpu only server.

That's the claim I want to explore. In my tests, some models performed much worse, and it's not 100% clear why.

[quant model] doesn't currently support batched inputs, which probably hurts throughput quite a bit

I am not sure it has any effect. You are already compute-bound without batching, so it shouldn't really impact throughput.

OlivierDehaene commented 1 year ago

But overall, that's definitely something we want to add. I just want to make sure we understand the advantages and drawbacks beforehand.

LLukas22 commented 1 year ago

That's the claim I want to explore. In my tests, some models performed much worse, and it's not 100% clear why.

I would greatly appreciate any benchmarks/tests of the different quantization schemes on a bigger embedding dataset. I only played around with Q8 and Q5K quantization on very small datasets, so it's likely my findings are simply wrong.

I am not sure it has any effect. You are already compute-bound without batching, so it shouldn't really impact throughput.

Good point. I just thought that batching would enable rayon to distribute the matmul_t work better across the available threads. But I'm not nearly knowledgeable enough to be sure about any of this.

Isaac4real commented 7 months ago

Any updates on this?

puppetm4st3r commented 4 months ago

Hi! Any updates on quantization support? I can help with testing. I'm not a specialist in embeddings, but we could define an evaluation metric for this use case, e.g. some distribution of distances between the embedding vectors produced by the original model and by the quantized model, or another metric that fits.
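Something like this rough sketch, for example (plain Rust, hypothetical helper functions, not part of TEI): embed the same texts with the original and the quantized model, then compare the paired vectors with cosine similarity and look at the distribution.

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b.iter()).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

/// Mean and worst-case cosine similarity between paired embeddings
/// produced by the original model and the quantized model.
fn similarity_stats(original: &[Vec<f32>], quantized: &[Vec<f32>]) -> (f32, f32) {
    let sims: Vec<f32> = original
        .iter()
        .zip(quantized.iter())
        .map(|(o, q)| cosine_similarity(o, q))
        .collect();
    let mean = sims.iter().sum::<f32>() / sims.len() as f32;
    let min = sims.iter().copied().fold(f32::INFINITY, f32::min);
    (mean, min)
}
```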

wencan commented 3 months ago

I'm really looking forward to this feature.