Open · LLukas22 opened this issue 1 year ago
> I was curious if it is currently possible to point the embedding server to a "*.gguf" file
No, that's not possible to do yet. It was removed from the first release as I want to do more tests.
> and should perform better on a CPU-only server.
That's the claim I want to explore. In my tests, some models performed way worse, and it's not 100% clear why.
> the quantized BERT model doesn't currently support batched inputs, which probably hurts throughput quite a bit
I am not sure it has any effect. You are already compute-bound without batching, so it shouldn't really impact throughput.
But overall, that's definitely something we want to add. I just want to make sure we understand the advantages and drawbacks beforehand.
> That's the claim I want to explore. In my tests, some models performed way worse, and it's not 100% clear why.
I would greatly appreciate any benchmarks/tests of the different quantization schemes on a bigger embedding dataset. I only played around with Q8 and Q5K quantizations on very small datasets, so it's likely my findings are simply wrong.
> I am not sure it has any effect. You are already compute-bound without batching, so it shouldn't really impact throughput.
Good point, I just thought that batching would enable rayon to better distribute the matmul_t work across the available threads. But I'm not nearly knowledgeable enough to be sure about any of this.
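For illustration, here is the kind of thing I have in mind, a toy row-parallel A * Bᵀ (not the actual candle kernel): with batched inputs the activation matrix has more rows, so rayon gets more independent chunks to spread across threads.

```rust
use rayon::prelude::*;

// Toy sketch only: C = A * B^T with `a` of shape (m, k) and `b` of shape
// (n, k), both row-major. Rayon splits the work per output row, so a bigger
// batch (more rows in `a`) gives it more independent chunks to hand out.
fn matmul_t_naive(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    let mut c = vec![0.0f32; m * n];
    c.par_chunks_mut(n) // one output row per parallel task
        .enumerate()
        .for_each(|(i, row)| {
            let a_row = &a[i * k..(i + 1) * k];
            for (j, out) in row.iter_mut().enumerate() {
                let b_row = &b[j * k..(j + 1) * k];
                *out = a_row.iter().zip(b_row).map(|(x, y)| x * y).sum::<f32>();
            }
        });
    c
}
```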
Any updates on this?
Hi! Any updates on quant support? I can help with testing. I'm not a specialist on embeddings, but we could define a metric for evaluation, e.g. some kind of distribution of distances between the embedding vectors of the original model and the quantized model, or another metric for this use case.
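For example, something along these lines (just a standalone sketch, not tied to this repo's code), comparing the embeddings the full-precision and the quantized model produce for the same inputs:

```rust
// Sketch of one possible metric: cosine similarity between the embedding the
// original model and the quantized model produce for the same input text.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

// Mean and worst-case similarity over an evaluation set; the shape of that
// distribution would show how much the quantization shifts the vectors.
fn similarity_stats(original: &[Vec<f32>], quantized: &[Vec<f32>]) -> (f32, f32) {
    let sims: Vec<f32> = original
        .iter()
        .zip(quantized)
        .map(|(o, q)| cosine_similarity(o, q))
        .collect();
    let mean = sims.iter().sum::<f32>() / sims.len() as f32;
    let min = sims.iter().copied().fold(f32::INFINITY, f32::min);
    (mean, min)
}
```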
I'm really looking forward to this feature.
Feature request
Since I spotted bert_quant.rs in the candle backend, I was curious whether it is currently possible to point the embedding server to a "*.gguf" file and load a quantized model.
Motivation
Quantized models are often much smaller and should perform better on a CPU-only server. I recently played around a bit with quantized BERT-like models and achieved a 5x reduction in model size with only marginal hits to model quality. I guess it would also be a good thing for "serverless" deployments, as a 25MB model should be simpler to distribute than a 200MB one 🤔.
Your contribution
On a second look, the quantized BERT model doesn't currently support batched inputs, which probably hurts throughput quite a bit. I guess that is caused by matmul_t not natively supporting batched inputs, which could be solved by reshaping the inputs.
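Something like the following is what I had in mind (a rough sketch with a hypothetical helper, not the actual backend code):

```rust
use candle_core::{Result, Tensor};

// Hypothetical helper, not the actual backend code: flatten the batch and
// sequence dims into one, run the 2D-only transposed matmul, then restore
// the original shape.
fn batched_matmul_t(
    xs: &Tensor,                                     // (batch, seq, hidden)
    matmul_t_2d: impl Fn(&Tensor) -> Result<Tensor>, // only handles (rows, hidden)
    out_dim: usize,
) -> Result<Tensor> {
    let (batch, seq, hidden) = xs.dims3()?;
    let flat = xs.reshape((batch * seq, hidden))?; // merge batch dims
    let out = matmul_t_2d(&flat)?;                 // (batch * seq, out_dim)
    out.reshape((batch, seq, out_dim))             // split them back out
}
```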