svilupp / PromptingTools.jl

Streamline your life using PromptingTools.jl, the Julia package that simplifies interacting with large language models.

https://svilupp.github.io/PromptingTools.jl/dev/

MIT License

[FR] RAG: Add support for Int8 embeddings #118

Open svilupp opened 6 months ago

svilupp commented 6 months ago

It would be great to have support for embeddings compressed to Int8 as per HuggingFace: Embedding Quantization.

A potential implementation would be to:

- Define an embedder (<:AbstractEmbedder for get_embeddings) and the corresponding finder (<:AbstractSimilarityFinder for find_similar).
  - Both would hold, alongside the vectors, the min_values and max_values fields that record the effective range of each embedding dimension (e.g., length(min_values)=length(max_values)=D).
- Define methods for these types.
- The conversion to Int8 could be done post hoc (after build_index) via a utility function, which would also produce the finder holding the ranges needed to convert queries to Int8 (to be provided to airag).
- It should implement the two-stage pass with rescore_multiplier=4 (first on Int8 embeddings, then with Float x Int8).
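
To make the steps above concrete, here is a minimal sketch in plain Python (not the package's Julia code). It assumes linear per-dimension scaling into [-128, 127] and dot-product scoring. The names fit_ranges, quantize_int8, and find_similar_int8 are hypothetical and do not come from PromptingTools.jl.

```python
def fit_ranges(embeddings):
    """Per-dimension min and max over a list of equal-length float vectors."""
    dims = range(len(embeddings[0]))
    min_values = [min(vec[d] for vec in embeddings) for d in dims]
    max_values = [max(vec[d] for vec in embeddings) for d in dims]
    return min_values, max_values


def quantize_int8(vec, min_values, max_values):
    """Linearly map each dimension from [min, max] to the Int8 range [-128, 127]."""
    out = []
    for x, lo, hi in zip(vec, min_values, max_values):
        span = hi - lo
        if span == 0:
            out.append(0)  # constant dimension: any fixed code works
            continue
        scaled = (x - lo) / span * 255 - 128
        # Clamp so that queries outside the fitted range still fit in Int8.
        out.append(max(-128, min(127, round(scaled))))
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def find_similar_int8(query, docs_int8, min_values, max_values,
                      top_k=5, rescore_multiplier=4):
    """Two-stage search: Int8 x Int8 shortlist, then Float x Int8 rescoring."""
    query_int8 = quantize_int8(query, min_values, max_values)
    # Stage 1: cheap integer scores over the whole index.
    order = sorted(range(len(docs_int8)),
                   key=lambda i: dot(query_int8, docs_int8[i]), reverse=True)
    candidates = order[: top_k * rescore_multiplier]
    # Stage 2: rescore only the shortlist with the full-precision query.
    rescored = sorted(candidates,
                      key=lambda i: dot(query, docs_int8[i]), reverse=True)
    return rescored[:top_k]


# Toy data, invented for illustration only.
docs = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5], [-0.8, -0.2]]
min_values, max_values = fit_ranges(docs)
docs_int8 = [quantize_int8(d, min_values, max_values) for d in docs]
best = find_similar_int8([0.8, 0.2], docs_int8, min_values, max_values,
                         top_k=1, rescore_multiplier=2)
```

In this sketch the index stores only the Int8 vectors plus the two range vectors. The query is quantized with the same ranges for the first pass and used in float form for the second.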

pabvald commented 2 weeks ago

I am going to take care of this one.

I would happily help to move the RAG functionality into a separate package too. Let me know if you want to move forward with that.

The package LinLogQuantization.jl has a pretty neat implementation of linear quantization to unsigned types (UInt8, UInt16, ...). Extending it to signed types would be relatively easy, but it is also more work. What do you think about first providing support for unsigned-integer embeddings and extending it to signed integers later?
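
As a rough illustration of why the signed case is a small step from the unsigned one, the sketch below (plain Python, not LinLogQuantization.jl's API; all function names are hypothetical) quantizes linearly to unsigned codes and then shifts them by half the range to obtain signed codes.

```python
def quantize_unsigned(values, bits=8):
    """Linearly map values from [min, max] onto the codes 0 .. 2**bits - 1."""
    lo, hi = min(values), max(values)
    levels = 2 ** bits - 1
    if hi == lo:
        return [0] * len(values), lo, hi
    codes = [round((x - lo) / (hi - lo) * levels) for x in values]
    return codes, lo, hi


def to_signed(codes, bits=8):
    """Shift unsigned codes into the signed range, e.g. 0..255 -> -128..127."""
    offset = 2 ** (bits - 1)
    return [c - offset for c in codes]


def dequantize_signed(codes, lo, hi, bits=8):
    """Undo the shift and the linear map to recover approximate floats."""
    levels = 2 ** bits - 1
    offset = 2 ** (bits - 1)
    return [lo + (c + offset) / levels * (hi - lo) for c in codes]


# Toy values, invented for illustration only.
values = [-1.0, -0.25, 0.5, 1.0]
unsigned_codes, lo, hi = quantize_unsigned(values)
signed_codes = to_signed(unsigned_codes)
restored = dequantize_signed(signed_codes, lo, hi)
```

The round trip loses at most half a quantization step per value, i.e. (hi - lo) / 255 / 2 for 8 bits.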

svilupp commented 2 weeks ago

I was hoping to do the RAGTools migration after we merge in the Pinecone support, I don't suppose you would be interested in finishing that?

On the Int8, cool! You can probably re-use a lot from the "bitpacked" embedder. I don't mind if it's signed or not.

On the dep addition, where do you see the benefits outweighing the costs? It's just a minor performance tweak (no big gains in any direction compared to what we have already), so I'm not sure we would need to support more than one simple implementation of this. Do you have a different view?

pabvald commented 2 weeks ago

I would like to implement this first, since I have already invested some time in it. I can take care of the Pinecone support afterwards, if that's what is stopping the RAG package from being born.

LinLogQuantization.jl implements exactly what we need and nothing else. It's a very small package (less than 300 lines of code) and I cannot see a simpler implementation of linear quantization. In my opinion, anything other than using the package would be wasted effort spent reinventing the wheel.

If you really want to avoid the dependency, we could take only the part of the package that implements linear quantization to avoid adding the code for logarithmic quantization.

pabvald commented 2 weeks ago

By the way, here's a more detailed explanation of scalar quantization. It's one of the references in the article you provided.

svilupp commented 1 week ago

Sorry for the slow response! I was at a hackathon the whole weekend.

I don't think it would be appropriate to add LLQ (with StatsBase as a dep) as a direct dependency of PromptingTools for everyone. RAG is used by only a subset of PT users; within that subset, only a few will ever look at quantization; and among those, picking Int8 is quite niche (the trade-offs are nuanced and it's probably not worth it for most).

In addition, if we'll only ever have Int8 (I don't see any benefit from having more Int versions, since there is lower-hanging fruit for performance), it's just 2-3 functions we need, so it's a very simple problem to solve directly.

If you still insist on using the LLQ package, I'd ask you to add it as an extension (weak dep). Then I'm happy to review the PR.

EDIT: If you're super excited to drive a lot more effort in the quantization space and speed up the in-memory embeddings, we could look into shaping that as a sister package that people could just import and get a bunch of different performance optimizations!

pabvald commented 1 week ago

Understood. I have extended the package to support linear quantization of signed integers (see PR). I can copy the two necessary functions to implement the Int8 index without adding the package as a dep.