rustformers / llm

[Unmaintained, see README] An ecosystem of Rust libraries for working with large language models
https://docs.rs/llm/latest/llm/
Apache License 2.0

Better embedding extraction #295

Open LLukas22 opened 1 year ago

LLukas22 commented 1 year ago

As pointed out in https://github.com/rustformers/llm/pull/291, the quality of embeddings produced by the models at present appears to be suboptimal.

Our current approach uses the embedding of the final token as a representation for the entire input sequence, which can discard semantic information from earlier tokens. The approach employed by SGPT: GPT Sentence Embeddings for Semantic Search offers an alternative: it uses position-weighted mean pooling to combine the embeddings of all tokens in the input sequence. On the MTEB benchmark, this method produces markedly better embeddings.
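For reference, a minimal sketch of SGPT-style position-weighted mean pooling. This is not code from llm; the function name and the `Vec<f32>`-per-token representation are assumptions for illustration. Token i (1-indexed) receives weight i / (1 + 2 + … + n), so later tokens, which have attended to more context, contribute more:

```rust
/// Position-weighted mean pooling over per-token embeddings (SGPT-style).
/// `token_embeddings` holds one embedding vector per input token; all
/// vectors must share the same dimensionality.
fn weighted_mean_pool(token_embeddings: &[Vec<f32>]) -> Vec<f32> {
    let n = token_embeddings.len();
    assert!(n > 0, "need at least one token embedding");
    let dim = token_embeddings[0].len();
    // Normalising constant: the sum of positions 1..=n.
    let total = (n * (n + 1) / 2) as f32;
    let mut pooled = vec![0.0f32; dim];
    for (i, emb) in token_embeddings.iter().enumerate() {
        // 1-indexed position weight: later tokens weigh more.
        let weight = (i + 1) as f32 / total;
        for (p, v) in pooled.iter_mut().zip(emb) {
            *p += weight * v;
        }
    }
    pooled
}
```

With two 1-dimensional embeddings `[1.0]` and `[4.0]`, the weights are 1/3 and 2/3, giving a pooled value of 3.0.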

This raises the question: should we integrate this method into our implementation? Or should we leave it to users to extract the embeddings for each token and perform the pooling themselves?

philpax commented 1 year ago

Good catch! I think we should integrate this, but separate it from the existing embeddings. I'm also not sure how we best expose this. Any ideas for API changes that are understandable and restricted to only where it makes sense? This would only make sense with feed_prompt, right?
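One way the API question could be framed (purely hypothetical, not an existing llm API): expose the pooling choice as an enum so the current last-token behaviour stays the default and the weighted mean is opt-in. All names here are assumptions for discussion:

```rust
/// Hypothetical strategy selector for sentence-embedding extraction.
enum Pooling {
    /// Current behaviour: use only the final token's embedding.
    LastToken,
    /// SGPT-style position-weighted mean over all token embeddings.
    WeightedMean,
}

/// Pool per-token embeddings into a single sentence embedding.
fn pool(token_embeddings: &[Vec<f32>], strategy: Pooling) -> Vec<f32> {
    let n = token_embeddings.len();
    assert!(n > 0, "need at least one token embedding");
    match strategy {
        Pooling::LastToken => token_embeddings[n - 1].clone(),
        Pooling::WeightedMean => {
            let total = (n * (n + 1) / 2) as f32;
            let mut out = vec![0.0f32; token_embeddings[0].len()];
            for (i, emb) in token_embeddings.iter().enumerate() {
                let w = (i + 1) as f32 / total;
                for (o, v) in out.iter_mut().zip(emb) {
                    *o += w * v;
                }
            }
            out
        }
    }
}
```

A parameter like this could plausibly live on whatever call gathers per-token activations (e.g. the `feed_prompt` path mentioned above), keeping the strategy out of code paths where pooling makes no sense.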