huggingface / text-generation-inference

Large Language Model Text Generation Inference
http://hf.co/docs/text-generation-inference
Apache License 2.0

[Feature] Return embeddings #199

Open darth-veitcher opened 1 year ago

darth-veitcher commented 1 year ago

As the title indicates, I'd be interested in understanding whether this is just for text generation or whether it could also be used to expose the embedding function.

OlivierDehaene commented 1 year ago

For now it does not return the embeddings but this could be added in the future.

darth-veitcher commented 1 year ago

Ah great. Thanks for the response @OlivierDehaene. The embeddings would be of interest for indexing content and subsequently using a vector store.

OlivierDehaene commented 1 year ago

Do you return an embedding for each token? I am not the most familiar with this use case.

darth-veitcher commented 1 year ago

I’m specifically looking at the use case of indexing content and storing it in something like Pinecone or OpenSearch for subsequent querying and retrieval.

Langchain has a good overview in their indexes documentation but essentially:

As a result I’d need an embedding function available both for the initial calculation and storage, and then again later to embed the query in the same way.

I think this is quite a common use case and pattern but I could be wrong.
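To make the pattern concrete, here is a minimal sketch of the index-then-query flow, assuming sentence-transformers as the embedding function and a plain in-memory cosine search standing in for Pinecone/OpenSearch (model name and documents are purely illustrative):

# Minimal sketch of the index-then-query pattern described above.
# sentence-transformers stands in for whatever embedding function TGI might expose;
# the in-memory cosine search stands in for Pinecone/OpenSearch.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# 1. Indexing: embed the documents once and store the vectors.
documents = [
    "TGI serves causal language models for text generation.",
    "Vector stores retrieve documents by embedding similarity.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

# 2. Querying: embed the query with the *same* function, then search by cosine similarity.
query_vector = model.encode(["how do I retrieve documents?"], normalize_embeddings=True)
scores = (doc_vectors @ query_vector.T).ravel()  # cosine similarity, since vectors are normalized
best = int(np.argmax(scores))
print(documents[best], float(scores[best]))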

sonsai123 commented 1 year ago

Really looking forward to this feature.

darth-veitcher commented 1 year ago

Any update on this in terms of priority, effort, or timeline, @OlivierDehaene? I appreciate all the work so far and can see there have been a lot of commits since this was originally raised!

sam-ulrich1 commented 1 year ago

Not sure if you all were looking for the return of embeddings from a decoder model (hidden state) or a dedicated implementation for things like sentence-transformers but I started a fork of this repo to work with sentence transformers.

It doesn't have model sharding or NCCL comms right now since none of the models in sentence-transformers are that large but hopefully we will support that some day!

https://github.com/Gage-Technologies/embedding-server

M-Chris commented 11 months ago

Would love to see an /embeddings endpoint for use with vector DBs like Pinecone, Weaviate, Faiss, Milvus, etc.

Hopefully a gentle bump and inspiration helps :)

Here's a couple references for inspiration: https://milvus.io/docs/integrate_with_hugging-face.md https://platform.openai.com/docs/api-reference/embeddings

jon-chuang commented 11 months ago

Hello, I am interested in implementing this feature. Any tips on the best pathway would be appreciated.

The focus would be around serving transformer-based dense embeddings.

Narsil commented 11 months ago

AFAIK, embeddings usually use very different models, and have very different properties. Including something here therefore doesn't make a whole lot of sense.

sentence-transformers (https://www.sbert.net/) is the basic way to go, no? (With open-source models, unlike OpenAI's embedding models, which we can't ever serve.)

There might be ways to create optimized serving, but launching a simple Flask server in front of sentence-transformers should be enough, no? Those models are usually tiny compared to LLMs.
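To make that concrete, here is a minimal sketch of a simple server in front of sentence-transformers; the /embed route, payload shape, and model are illustrative placeholders, not an official API:

# Rough sketch of a simple Flask server in front of sentence-transformers.
# Route name, payload shape, and model choice are illustrative only.
from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

@app.route("/embed", methods=["POST"])
def embed():
    texts = request.get_json()["inputs"]  # expects {"inputs": ["...", "..."]}
    vectors = model.encode(texts, normalize_embeddings=True)
    return jsonify(vectors.tolist())

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)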

jon-chuang commented 11 months ago

Hi @Narsil, I suppose you may be right that this is not necessarily the best framework. I was hoping for an out-of-the-box experience with:

  1. built-in HTTP endpoints
  2. queuing
  3. batched inference
  4. optimized concurrent serving (including choosing the right concurrency and the right serving runtime, e.g. ONNX)
  5. huggingface (& sentence-bert) integration.

There is an article by Vespa.ai on optimizing concurrent serving. Any tips on the right framework for serving embeddings (especially one integrated with huggingface) would be appreciated.

jon-chuang commented 11 months ago

That being said, if no framework exists which fits these requirements, it doesn't sound far-fetched that one could build upon the work in this repo. Serving sentence-bert models would be necessary.

sam-ulrich1 commented 11 months ago

That being said, if no framework exists which fits these requirements, it doesn't sound far-fetched that one could build upon the work in this repo.

If you look up in the thread you'll see a link to a project specifically for what you're asking. It's an embedding server derived from this repo. We'd love your help improving it. Right now it does get maintenance, but on an as-needed basis. With that said, it does work for any model that can be used with the sentence-transformers library. What we really need is to finish the Actions pipeline to roll out Docker images. If you build the Docker image manually, it will run.

jon-chuang commented 11 months ago

@sam-ulrich1 I am simply afraid that, since it is currently maintained for a single company, there is not enough visibility and long-term support for it to be worth investing in and recommending to users of LLM application frameworks (such as LlamaIndex).

It would be great if you could break down what you have managed to achieve with your fork and whether there might exist a pathway to merging it into this repo. Of course @Narsil and @OlivierDehaene would have to agree that it is a useful enough feature.

It does seem that quite a handful of users are interested in it. If the pathway is not complex, it seems like a win for the community.

sam-ulrich1 commented 11 months ago

Hate to say it but there's no chance it would get merged. It's a hard fork. With that said, the easiest way to make sure it stays supported is to help out!

jon-chuang commented 11 months ago

Hate to say it but there's no chance it would get merged

What I mean of course is to extract out the key changes and to contribute PRs to this repo.

sam-ulrich1 commented 11 months ago

Given the breadth of changes from causal language modeling to encoder models, I don't think it's likely that the team here at text-generation-inference would accept a PR for it (I don't speak for them, just speculating).

Embedding generation works very differently from the intended use case of this repo. With that said, you (or anyone else) are more than welcome to look over our repo and make a PR.

Narsil commented 11 months ago

Code complexity for something related to embeddings should be... MUCH smaller (there's no decode, no past key values, no paged attention).

I think flash attention would be the main asset and classic dynamic batching should work great.
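For illustration, classic dynamic batching is roughly this pattern: collect requests for a short window, run one batched forward pass, then fan the results back out to each caller. A bare-bones asyncio sketch (not TGI's actual router; the model and parameters are placeholders):

# Bare-bones dynamic batching: gather requests for a short window,
# run one batched forward pass, then return each caller its result.
# Illustrative only; a real server would run encode() off the event loop.
import asyncio
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
queue: asyncio.Queue = asyncio.Queue()

async def batcher(max_batch: int = 32, window_s: float = 0.01):
    while True:
        items = [await queue.get()]            # block until at least one request arrives
        await asyncio.sleep(window_s)          # then wait a short window for more
        while not queue.empty() and len(items) < max_batch:
            items.append(queue.get_nowait())
        vectors = model.encode([text for text, _ in items])  # one batched pass
        for (_, fut), vec in zip(items, vectors):
            fut.set_result(vec)

async def embed(text: str):
    fut = asyncio.get_running_loop().create_future()
    await queue.put((text, fut))
    return await fut

async def main():
    asyncio.create_task(batcher())
    vecs = await asyncio.gather(*(embed(f"sentence {i}") for i in range(8)))
    print(len(vecs), len(vecs[0]))

asyncio.run(main())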

jon-chuang commented 11 months ago

Ok, thanks folks. I will look into simpler solutions and look out for flash attention and dynamic batching.

RonanKMcGovern commented 10 months ago

@jon-chuang I'm not sure if this is what you were thinking, but it's probably easiest to add embeddings mostly in parallel to TGI (rather than deeply built in).

It may actually be better to handle all of this logic in the UI code, for example in chat-ui, and just use TGI as an API for feeding in input text.

OlivierDehaene commented 9 months ago

We will ship a new serving container in the coming weeks that only does text embeddings, focused on serverless use, dynamic batching, and our new Candle library.

Benvii commented 9 months ago

Hi @OlivierDehaene, this seems really interesting. Do you have a target release date for this text-embeddings server? We would be glad to beta test it at Credit Mutuel Arkea :)

RonanKMcGovern commented 9 months ago

btw @OlivierDehaene what layers are you using for the embeddings? Just the first layer?

Narsil commented 9 months ago

@RonanKMcGovern embeddings are done through dedicated models; here is a leaderboard we have for these: https://huggingface.co/spaces/mteb/leaderboard (always take leaderboards and benchmarks with a pinch of salt, your use case is rarely the benchmark under test).

@Benvii nice!

It's coming along nicely so far!

RonanKMcGovern commented 9 months ago

Ok, thanks @Narsil. I naively tested using the first layer of llama and it works pretty OK, but yeah, I imagine specialised models are better.
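For concreteness, that naive approach looks roughly like this: mean-pool the hidden states of a decoder-only LM (the model name below is just a placeholder; as noted above, a dedicated embedding model from the MTEB leaderboard will usually do better):

# Rough sketch of deriving an embedding from a decoder-only LM by mean-pooling
# hidden states, i.e. the naive approach discussed above. Dedicated embedding
# models (see the MTEB leaderboard) are usually the better choice.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # placeholder; any causal LM works the same way
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

inputs = tokenizer("hello embeddings", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

first_layer = out.hidden_states[1].mean(dim=1)   # hidden_states[0] is the token embedding layer
last_layer = out.hidden_states[-1].mean(dim=1)   # mean-pooled final hidden state
print(first_layer.shape, last_layer.shape)       # both (1, hidden_size)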

rahermur commented 9 months ago

Hi @OlivierDehaene,

We are planning the deployment of self-hosted models using text-generation-inference, but we would love to have the new service that you just mentioned for text-embeddings. Our complete use case is something that requires RAG with open source LLMs and embeddings. Here at Adyen we would also like to be early adopters or beta testers. Also we would like to contribute back to the project if there is an opportunity for it.

Narsil commented 9 months ago

@rahermur @OlivierDehaene is finishing it up, but we're seeing quite nice performance atm, and we're leveraging Candle for maximum performance (embedding models tend to be small, so the CPU bottleneck is even more noticeable than with LLMs).

michaelfeil commented 8 months ago

FYI, I just created a small project called infinity. It's a lightweight async implementation using FastAPI and pydantic for input validation, with torch and ctranslate2 under the hood; it performs dynamic batching via async and is under the MIT License.

ludwigprager commented 8 months ago

FYI, I just created a small project called infinity. It's a lightweight async implementation using FastAPI and pydantic for input validation, with torch and ctranslate2 under the hood; it performs dynamic batching via async and is under the MIT License.

Hi,

Great stuff. Looks like this could be the solution. I haven't quite gotten it working yet. Two questions:

BTW this is my docker-compose snippet:

  infinity:
    image: michaelf34/infinity:latest 
    ports:
      - 8081:8080
    volumes:
      - ./torch:/app/.cache/torch/
    command: --model-name-or-path sentence-transformers/all-MiniLM-L6-v2 --port 8080 --engine ctranslate2
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: ["gpu"]

And these are a few lines from my langchain app. I used the OllamaEmbeddings class. (Therefore, I also need a slight HTTP rewrite rule, which is why the port doesn't match the docker-compose file.)

import { OllamaEmbeddings } from "langchain/embeddings/ollama";
const embeddings = new OllamaEmbeddings({
  model: "all-MiniLM-L6-v2", // default value
  baseUrl: "http://" + process.env.LLM_ADDRESS + ":80", // default value
});

Does anyone have a better solution in mind? It's not urgent; this is working, though it's not too nice.

michaelfeil commented 8 months ago

@ludwigprager I did not want to "hijack" this issue; in case you have questions -> https://github.com/michaelfeil/infinity/issues

tl;dr: Thanks for the docker-compose snippet, but sorry, that's the wrong usage. Infinity is a drop-in replacement for e.g. OpenAI embeddings: https://platform.openai.com/docs/guides/embeddings/what-are-embeddings. In short, for text embeddings you want to deploy one of these models (not falcon/llama!!!): https://huggingface.co/spaces/mteb/leaderboard. The results you get are vectors, not auto-regressive text.

OlivierDehaene commented 8 months ago

@rahermur, @Benvii,

We just released https://github.com/huggingface/text-embeddings-inference. For now the scope is pretty limited; however, it is the fastest/cheapest solution I am aware of for serving the MTEB top 5.

It offers:

Feel free to try it out and report issues/ask for new features on this repo :)
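A quick sanity check against a running container looks something like this, using the /embed route from the TEI README (the port and input texts are whatever you launched with):

# Minimal client call against a local text-embeddings-inference container,
# assuming it is listening on port 8080 (route and payload per the TEI README).
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": ["Deep learning is fun", "So are embeddings"]},
)
vectors = resp.json()  # a list of float vectors, one per input
print(len(vectors), len(vectors[0]))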

Cheers!

sam-ulrich1 commented 8 months ago

@rahermur @OlivierDehaene @Narsil Just tested the new TEI... Damn that's snappy!

bge large: ~1.5 GiB VRAM @ 512t - full queue - default settings

----- Global Metrics -----
Total Requests: 4000
Successful Requests: 4000
Failed Requests: 0
Average Time Taken: 0.3564 seconds
Throughput: 111.93 requests/sec

Many thanks for the hard work!