huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Expose Optimized Transformers Inference for ETL #6

Open sam-h-bean opened 1 year ago

sam-h-bean commented 1 year ago

Feature request

I'd like to use this library for really high-throughput ETL jobs as well as an inference server. The way I imagine this working is exposing some sort of object which can operate on in-memory datasets.

I am running under the assumption that this would be even more performant than native BetterTransformer inference in memory.

Motivation

Inspired by vLLM's offering, which is great for running LLMs in a big data or ETL setting.

from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
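
Concretely, the kind of in-memory embeddings interface I have in mind might look something like the sketch below. None of these names exist in text-embeddings-inference today; the class and the zero-vector stub are purely illustrative placeholders for the optimized Rust-backed inference.

from typing import List

class Embedder:
    # Hypothetical in-process embeddings interface; does not exist in TEI.
    def __init__(self, model_id: str, batch_size: int = 256):
        self.model_id = model_id
        self.batch_size = batch_size
        # A real implementation would load the model and tokenizer here, once.

    def embed(self, texts: List[str]) -> List[List[float]]:
        # A real implementation would run batched, optimized inference;
        # this stub only returns fixed-size zero vectors.
        return [[0.0] * 768 for _ in texts]

embedder = Embedder("BAAI/bge-base-en-v1.5")
vectors = embedder.embed(["San Francisco is a city in California."])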

Your contribution

If this is a reasonable first task, I would be happy to take a look.

michaelfeil commented 11 months ago

@sam-h-bean Not too easy. I tried that in a similar project (https://github.com/michaelfeil/infinity), which is written in pure Python and is a bit slower (roughly 2.5x less throughput).

Some starting points: the codebase here is purely in Rust, using tokio for the async parts, so you would likely want to launch a server with an open port, plus a channel for the gRPC calls.
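
In other words, the Python side of such a wrapper would mostly be a thin client against the launched server. A rough sketch of driving an already-running local server from Python for batch ETL could look like the following (shown over the HTTP /embed route rather than gRPC for brevity; it assumes the router is listening on 127.0.0.1:8080 and that /embed accepts a list of inputs, as in the quick tour, and the chunk size is arbitrary):

import requests

TEI_URL = "http://127.0.0.1:8080/embed"  # assumes text-embeddings-router is already running locally

def embed_batch(texts, chunk_size=32):
    # Embed an in-memory list of texts by sending chunked requests to the local server.
    vectors = []
    for i in range(0, len(texts), chunk_size):
        chunk = texts[i : i + chunk_size]
        resp = requests.post(TEI_URL, json={"inputs": chunk})
        resp.raise_for_status()
        vectors.extend(resp.json())  # one embedding per input text
    return vectors

embeddings = embed_batch(["What is Deep Learning?", "San Francisco is a city."])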

To solve your issue: