huggingface / text-embeddings-inference

A blazing fast inference solution for text embeddings models
https://huggingface.co/docs/text-embeddings-inference/quick_tour
Apache License 2.0

Expose Optimized Transformers Inference for ETL #6

Open sam-h-bean opened 1 year ago

sam-h-bean commented 1 year ago

Feature request

I'd like to use this library for really high-throughput ETL jobs as well as an inference server. The way I imagine this working is exposing some sort of object which can operate on in-memory datasets.

I am running under the assumption that this would be even more performant than native BetterTransformer inference in memory.

Motivation

Inspired by vLLM's offering, which is great for running LLMs in a big data or ETL setting.

from vllm import LLM
llm = LLM("facebook/opt-13b", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
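
Concretely, the kind of in-memory embeddings interface I have in mind might look something like the sketch below. None of these names exist in text-embeddings-inference today; the class and the zero-vector stub are purely illustrative placeholders for the optimized Rust-backed inference.

from typing import List

class Embedder:
    # Hypothetical in-process embeddings interface; does not exist in TEI.
    def __init__(self, model_id: str, batch_size: int = 256):
        self.model_id = model_id
        self.batch_size = batch_size
        # A real implementation would load the model and tokenizer here, once.

    def embed(self, texts: List[str]) -> List[List[float]]:
        # A real implementation would run batched, optimized inference;
        # this stub only returns fixed-size zero vectors.
        return [[0.0] * 768 for _ in texts]

embedder = Embedder("BAAI/bge-base-en-v1.5")
vectors = embedder.embed(["San Francisco is a city in California."])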

Your contribution

If this is a reasonable first task, I would be happy to take a look.

michaelfeil commented 11 months ago

@sam-h-bean Not too easy. I tried that in a similar project (https://github.com/michaelfeil/infinity), which is written in pure Python and is a bit slower (roughly 2.5x less throughput).

Some starting points: the codebase here is purely in Rust, using tokio for the async parts, so you would likely want to launch a server with an open port, plus a channel for the gRPC calls.
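
In other words, the Python side of such a wrapper would mostly be a thin client against the launched server. A rough sketch of driving an already-running local server from Python for batch ETL could look like the following (shown over the HTTP /embed route rather than gRPC for brevity; it assumes the router is listening on 127.0.0.1:8080 and that /embed accepts a list of inputs, as in the quick tour, and the chunk size is arbitrary):

import requests

TEI_URL = "http://127.0.0.1:8080/embed"  # assumes text-embeddings-router is already running locally

def embed_batch(texts, chunk_size=32):
    # Embed an in-memory list of texts by sending chunked requests to the local server.
    vectors = []
    for i in range(0, len(texts), chunk_size):
        chunk = texts[i : i + chunk_size]
        resp = requests.post(TEI_URL, json={"inputs": chunk})
        resp.raise_for_status()
        vectors.extend(resp.json())  # one embedding per input text
    return vectors

embeddings = embed_batch(["What is Deep Learning?", "San Francisco is a city."])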

To solve your issue: