Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting all sentence-transformer models and frameworks. Infinity is developed under the MIT License. Infinity powers inference behind Gradient.ai.
v2 CLI, including --api-key
This demo shows sentence-transformers/all-MiniLM-L6-v2 deployed at batch-size=2. After initialization, 3 requests (payloads of 1, 1, and 5 sentences) are sent via cURL from a second terminal.
pip install infinity-emb[all]
After your pip install, with your venv active, you can run the CLI directly.
infinity_emb v2 --model-id BAAI/bge-small-en-v1.5
Check the v2 --help command to get a description of all parameters.
infinity_emb v2 --help
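Once a server started with the command above is running, it can be queried over HTTP. The following is a minimal sketch using only the Python standard library. It assumes the server listens on http://localhost:7997 (the port used elsewhere in this README) and exposes an OpenAI-style POST /embeddings route that accepts a JSON body with model and input fields; check the Swagger UI of your deployment to confirm the exact route and schema. The helper names build_embedding_request and fetch_embeddings are illustrative and not part of infinity_emb.

```python
import json
import urllib.request


def build_embedding_request(base_url: str, model: str, sentences: list[str]) -> urllib.request.Request:
    """Build (but do not send) a POST request for the assumed /embeddings route."""
    payload = {"model": model, "input": sentences}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embeddings(base_url: str, model: str, sentences: list[str]) -> list[list[float]]:
    """Send the request and return one vector per sentence.

    Assumes an OpenAI-style response: {"data": [{"embedding": [...]}, ...]}.
    """
    request = build_embedding_request(base_url, model, sentences)
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.load(response)
    return [item["embedding"] for item in body["data"]]
```

Against a running server, fetch_embeddings("http://localhost:7997", "BAAI/bge-small-en-v1.5", ["Paris is in France."]) should return one vector per input sentence, provided the assumptions above hold.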
Instead of installing the CLI via pip, you may also use Docker to run michaelf34/infinity.
Make sure you mount your accelerator (i.e., install nvidia-docker and activate it with --gpus all).
port=7997
model1=michaelfeil/bge-small-en-v1.5
model2=mixedbread-ai/mxbai-rerank-xsmall-v1
volume=$PWD/data
docker run -it --gpus all \
-v $volume:/app/.cache \
-p $port:$port \
michaelf34/infinity:latest \
v2 \
--model-id $model1 \
--model-id $model2 \
--port $port
The cache path inside the Docker container is set by the environment variable HF_HOME.
Instead of the CLI and REST API, you can use Infinity through its Python API, which gives you the most flexibility.
The Python API builds on asyncio and its async/await features to allow concurrent processing of requests. The CLI arguments are also available in Python.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["Embed this sentence via Infinity.", "Paris is in France."]
array = AsyncEngineArray.from_args([
    EngineArgs(model_name_or_path="BAAI/bge-small-en-v1.5", engine="torch", embedding_dtype="float32", dtype="auto")
])

async def embed_text(engine: AsyncEmbeddingEngine):
    # The context manager starts the engine and stops it on exit.
    async with engine:
        embeddings, usage = await engine.embed(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    await engine.astop()

asyncio.run(embed_text(array[0]))
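Because engine.embed is a coroutine, several calls can be awaited together instead of one after another. The following is a minimal sketch, assuming an already started engine that exposes the embed(sentences=...) coroutine shown above and returns an (embeddings, usage) tuple. embed_concurrently is a hypothetical helper and not part of infinity_emb.

```python
import asyncio
from typing import Any, Sequence


async def embed_concurrently(engine: Any, batches: Sequence[list[str]]) -> list[tuple[Any, int]]:
    """Await one engine.embed call per batch concurrently.

    `engine` is duck-typed: anything with an async `embed(sentences=...)`
    method returning `(embeddings, usage)` works, e.g. a started engine
    from the example above (assumption).
    """
    tasks = [engine.embed(sentences=batch) for batch in batches]
    # gather preserves input order, so results[i] belongs to batches[i].
    return list(await asyncio.gather(*tasks))
```

Inside an async with engine: block, results = await embed_concurrently(engine, [["first"], ["second", "third"]]) would yield one (embeddings, usage) pair per batch, in the same order as the input batches.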
Example embedding models:
dstack allows you to provision a VM instance on the cloud of your choice. Write a service configuration file like the one below to deploy the BAAI/bge-small-en-v1.5 model wrapped in Infinity.
type: service
image: michaelf34/infinity:latest
env:
- INFINITY_MODEL_ID=BAAI/bge-small-en-v1.5;BAAI/bge-reranker-base;
- INFINITY_PORT=80
commands:
- infinity_emb v2
port: 80
Then run the following dstack command. A prompt will then let you choose the VM instance on which to deploy Infinity.
dstack run . -f infinity/serve.dstack.yml --gpu 16GB
For a more detailed tutorial and general information about dstack, visit the official docs.
Reranking gives you a similarity score between a query and multiple documents. Use it in conjunction with a vector database and embeddings, or standalone for a small number of documents. Select a model from Hugging Face that is an AutoModelForSequenceClassification with a single output class.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
query = "What is the python package infinity_emb?"
docs = ["This is a document not related to the python package infinity_emb, hence...",
    "Paris is in France!",
    "infinity_emb is a package for sentence embeddings and rerankings using transformer models in Python!"]
# AsyncEngineArray.from_args takes a list of EngineArgs; array[0] selects the first engine.
array = AsyncEngineArray.from_args(
    [EngineArgs(model_name_or_path="mixedbread-ai/mxbai-rerank-xsmall-v1", engine="torch")]
)

async def rerank(engine: AsyncEmbeddingEngine):
    async with engine:
        ranking, usage = await engine.rerank(query=query, docs=docs)
        print(list(zip(ranking, docs)))
    # or handle the async start / stop yourself.
    await engine.astart()
    ranking, usage = await engine.rerank(query=query, docs=docs)
    await engine.astop()

asyncio.run(rerank(array[0]))
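To turn the scores into an ordered result list, the documents can be sorted by score. The following is a minimal sketch, assuming ranking is a sequence of numeric relevance scores aligned with docs (as the zip in the example above suggests) and that a higher score means a more relevant document. top_documents is a hypothetical helper and not part of infinity_emb.

```python
from typing import Sequence


def top_documents(scores: Sequence[float], docs: Sequence[str], k: int = 3) -> list[tuple[float, str]]:
    """Return the k highest-scoring (score, document) pairs, best first."""
    if len(scores) != len(docs):
        raise ValueError("scores and docs must have the same length")
    paired = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return paired[:k]
```

With the variables from the example above, top_documents(ranking, docs, k=1) would return the single best match together with its score.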
When using the CLI, use this command to launch rerankers:
infinity_emb v2 --model-id mixedbread-ai/mxbai-rerank-xsmall-v1
Example models:
CLIP models are able to encode images and text at the same time.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["This is awesome.", "I am bored."]
images = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
engine_args = EngineArgs(
    model_name_or_path="wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M",
    engine="torch"
)
array = AsyncEngineArray.from_args([engine_args])

async def embed(engine: AsyncEmbeddingEngine):
    await engine.astart()
    embeddings, usage = await engine.embed(sentences=sentences)
    embeddings_image, _ = await engine.image_embed(images=images)
    await engine.astop()

asyncio.run(embed(array["wkcn/TinyCLIP-ViT-8M-16-Text-3M-YFCC15M"]))
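Since a CLIP model maps text and images into a shared vector space, the two kinds of embeddings can be compared directly, for example with cosine similarity. The following is a minimal sketch in pure Python, assuming each embedding is a one-dimensional sequence of floats (a list or a NumPy array) and that both have the same length. cosine_similarity is an illustrative helper and not part of infinity_emb.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

With the variables from the example above, cosine_similarity(embeddings[0], embeddings_image[0]) would score the first sentence against the image.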
Example models:
(Some models may additionally require pip install timm.)
Use text classification with Infinity's classify feature, which enables sentiment analysis, emotion detection, and other classification tasks.
import asyncio
from infinity_emb import AsyncEngineArray, EngineArgs, AsyncEmbeddingEngine
sentences = ["This is awesome.", "I am bored."]
engine_args = EngineArgs(
    model_name_or_path="SamLowe/roberta-base-go_emotions",
    engine="torch", model_warmup=True)
array = AsyncEngineArray.from_args([engine_args])

# The engine is passed in as an argument, matching the call at the bottom.
async def classifier(engine: AsyncEmbeddingEngine):
    async with engine:
        predictions, usage = await engine.classify(sentences=sentences)
    # or handle the async start / stop yourself.
    await engine.astart()
    predictions, usage = await engine.classify(sentences=sentences)
    await engine.astop()

asyncio.run(classifier(array["SamLowe/roberta-base-go_emotions"]))
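The following is a minimal sketch for picking the most likely label per sentence. It assumes predictions holds one list per input sentence, each containing dictionaries with label and score keys; this shape is an assumption, so inspect the actual return value of engine.classify before relying on it. top_label is a hypothetical helper and not part of infinity_emb.

```python
from typing import Mapping, Sequence, Union

Prediction = Mapping[str, Union[str, float]]


def top_label(prediction: Sequence[Prediction]) -> tuple[str, float]:
    """Return the (label, score) pair with the highest score for one sentence."""
    if not prediction:
        raise ValueError("prediction must contain at least one label")
    best = max(prediction, key=lambda entry: float(entry["score"]))
    return str(best["label"]), float(best["score"])
```

Under that assumption, [top_label(p) for p in predictions] would give one (label, score) pair per input sentence.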
Example models:
View the docs at https://michaelfeil.eu/infinity on how to get started.
After startup, the Swagger UI will be available under {url}:{port}/docs, in this case http://localhost:7997/docs. You can also find an interactive preview here: https://infinity.modal.michaelfeil.eu/docs (and https://michaelfeil-infinity.hf.space/docs)
Install via Poetry 1.7.1 and Python 3.11 on Ubuntu 22.04:
cd libs/infinity_emb
poetry install --extras all --with test
To pass the CI:
cd libs/infinity_emb
make format
make lint
poetry run pytest ./tests
All contributions must be compatible with the MIT License of this repo.
@software{feil_2023_11630143,
author = {Feil, Michael},
title = {Infinity - To Embeddings and Beyond},
month = oct,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.11630143},
url = {https://doi.org/10.5281/zenodo.11630143}
}