michaelfeil / infinity

Infinity is a high-throughput, low-latency REST API for serving vector embeddings, supporting a wide range of text-embedding models and frameworks.
https://michaelfeil.eu/infinity/
MIT License
972 stars 72 forks

Content-Encoding: gzip #136

Open andrew-at-rise opened 3 months ago

andrew-at-rise commented 3 months ago

I wonder if it would make sense to support compressed requests, especially for /rerank, where the query and document list could contain many 1k or 2k chunks of text. The incoming request could easily exceed 20 or 30k. The HTTP server does not appear to handle gzipped request bodies when they are present.
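For illustration, here is a minimal client-side sketch of what such a compressed /rerank request could look like. It only builds the request and compares sizes; it does not send anything. The URL, the port, the payload fields (`query`, `documents`), and the helper name `build_gzipped_rerank_request` are assumptions made for this example, not confirmed details of Infinity's API.

```python
import gzip
import json
import urllib.request


def build_gzipped_rerank_request(url, query, documents):
    """Build a POST request whose JSON body is gzip-compressed.

    Returns the request plus the raw and compressed body sizes in bytes.
    """
    raw = json.dumps({"query": query, "documents": documents}).encode("utf-8")
    compressed = gzip.compress(raw)
    request = urllib.request.Request(
        url,
        data=compressed,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Tells the server that the body must be gunzipped before parsing.
            "Content-Encoding": "gzip",
        },
    )
    return request, len(raw), len(compressed)


# Hypothetical endpoint and synthetic documents, for illustration only.
docs = [
    f"Chunk {i}: " + "the quick brown fox jumps over the lazy dog. " * 40
    for i in range(20)
]
request, raw_size, gz_size = build_gzipped_rerank_request(
    "http://localhost:7997/rerank", "Which animal jumps?", docs
)
print(f"raw={raw_size} bytes, gzipped={gz_size} bytes")
```

The synthetic documents here are highly repetitive, so the size reduction is larger than real text would give. A server that does not decode `Content-Encoding: gzip` would fail to parse such a body.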

michaelfeil commented 3 months ago

Have you considered the gRPC protocol? If you fork the project and start building, that's something I would potentially consider pulling in.

Questions:

peebles commented 3 months ago

Does your FastAPI server accept gRPC? I am using your docker container, behind nginx terminating TLS as a reverse proxy. Nginx apparently can proxy gRPC.

Here is an example of decompression middleware for FastAPI:

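import gzip

from fastapi import FastAPI
from starlette.datastructures import Headers, MutableHeaders

class GZipRequestMiddleware:
    # Pure ASGI middleware: Starlette reads the body from the `receive`
    # channel, so assigning scope['body'] would not reach the route handlers.
    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope['type'] != 'http' or Headers(scope=scope).get('content-encoding', '').lower() != 'gzip':
            await self.app(scope, receive, send)
            return

        # Read the whole compressed request body, then decompress it
        chunks, more_body = [], True
        while more_body:
            message = await receive()
            chunks.append(message.get('body', b''))
            more_body = message.get('more_body', False)
        body = b''.join(chunks)
        decompressed_body = gzip.decompress(body)

        # Make the headers describe the decompressed body
        headers = MutableHeaders(scope=scope)
        del headers['content-encoding']
        headers['content-length'] = str(len(decompressed_body))

        # Replay the decompressed body once, then fall back to the real channel
        replayed = False
        async def replay():
            nonlocal replayed
            if replayed:
                return await receive()
            replayed = True
            return {'type': 'http.request', 'body': decompressed_body, 'more_body': False}

        await self.app(scope, replay, send)

app = FastAPI()

# Add the middleware to the app
app.add_middleware(GZipRequestMiddleware)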

After that, request.body is used just as before.
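One caveat when decompressing request bodies on the server: a small gzip payload can expand into a very large one. Below is a minimal sketch of capping the decompressed size using only the standard library's `zlib`. The 10 MiB limit and the names `bounded_gunzip` and `BodyTooLarge` are hypothetical choices for this example and are not part of Infinity, FastAPI, or Starlette.

```python
import gzip
import zlib

MAX_DECOMPRESSED_BYTES = 10 * 1024 * 1024  # hypothetical cap: 10 MiB


class BodyTooLarge(ValueError):
    """Raised when a body would decompress past the configured limit."""


def bounded_gunzip(data: bytes, limit: int = MAX_DECOMPRESSED_BYTES) -> bytes:
    # wbits=31 tells zlib to expect a gzip header and trailer.
    decompressor = zlib.decompressobj(wbits=31)
    # Request at most limit + 1 bytes, so an oversized body is detected
    # without inflating all of it into memory.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise BodyTooLarge(f"decompressed body exceeds {limit} bytes")
    if not decompressor.eof:
        raise ValueError("incomplete gzip stream")
    return out


small = gzip.compress(b"x" * 1000)
print(len(bounded_gunzip(small, limit=1000)))
```

A middleware could call a helper like this in place of `gzip.decompress` and map `BodyTooLarge` to a 413 response. That mapping is a design suggestion, not something the thread specifies.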

I'll look into gRPC. I need speed.

michaelfeil commented 3 months ago

@peebles Thanks for the extensive example.

https://stackoverflow.com/questions/43628605/does-the-zlib-module-release-the-global-interpreter-lock-gil-in-python-3 -> I assume this will not affect the GIL or performance: decompressed_body = gzip.decompress(body)

The Starlette integration seems elegant and, at first glance, needs no extra dependencies!
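Whether or not the GIL is released, a synchronous `gzip.decompress` call inside an `async` function keeps the event loop busy until it returns. One possible mitigation, sketched below, is to run the call in a worker thread with `asyncio.to_thread` (Python 3.9+). The function name `decompress_off_loop` is hypothetical, and the sketch assumes the linked discussion is right that zlib releases the GIL while it works.

```python
import asyncio
import gzip


async def decompress_off_loop(body: bytes) -> bytes:
    # Run the blocking call in a worker thread so other coroutines
    # can keep making progress while the body is being decompressed.
    return await asyncio.to_thread(gzip.decompress, body)


async def main() -> int:
    payload = gzip.compress(b"some rerank document text " * 10_000)
    result = await decompress_off_loop(payload)
    return len(result)


decompressed_size = asyncio.run(main())
print(decompressed_size)
```

For request bodies of a few tens of kilobytes, the thread hand-off may cost more than the decompression itself, so it is worth measuring before adopting this.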

Thoughts:

  1. Could you do routing based on the json content?
  2. Are you sure that the performance bottleneck is in sending/receiving the request? I think validation, tokenization, and especially forward pass of model will be much more compute heavy.
  3. The response (embedding) should be all unique floats with little pattern. JSON is kind of lossy, but I would consider adding a gRPC server to be more elegant, and it has more traction in the embedding community (https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#grpc). gRPC is not supported by FastAPI.

peebles commented 3 months ago

I am doing /rerank, where the input (to you) is a potentially large amount of text, and the output is a very small summary ... no floats, all text. In /rerank, it may make sense to compress the input but not the output ... the output is too small.
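The asymmetry between text input and float output can be illustrated with a rough sketch that compares gzip ratios for a text-heavy JSON payload and a JSON list of random floats. All data here is synthetic and the payload field names are assumptions. Real documents and real embeddings will give different numbers.

```python
import gzip
import json
import random


def gzip_ratio(payload) -> float:
    """Compressed size divided by raw size for a JSON-encoded payload."""
    raw = json.dumps(payload).encode("utf-8")
    return len(gzip.compress(raw)) / len(raw)


rng = random.Random(0)  # fixed seed so the comparison is repeatable
vocabulary = (
    "query document rerank model embedding vector text chunk server request"
).split()

# Synthetic "documents": random words drawn from a small vocabulary.
text_payload = {
    "documents": [
        " ".join(rng.choice(vocabulary) for _ in range(200)) for _ in range(20)
    ]
}
# Synthetic "embedding": uniformly random floats with no shared pattern.
float_payload = {"embedding": [rng.random() for _ in range(4000)]}

text_ratio = gzip_ratio(text_payload)
float_ratio = gzip_ratio(float_payload)
print(f"text ratio={text_ratio:.2f}, float ratio={float_ratio:.2f}")
```

A lower ratio means better compression. The small vocabulary makes the text compress better than natural language would, so treat this as a directional comparison only.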

As for "I assume this will not affect the GIL or performance. decompressed_body = gzip.decompress(body)", I don't know. I come from more of a NodeJS background where everything is async.

I have seen significant performance improvements on past projects when I started compressing large network requests, for example between clients on AWS and MongoDB servers at Atlas, which is why I looked into this on Infinity in the first place.

peebles commented 3 months ago

What is the difference between Infinity and https://github.com/huggingface/text-embeddings-inference?

michaelfeil commented 3 months ago

@peebles It is the most similar project out there. I think TEI is an exciting project showcasing a new framework in Rust (I like Rust). Here are a couple of key differences.

Re: routing: e.g. via AWS API Gateway and similar, see https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gzip-compression-decompression.html

@peebles Feel free to PR the gzip compression, I can add a unit test if needed.

peebles commented 3 months ago

I'll look into doing the PR.