Content-Encoding: gzip - Githubissues

andrew-at-rise commented 3 months ago

I wonder if it would make sense to support compressed requests, esp. for /rerank, where the query and document list could be many 1k or 2k chunks of text? The incoming request could easily exceed 20 or 30k. The http server does not appear to handle gzipped request bodies, if present.

michaelfeil commented 3 months ago

Have you considered the gRPC protocol? If you fork the project and start building, that's something I would potentially consider pulling in.

Questions:

I have never heard of gzip requests. How does validation of requests (error 422 handling) work?
What kind of issues are you experiencing when sending e.g. 2k requests? Why is this feature needed?
Is sending 20-30k a good paradigm? When do you need it? Even with a GPU you can encode around 200-1000 texts per second. I think this encourages a bad workload.
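
On the validation question, one way to picture it: if decompression happens before the body reaches the route, JSON parsing and validation see the same bytes they would for an uncompressed request, so the 422 path is unchanged. The sketch below uses only the standard library; decode_json_body is a hypothetical helper, and returning 400 for a body that is not valid gzip is an assumption, not something the server does today.

```python
import gzip
import json
import zlib


def decode_json_body(raw: bytes, content_encoding: str = ""):
    """Return (status, payload): decompress first, then parse the JSON."""
    if content_encoding.lower() == "gzip":
        try:
            raw = gzip.decompress(raw)
        except (OSError, EOFError, zlib.error):
            # Body is not valid gzip; 400 is an assumed status code
            return 400, None
    try:
        return 200, json.loads(raw)
    except ValueError:
        # Same failure an uncompressed request with a bad body would hit
        return 422, None
```

Only the first step is new; everything after the decompression is the existing validation flow.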

peebles commented 3 months ago

Does your FastAPI server accept gRPC? I am using your docker container, behind nginx terminating TLS as a reverse proxy. Nginx apparently can proxy gRPC.

content-encoding: gzip is pretty common. All browsers will try to compress their request bodies if the server accepts.
I am not experiencing any issues with large requests. They can be slower is all.
My RAG text chunks are about 1k. My prompt, coming from Continue in VS Code, can be quite large (like an entire file.js). I fetch about 20 chunks from my vector database, then I want to re-rank. Do you think this is too much data?

Here is an example of decompression middleware for FastAPI:

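import gzip

from fastapi import FastAPI


class GZipRequestMiddleware:
    """Pure ASGI middleware that decompresses gzip-encoded request bodies."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        headers = scope.get("headers", []) if scope["type"] == "http" else []
        if dict(headers).get(b"content-encoding", b"").lower() != b"gzip":
            await self.app(scope, receive, send)
            return

        # Read the whole compressed body, then decompress it
        chunks, more_body = [], True
        while more_body:
            message = await receive()
            chunks.append(message.get("body", b""))
            more_body = message.get("more_body", False)
        body = gzip.decompress(b"".join(chunks))

        # Drop Content-Encoding and fix Content-Length for the downstream app
        kept = [(k, v) for k, v in headers if k not in (b"content-encoding", b"content-length")]
        kept.append((b"content-length", str(len(body)).encode()))
        scope = dict(scope, headers=kept)

        # Replay the decompressed body once, then defer to the real channel
        replayed = False

        async def decompressed_receive():
            nonlocal replayed
            if replayed:
                return await receive()
            replayed = True
            return {"type": "http.request", "body": body, "more_body": False}

        await self.app(scope, decompressed_receive, send)


app = FastAPI()

# Add the middleware to the app
app.add_middleware(GZipRequestMiddleware)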

After that, request.body() is used just as before.

I'll look into gRPC. I need speed.

michaelfeil commented 3 months ago

@peebles Thanks for the extensive example. Regarding decompressed_body = gzip.decompress(body): based on https://stackoverflow.com/questions/43628605/does-the-zlib-module-release-the-global-interpreter-lock-gil-in-python-3 I assume this will not affect the GIL or performance. The Starlette integration seems elegant and, at first glance, needs no extra dependencies!

Thoughts:

Could you do routing based on the json content?
Are you sure that the performance bottleneck is in sending/receiving the request? I think validation, tokenization, and especially forward pass of model will be much more compute heavy.
The response (embedding) should be all unique floats, with little pattern. JSON is kind of lossy, but I would consider adding a gRPC server to be more elegant, and it has more traction in the embedding community (https://github.com/huggingface/text-embeddings-inference?tab=readme-ov-file#grpc). gRPC is not supported by FastAPI.

peebles commented 3 months ago

"routing based on json content" sounds intriguing but I am not sure I understand ...
I am not sure the performance bottleneck is network transfer, although the machine hosting Infinity is a big iron monster and executes reranking almost instantaneously. Network transfer is probably my longer term concern.
Typically, client compression is not turned on unless the request payload goes above some threshold, like 1k or so, where the cost of transfer becomes greater than the cost of compression/decompression.
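
The threshold idea above can be sketched on the client side as follows. This is a minimal standard-library sketch, not Infinity's client: encode_request_body and GZIP_THRESHOLD_BYTES are hypothetical names, the 1024-byte cutoff just mirrors the "1k or so" figure, and the "query"/"documents" field names are assumed for a /rerank-style payload.

```python
import gzip
import json

# Hypothetical cutoff; the thread only says "1k or so"
GZIP_THRESHOLD_BYTES = 1024


def encode_request_body(payload, threshold=GZIP_THRESHOLD_BYTES):
    """Return (body, headers), gzipping the JSON body only above the threshold."""
    body = json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if len(body) > threshold:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    return body, headers


# A tiny request stays uncompressed; a rerank-sized one gets gzipped
small_body, small_headers = encode_request_body({"query": "hi", "documents": ["a"]})
large_payload = {"query": "what is gzip?", "documents": ["some chunk of text " * 60] * 20}
large_body, large_headers = encode_request_body(large_payload)
```

The returned body and headers can be handed to whatever HTTP client is in use; the server only needs to look at Content-Encoding to know whether to decompress.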

I am doing /rerank, where the input (to you) is a potentially large amount of text, and the output is a very small summary ... no floats, all text. In /rerank, it may make sense to compress the input but not the output ... the output is too small.

As for "I assume this will not affect the GIL or performance. decompressed_body = gzip.decompress(body)", I don't know. I come from more of a NodeJS background where everything is async.
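
On the GIL question: gzip.decompress is a synchronous call, so inside an async handler it occupies the event loop for as long as it runs. For bodies of a few tens of kilobytes that should be brief. If it ever matters, one option is to hand the call to a worker thread. The sketch below assumes Python 3.9+ for asyncio.to_thread; decompress_off_loop is a hypothetical helper name, and whether the thread actually runs in parallel depends on zlib releasing the GIL, which is what the Stack Overflow question linked earlier in the thread discusses.

```python
import asyncio
import gzip


async def decompress_off_loop(data: bytes) -> bytes:
    """Run gzip.decompress in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(gzip.decompress, data)


async def main() -> bytes:
    compressed = gzip.compress(b"some chunk of text " * 1000)
    return await decompress_off_loop(compressed)


restored = asyncio.run(main())
```

For small bodies the thread hand-off may cost more than it saves, so this is only worth considering if profiling shows decompression stalling other requests.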

I have seen significant performance improvements on past projects when I started compressing large network requests between clients on AWS and MongoDB servers at Atlas, for example, which is why I looked into this on Infinity in the first place.

peebles commented 3 months ago

What is the difference between Infinity and https://github.com/huggingface/text-embeddings-inference?

michaelfeil commented 3 months ago

@peebles It is the most similar project out there. I think TEI is an exciting project showcasing a new framework in Rust (I like Rust). Here are a couple of key differences.

Benchmarks: https://michaelfeil.eu/infinity/main/benchmarking/
more supported architectures (e.g. Mistral, due to Rust integration) + hardware
License: You can use Infinity if you are working for a for-profit company.

Re: Routing: https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-gzip-compression-decompression.html e.g. via AWS API Gateway and similar.

@peebles Feel free to PR the gzip compression, I can add a unit test if needed.

peebles commented 3 months ago

I'll look into doing the PR.

michaelfeil / infinity

Content-Encoding: gzip #136