runpod-workers / worker-infinity-embedding


Performance expectations #5

Open Sopamo opened 4 months ago

Sopamo commented 4 months ago

Thanks for working on this!

I've been testing running embeddings in a RunPod serverless environment, but the performance isn't what I would have expected. For running bge-m3, we're seeing an end-to-end latency of ~600ms. RunPod itself reports around 100ms delay time and around 110ms processing time.

I tried running bge-m3 locally on my machine (on a GeForce 4080, directly via Python using BGEM3FlagModel). For the first embedding I see a very high latency as well (~180ms), but for the embeddings afterwards the latency is very low, as expected: around 4-5ms for simple text.

I don't see obvious reasons why requests after the first one on a running worker would still take 100+ms. Is this something that can be improved somehow? I would be willing to contribute, but would like to ask first if this performance is to be expected or if there is potential to improve it.

I would also like to ask about the 100ms delay time. What could be the reasons for it being so high, even though the worker is already running?

We are using European data centers. Could it be that the requests are somehow routed through the US?
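For reference, here is a minimal sketch of how end-to-end latency against a warm worker can be measured. The endpoint ID is a placeholder and the payload is an assumed input schema; worker-infinity-embedding's exact format may differ:

import os
import time

import requests

# Placeholder endpoint ID; RUNPOD_API_KEY must be set in the environment.
ENDPOINT_ID = "your-endpoint-id"
API_KEY = os.environ["RUNPOD_API_KEY"]
URL = f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync"

# Assumed input schema, not necessarily the worker's exact format.
payload = {"input": {"model": "BAAI/bge-m3", "input": ["What is BGE M3?"]}}
headers = {"Authorization": f"Bearer {API_KEY}"}

# The first request warms the worker; the following ones should hit a warm worker.
for i in range(6):
    start = time.time()
    requests.post(URL, json=payload, headers=headers, timeout=60)
    print(f"request {i}: {(time.time() - start) * 1000:.0f} ms end-to-end")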

This is the Python script I used for testing locally:

import time

from FlagEmbedding import BGEM3FlagModel

# Loading the model is the expensive part and only happens once.
model = BGEM3FlagModel('BAAI/bge-m3', use_fp16=False)

sentences = ["What is BGE M3?"]
sentences2 = ["More text"]
sentences3 = ["<<< More text"]

def get_embeddings(inputs):
    # Time a single encode call; the dense vectors themselves are discarded.
    start_time = time.time()
    model.encode(inputs)['dense_vecs']
    end_time = time.time()

    execution_time = end_time - start_time
    print(f"Execution time: {execution_time} seconds")

get_embeddings(sentences)
get_embeddings(sentences2)
get_embeddings(sentences3)

Output:

Execution time: 0.214857816696167 seconds
Execution time: 0.004781007766723633 seconds
Execution time: 0.00433349609375 seconds
michaelfeil commented 3 months ago

For the first request, the CUDA graph might be initialized on the GPU, which leads to the cold-start delay. The application is designed for high throughput.
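If you want to rule that out, one option is to run a dummy embedding once at startup so the first user request doesn't pay for the initialization. A rough sketch (illustrative only, not the actual worker-infinity-embedding code):

# Sketch: warm up the model once at startup so the first real request
# does not pay for CUDA graph capture / lazy initialization.
# The handler wiring below is illustrative, not the actual worker code.
import runpod
from FlagEmbedding import BGEM3FlagModel

model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)
model.encode(["warmup"])  # pay the one-time initialization cost here

def handler(job):
    texts = job["input"]["input"]  # assumed input schema
    vectors = model.encode(texts)["dense_vecs"]
    return {"embeddings": [vector.tolist() for vector in vectors]}

runpod.serverless.start({"handler": handler})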

Sopamo commented 3 months ago

Yes, I wasn't confused about the cold-start delay, but about the delay for requests that are sent to warm workers. Sorry, I think I wasn't explicit enough about this. The numbers we are seeing when running via RunPod were all measured with warm workers.

michaelfeil commented 3 months ago

Interesting, thanks for flagging that - maybe you hit a "cold replica"?

Sopamo commented 3 months ago

I just tried it again. For testing we only have a single worker deployed. The warm requests consistently have ~100ms delay and ~100ms execution time:

(Screenshot of the RunPod console for the serverless endpoint, showing the delay and execution times.)

Here are the container logs, maybe that helps: logs.txt

I would be open to debugging this further myself, but before I do that I wanted to ask if there are any obvious reasons for these delays.
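As a starting point, here is a minimal sketch of reading the queueing vs. execution split out of the runsync response; the delayTime and executionTime fields appear to correspond to the numbers the console shows, but treat the field names and the input schema as assumptions:

import os
import time

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]

start = time.time()
resp = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json={"input": {"input": ["What is BGE M3?"]}},  # assumed input schema
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
total_ms = (time.time() - start) * 1000
body = resp.json()

# delayTime = time the job spent queued before a worker picked it up,
# executionTime = time spent inside the handler (both in ms, assuming the
# fields match what the console reports).
print("end-to-end:", round(total_ms), "ms")
print("delayTime:", body.get("delayTime"), "ms")
print("executionTime:", body.get("executionTime"), "ms")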

TimPietrusky commented 1 month ago

@Sopamo as far as I understand, there is no obvious reason for the delay.

When you tested this locally, you didn't use the worker-infinity-embedding, right? So maybe that would be something to try out if you have the time and energy.

Sopamo commented 1 month ago

@TimPietrusky thanks for getting back to me! That is true, will try!

TimPietrusky commented 1 month ago

@Sopamo thank you very much! And please let us know if you find anything!

Sopamo commented 1 month ago

@TimPietrusky I did some more testing. The delay seems to depend on the data center I'm using. I chose 4 different data centers that are far apart from each other, configured the endpoint to use a single worker, made a request to warm up the worker, and then did 5 requests directly after each other. The following are the lowest delay and execution values I got. The execution times seem OK, but the delay times are all very high (from the perspective of trying to build real-time applications with it):

I also tried running the worker locally (see the benchmark sketch after the numbers below):

Percentage of the requests served within a certain time (ms)
  50%      6
  66%      6
  75%      7
  80%      7
  90%      8
  95%      9
  98%     10
  99%     12
 100%     27 (longest request)
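A sketch of how such a local benchmark can be reproduced, assuming the worker is started via the SDK's local test server (python handler.py --rp_serve_api) and listening on port 8000; the path and input schema are assumptions and may need adjusting:

import time

import requests

# Assumes a locally running worker started with `python handler.py --rp_serve_api`.
URL = "http://localhost:8000/runsync"
payload = {"input": {"input": ["What is BGE M3?"]}}  # assumed input schema

latencies = []
for _ in range(500):
    start = time.time()
    requests.post(URL, json=payload, timeout=10)
    latencies.append((time.time() - start) * 1000)

latencies.sort()
for p in (50, 66, 75, 80, 90, 95, 98, 99, 100):
    index = min(len(latencies) - 1, int(len(latencies) * p / 100))
    print(f"{p:3d}% {latencies[index]:6.1f} ms")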

As a comparison, I did the same against the runsync endpoint of my worker that's running in RunPod:

So in total that gives us a p90 of ~200ms instead of ~7ms once the worker is running on RunPod infrastructure (in this case EU-RO-1). I suspect this either comes down to inefficiencies in the queueing code that runs in RunPod production but not in the local rp_serve_api version, or there are a few HTTP requests that the worker makes to communicate with some kind of central RunPod infrastructure, and these add up to the additional time.
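One way to separate pure network round-trip time from queueing overhead is to time a request that never reaches the worker, for example the endpoint's /health route, and compare it with the full runsync latency. A rough sketch (the /health path and the input schema are assumptions):

import os
import time

import requests

ENDPOINT_ID = "your-endpoint-id"  # placeholder
API_KEY = os.environ["RUNPOD_API_KEY"]
headers = {"Authorization": f"Bearer {API_KEY}"}

def timed(method, url, **kwargs):
    start = time.time()
    requests.request(method, url, headers=headers, timeout=30, **kwargs)
    return (time.time() - start) * 1000

# Pure API round trip: no job is queued or executed.
health_ms = timed("GET", f"https://api.runpod.ai/v2/{ENDPOINT_ID}/health")
# Full path: queueing, pick-up by the warm worker, execution, response.
runsync_ms = timed(
    "POST",
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    json={"input": {"input": ["What is BGE M3?"]}},  # assumed input schema
)
print(f"health: {health_ms:.0f} ms, runsync: {runsync_ms:.0f} ms, "
      f"difference (queueing + execution): {runsync_ms - health_ms:.0f} ms")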

I'd love to get some feedback on whether this (using RunPod serverless for real-time applications) is something that you plan to improve, or whether you are currently focusing on providing good service for applications that don't rely on latency being as low as possible.

I'd also be happy to help if I can, so let me know if I can do anything else :)