tnc-ca-geo / animl-api

Backend for https://animl.camera

Concurrency findings #101

Closed nathanielrindlaub closed 1 year ago

nathanielrindlaub commented 1 year ago

@ingalls - documenting my findings here after testing various concurrency settings on the GraphQL Lambda, the inference Lambda, and the SageMaker Serverless endpoints.

Ok, so the TL;DR is that the message consumer (the inference Lambda) is doing all of the throttling work on the inference side of things. We currently have its concurrency set to 10 and the number of messages each instance can peel off at once set to 10, so it can process up to 100 messages concurrently. If it's maxed out and all 10 instances are busy, the queue simply grows and waits until inference instances free up.
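For reference, a minimal sketch of what those two knobs look like as a serverless.yml function entry; the handler path, queue name, and timeout here are assumptions, not pulled from the repo:

```yaml
# Hypothetical sketch, not the repo's actual config.
# 10 reserved instances x 10 messages per poll = up to 100 messages in flight.
functions:
  inference:
    handler: src/ml/handler.inference   # assumed handler path
    reservedConcurrency: 10             # max concurrent instances of this Lambda
    timeout: 120                        # seconds; see the visibility-timeout note below
    events:
      - sqs:
          arn:
            Fn::GetAtt: [InferenceQueue, Arn]
          batchSize: 10                 # messages each invocation pulls off the queue
```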

What that means is that all downstream services (i.e., the SageMaker endpoints) need to be able to provision at least 100 concurrent instances each to handle message processing under full strain. If we set the SageMaker endpoints' concurrency lower than that, we start seeing a lot of messages bounce back with ThrottlingError exceptions. That's sort of fine, because they'll get retried and eventually processed, but it's sub-optimal: each time a message gets (re)consumed, it's blocked from further consumption for the duration of its VisibilityTimeout, which according to AWS docs should be 6x the function timeout value (in our case 12 mins). So processing takes a loooong time to finish if we let the inference Lambda spam under-provisioned downstream SageMaker endpoints.
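For context, the setting that produces that 12-minute window is the queue's VisibilityTimeout. A hedged sketch of the queue resource, assuming a 2-minute function timeout (which is what 6x = 12 min implies); the queue name is illustrative:

```yaml
# Sketch only; queue name is illustrative.
# AWS recommends a visibility timeout of at least 6x the consumer Lambda's timeout,
# so a 120s function timeout implies 720s (12 min) during which a consumed-but-failed
# message (e.g. one that hit a ThrottlingError) stays invisible before it can be retried.
resources:
  Resources:
    InferenceQueue:
      Type: AWS::SQS::Queue
      Properties:
        QueueName: inferenceQueue      # illustrative
        VisibilityTimeout: 720         # seconds = 6 x the 120s Lambda timeout
```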

In the case of multiple simultaneous batch queues, even though multiple queues get spun up, all of their messages get consumed by the same inference Lambda, which still performs that throttling role, so simultaneous batches don't add downstream strain.

Working backwards from our AWS account limits... "the total concurrency you can share between all serverless endpoints per Region in your account is 1000"... so if we wanted to host 10 different models in our current region, and each needs both a real-time endpoint and a batch endpoint, then each endpoint can have a concurrency setting of 50 (1000 / (10 models x 2 endpoints)). Note: right now we also have separate dev sets of endpoints, so if we keep doing that it's 1000 / (10 x 4) = 25, but I think we should figure out a way to not use up concurrency quotas for dev because it seems wasteful.
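The knob that allocates that per-endpoint share is MaxConcurrency on each serverless endpoint config. A sketch of one endpoint config with illustrative names and memory size (not the actual stack resources):

```yaml
# Hypothetical endpoint config; names and memory size are illustrative.
# Regional quota of 1000 / (10 models x 2 endpoints per model) = 50 per endpoint.
resources:
  Resources:
    MegadetectorBatchEndpointConfig:
      Type: AWS::SageMaker::EndpointConfig
      Properties:
        ProductionVariants:
          - ModelName: megadetector        # illustrative model name
            VariantName: AllTraffic
            ServerlessConfig:
              MaxConcurrency: 50           # this endpoint's share of the 1000 quota
              MemorySizeInMB: 4096         # illustrative
```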

So that's pretty solid... that would mean we could process ~10k messages in 40 mins or so. I think this is an excellent starting point, and if in the future we end up needing to deploy more than 10 models, we could:

I still have a few outstanding questions, however, and would love your thoughts @ingalls:

  1. Is there an advantage or disadvantage to processing one message at a time with the inference Lambda's concurrency set to 100, versus processing batches of 10 messages at a time with the Lambda's concurrency set to 10? Fewer instances = cost savings? No strong feelings either way.
  2. I want to re-interrogate why we need separate SageMaker endpoints for our real-time and batch queues. They're fed from different queues, but ultimately messages from both queues get consumed by the same inference Lambda, so I'm not sure that, as it's currently configured, we're adding any efficiency by having separate endpoints (the inference Lambda is still going to bottleneck the rate of processing at 100 concurrent messages).
  3. To that point ^ I'm actually unclear now on how we're currently prioritizing real-time messages over batch, now that we've moved away from the FIFO/priority-check approach and the inference Lambda is just picking messages off both queues in no particular order and with no particular preference.
  4. This is a reach, but ideally we'd figure out a way to set the concurrency of the megadetector endpoints high, because they'll for sure be the most frequently used, and set the classifier endpoints lower. That doesn't seem possible without deploying different versions of the inference Lambda to the stack with varying levels of concurrency and then having separate sets of queues for classifier requests vs. object-detection requests. That seems like it might be overly complicated, but in theory it would allow us to deploy more models to the same region.

The more I think about it, if we want to be able to prioritize real-time processing over batch, I think we'll need to deploy two sets of inference Lambdas, with the real-time inference Lambda and its corresponding SageMaker endpoints set to considerably lower concurrency. We can use the same inference handler and just create another function entry in our serverless.yml with a different name and reservedConcurrency setting. This would allow us to truly process batch and real-time images in two separate, non-blocking streams.
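If we go that route, it would look roughly like the sketch below; the function names, handler path, and concurrency values are placeholders, not actual repo config:

```yaml
# Sketch of two entries sharing one handler; names and values are placeholders.
functions:
  inferenceRealtime:
    handler: src/ml/handler.inference    # assumed shared handler
    reservedConcurrency: 2               # placeholder; kept considerably lower than batch
    events:
      - sqs:
          arn:
            Fn::GetAtt: [RealtimeQueue, Arn]
          batchSize: 1
  inferenceBatch:
    handler: src/ml/handler.inference    # same handler, different function entry
    reservedConcurrency: 10
    events:
      - sqs:
          arn:
            Fn::GetAtt: [BatchQueue, Arn]
          batchSize: 10
```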

Any thoughts on any of that? Does that all make sense?

ingalls commented 1 year ago
  1. The primary/known advantage is cost: SQS charges per request, and with our current 10x batch size we cut our SQS request costs by 10x. We should also see an increase in inference throughput, since we request inferences from the SageMaker endpoint concurrently within the execution context.

  2. (& 3) @nathanielrindlaub 100% agree. We talked about this on our last call: FIFO queues were the primary method by which we could handle priority, and without them priority is harder to quantify/guarantee. The most straightforward way to handle it now would be to deploy a batch-specific inference Lambda (identical to the existing generic one) alongside the batch stack that gets created. I don't have a good sense of how this would affect deploy times or throughput, as it's not an infrastructure paradigm I've deployed before. Getting a handle on it would likely mean simply building it and comparing against existing throughput/times.

  4. It might be worth reaching out to our TAM or AWS support and asking what degree of flexibility there is in that 1,000-per-region serverless endpoint concurrency quota. It feels like a limit that exists mainly to prevent an unexpected, massive uptick in service usage by a single customer, so it's likely negotiable.

Conclusion: 100% agree here - I think the next phase of work is likely to move the inference handler to a dockerized instance that can be called/shared by the api-ingest stack-creation functionality. I anticipate this would take 2-4 hours to move from an inline Lambda to Docker and another 2-4 hours to get the stack-creation functionality to additionally create and call the Lambda function.
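For reference, a rough sketch of what a container-image-based function entry looks like in serverless.yml; the image name and Dockerfile path are placeholders and not tied to the actual repo layout:

```yaml
# Placeholder sketch of a container-image-based Lambda; names and paths are illustrative.
provider:
  name: aws
  ecr:
    images:
      inference:
        path: ./src/ml                   # directory containing a Dockerfile (assumed)
functions:
  inference:
    image:
      name: inference                    # references provider.ecr.images.inference
    reservedConcurrency: 10
```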

nathanielrindlaub commented 1 year ago

@ingalls ok great, I think I'm tracking, although I'm not sure I follow the need to move to a dockerized instance of the inference Lambda that gets spun up by the ingestion stack each time a new upload is requested.

In theory we only need one additional identical inference lambda that could handle processing messages from all the upload batch queues, right? So it doesn't seem necessary to create new instances dedicated to each new upload/batch stack. Maybe I'm missing something? LMK if you want to huddle quickly on this. I'm available most of today & tomorrow.

nathanielrindlaub commented 1 year ago

Making note of some outstanding todos here: we did add a second inference Lambda, so we now have a dedicated one for batch and a dedicated one for real-time, but I think we still need to adjust their reserved concurrencies (both are currently at 10; the batch one has a batch size of 10, while the real-time one has a batch size of 1). I think we should start by (a) assuming we can support 10 models per region, so each model has 100 concurrent SageMaker endpoint instances to work with, and (b) doing an 80/20 split between batch and real-time, respectively. That would entail:
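The original breakdown didn't get captured above, but under those assumptions the split works out roughly like this (a sketch only, with placeholder function names; exact values may differ from what we actually settle on):

```yaml
# Back-of-envelope split, assuming 100 concurrent SageMaker instances per model and
# the batch sizes noted above (10 for batch, 1 for real-time); values are a sketch only.
#   batch:     ~80 in-flight messages -> reservedConcurrency 8  x batchSize 10
#   real-time: ~20 in-flight messages -> reservedConcurrency 20 x batchSize 1
functions:
  inferenceBatch:
    reservedConcurrency: 8     # 8 instances x 10 messages each = 80 concurrent
  inferenceRealtime:
    reservedConcurrency: 20    # 20 instances x 1 message each = 20 concurrent
```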

nathanielrindlaub commented 1 year ago

Also making a note here that I set the ingest-image Lambda's concurrency to 30. It was previously not set at all, and an upstream change in our unzipping library caused the ingest-zip batch task to spam the ingestion S3 bucket, which caused 600+ ingest-image Lambdas to spin up in response and created all sorts of downstream trouble.

30 is arbitrary, so I may play around with it a bit to make sure the image-record writing & SQS queue population happen fast enough that the upload progress indicator stays relatively accurate.
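For posterity, the change amounts to a reservedConcurrency on that function entry; the function name, handler path, and bucket name below are illustrative:

```yaml
# Sketch; function name, handler path, and bucket name are illustrative.
functions:
  ingestImage:
    handler: src/ingest/handler.image    # assumed
    reservedConcurrency: 30              # caps concurrent ingest-image executions
    events:
      - s3:
          bucket: animl-ingestion-bucket # illustrative
          event: s3:ObjectCreated:*
```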

nathanielrindlaub commented 1 year ago

For now the whole pipeline seems to be working smoothly, so closing this out.