tnc-ca-geo / animl-ml

Machine Learning resources for camera trap data processing

Megadetector endpoint OOM error with fully reproduced model #104

Open rbavery opened 1 year ago

rbavery commented 1 year ago

The Docker container from https://github.com/tnc-ca-geo/animl-ml/pull/98 successfully reproduces MDv5a detections locally. However, when it is deployed as a serverless endpoint on SageMaker, initial requests hang for many minutes and then produce an OOM error.

context

We are working with YOLOv5, a standard, widely used object detection model that has a relatively low memory footprint and fast inference (a few seconds per image): https://github.com/ultralytics/yolov5
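For reference, this is roughly how we exercise the model locally: a minimal sketch, assuming the ultralytics/yolov5 torch.hub entry point and a local weights file; the weight filename and test image below are placeholders, not taken from PR #98.

```python
# Minimal local smoke test -- not the PR #98 handler code.
import time
import torch

# Load MegaDetector-style weights through the yolov5 codebase (path is a placeholder).
model = torch.hub.load("ultralytics/yolov5", "custom", path="md_v5a.0.0.pt")

start = time.time()
results = model("sample_camera_trap_image.jpg")  # AutoShape handles resize + NMS
print(results.pandas().xyxy[0])                  # boxes, confidences, classes
print(f"inference took {time.time() - start:.1f}s")
```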

We are deploying it with a custom container built from a TorchServe base image; the build copies a SageMaker entrypoint script and config into the container. That work is in https://github.com/tnc-ca-geo/animl-ml/pull/98
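For anyone unfamiliar with this pattern, the entrypoint in such a container is typically a small script that launches TorchServe against a baked-in config and model archive, roughly like the sketch below. The paths and model archive name are illustrative assumptions, not a copy of what PR #98 ships.

```python
# Illustrative SageMaker "serve" entrypoint -- not the actual script in PR #98.
import subprocess
import time

TS_CONFIG = "/home/model-server/config.properties"  # assumed path
MODEL_STORE = "/home/model-server/model-store"      # assumed path

def main():
    # torchserve --start daemonizes, so we block afterwards to keep the container alive.
    subprocess.check_call([
        "torchserve", "--start",
        "--ts-config", TS_CONFIG,
        "--model-store", MODEL_STORE,
        "--models", "mdv5=mdv5.mar",                # hypothetical model archive name
    ])
    while True:
        time.sleep(60)

if __name__ == "__main__":
    main()
```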

In the past we successfully deployed this model to SageMaker Serverless Inference without a preprocessing step to resize the image, using this commit: https://github.com/tnc-ca-geo/animl-ml/pull/98/commits/9c0ec844516e9c1655480eac9de0cbccd0186568

However, the new model with these two changes now triggers the SageMaker memory limit error, even with an endpoint config set to the 6 GB maximum. The previous working deployment ran fine on a 4 GB serverless endpoint. Locally, I've confirmed the container is limited to under 6 GB and never uses more than that, so I suspect SageMaker is not returning the correct error. docker stats confirms this:

→ docker stats --no-stream

CONTAINER ID   NAME             CPU %     MEM USAGE / LIMIT     MEM %     NET I/O          BLOCK I/O       PIDS

9b2035d8d906   confident_saha   1.07%     5.175GiB / 5.807GiB   89.12%    208kB / 28.6kB   352MB / 922kB   62

One final data point: the endpoint accepts requests and returns good inferences when I change the concurrency to 1, but only after about 30 seconds of inference time. That is much longer than it takes to spin up TorchServe, load the model, and run inference locally (about 7 seconds to start the server and load the model, plus 7 seconds of inference on my Mac). So lowering the concurrency helps, but a concurrency of 5 wasn't an issue for the old YOLOv5 deployment without the resize step, so it doesn't seem to be the root of the issue.
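For context, the memory and concurrency knobs discussed here live in the serverless endpoint config. A hedged boto3 sketch of how they are set (resource names below are placeholders, not our actual config or model names):

```python
# Sketch of the serverless endpoint config knobs -- names are placeholders.
import boto3

sm = boto3.client("sagemaker", region_name="us-west-2")

sm.create_endpoint_config(
    EndpointConfigName="mdv5a-serverless-config",  # hypothetical config name
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "mdv5a-model",                # hypothetical model name
        "ServerlessConfig": {
            "MemorySizeInMB": 6144,  # 6 GB is the serverless maximum
            "MaxConcurrency": 1,     # the old deployment worked fine at 5
        },
    }],
)
```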

things we changed in the new deployment

  1. We fully reproduced the preprocessing (resize) step applied to each image, which has a negligible memory footprint.
  2. We changed how the model is loaded: we now load the raw weight file through the yolov5 code instead of a portable TorchScript file. This also has a negligible memory footprint. (A rough sketch of both changes follows this list.)
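Roughly, the two changes look like the sketch below. The letterbox() helper is a simplified re-implementation of a YOLOv5-style aspect-preserving resize, not a copy of the handler in PR #98, and the 1280 px input size and weight filename are assumptions.

```python
# Simplified illustration of the two changes -- not the actual PR #98 handler.
import cv2
import numpy as np
import torch

def letterbox(img: np.ndarray, new_shape: int = 1280, color=(114, 114, 114)) -> np.ndarray:
    """Resize while preserving aspect ratio, padding the remainder with gray."""
    h, w = img.shape[:2]
    r = min(new_shape / h, new_shape / w)              # scale factor
    nh, nw = int(round(h * r)), int(round(w * r))
    resized = cv2.resize(img, (nw, nh), interpolation=cv2.INTER_LINEAR)
    out = np.full((new_shape, new_shape, 3), color, dtype=resized.dtype)
    top, left = (new_shape - nh) // 2, (new_shape - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out

# Change 2: load the raw weight file through the yolov5 code instead of a
# portable TorchScript export (the filename is an assumption).
model = torch.hub.load("ultralytics/yolov5", "custom", path="md_v5a.0.0.pt")

rgb = cv2.cvtColor(cv2.imread("sample.jpg"), cv2.COLOR_BGR2RGB)
preds = model(letterbox(rgb), size=1280)
print(preds.pandas().xyxy[0])
```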

Once we added these steps and tested the container locally, inference works as intended, similar to the old deployment but with more accurate results.

Error logs

Here is the failing endpoint and some sample logs from when I sent a request:

https://us-west-2.console.aws.amazon.com/cloudwatch/home?region=us-west-2#logsV2:log-groups/log-group/$252Faws$252Fsagemaker$252FEndpoints$252Fmdv5a-letterbox/log-events/AllTraffic$252Fd9a382cf1da9ba7cafdfe4d5f96e3374-2445ef8648874902bd52bd2d1bfa603c

And the error:

2023-02-07T00:58:35,236 [ERROR] W-9001-mdv5_1.0.0 org.pytorch.serve.wlm.WorkerThread - Backend worker error

org.pytorch.serve.wlm.WorkerInitializationException: Backend worker startup time out.

On the client side, this is the response after the first request; subsequent requests report an out-of-memory error:

{'EndpointName': 'megadetectorv5-torchserve-serverless-prod',
 'EndpointArn': 'arn:aws:sagemaker:us-west-2:830244800171:endpoint/megadetectorv5-torchserve-serverless-prod',
 'EndpointConfigName': 'megadetectorv5-torchserve-serverless-config-prod',
 'EndpointStatus': 'Failed',
 'FailureReason': 'Request to service failed. If failure persists after retry, contact customer support.',
 'CreationTime': datetime.datetime(2023, 2, 2, 21, 9, 14, 703000, tzinfo=tzlocal()),
 'LastModifiedTime': datetime.datetime(2023, 2, 2, 21, 12, 12, 654000, tzinfo=tzlocal()),
 'ResponseMetadata': {'RequestId': '3d2ca208-2ef7-4340-aa17-3c232fb3dc98',
  'HTTPStatusCode': 200,
  'HTTPHeaders': {'x-amzn-requestid': '3d2ca208-2ef7-4340-aa17-3c232fb3dc98',
   'content-type': 'application/x-amz-json-1.1',
   'content-length': '438',
   'date': 'Thu, 02 Feb 2023 21:12:18 GMT'},
  'RetryAttempts': 0}}
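For completeness, the request and the status check above are made with boto3 along these lines; the content type and payload format are assumptions, and only the endpoint name is taken from the response above.

```python
# How a request is sent and how the endpoint status above is retrieved -- sketch only.
import boto3

ENDPOINT = "megadetectorv5-torchserve-serverless-prod"

runtime = boto3.client("sagemaker-runtime", region_name="us-west-2")
with open("sample.jpg", "rb") as f:
    resp = runtime.invoke_endpoint(
        EndpointName=ENDPOINT,
        ContentType="application/x-image",  # assumed content type
        Body=f.read(),
    )
print(resp["Body"].read())

# The status dict shown above looks like a DescribeEndpoint response:
sm = boto3.client("sagemaker", region_name="us-west-2")
print(sm.describe_endpoint(EndpointName=ENDPOINT))
```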
nathanielrindlaub commented 1 year ago

@rbavery - when you have a moment do you mind adding your final thoughts and a description of the resolution to this issue and closing it out?