tnc-ca-geo / animl-ml

Machine Learning resources for camera trap data processing

Run serverless endpoint batch test and record cost and time results #99

Closed: rbavery closed this issue 1 year ago

rbavery commented 1 year ago

User story

We need to understand the cost of running inference with the current architecture on large archives (25 GB) of imagery, both in terms of time (does it take a week with retries? Two days?) and in terms of cost for the serverless MDv5 endpoint that auto-scales with requests. For this first run, we won't include the Mira endpoints in the test.

We will run this test on duplicated images that match the real-world ratio of animal to no-animal images: ~60% are empty. All are JPEGs.

Secondarily, we'd like to understand:

Things we need to run the test:

Resolution Criteria

For 25 GB of random imagery, where sample images will be close to 1280x1280 (Natty will pick a representative range), how long does autoscale inference take for MDv5?

What was the cost per image? Did this vary throughout the job due to retries?

Were there any failures not resolved by retries?
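Before the real test, a back-of-the-envelope estimate can frame what to expect. The sketch below is not a measurement: the average file size, per-image latency, endpoint memory configuration, and per-GB-second rate are all assumptions to be replaced with real numbers, and the serial time is an upper bound since the endpoint auto-scales.

```python
# Rough estimator for the 25 GB batch test. Every constant here is an
# assumption/placeholder, not a figure from this issue.

ARCHIVE_BYTES = 25 * 1024**3      # 25 GB archive of JPEGs
AVG_IMAGE_BYTES = 500 * 1024      # assumed ~500 KB per ~1280x1280 JPEG
SECONDS_PER_IMAGE = 9.0           # assumed per-image serverless latency
MEMORY_GB = 6                     # hypothetical endpoint memory configuration
USD_PER_GB_SECOND = 0.000020      # placeholder; check the SageMaker pricing page

n_images = ARCHIVE_BYTES // AVG_IMAGE_BYTES
total_seconds = n_images * SECONDS_PER_IMAGE
cost = total_seconds * MEMORY_GB * USD_PER_GB_SECOND

print(f"~{n_images:,} images")
# Serial time is an upper bound; the endpoint auto-scales across requests.
print(f"~{total_seconds / 86400:.1f} days of serial inference")
print(f"~${cost:,.2f} at the assumed rate (excludes retries and data transfer)")
```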

rbavery commented 1 year ago

Additional meeting notes:

- Testing the impact of automation rules and adding the Mira models during the test would show how multiple round trips to and writes against the DB affect cost.
- We got the green light from MongoDB to get a dedicated instance for MongoDB Atlas, which should improve DB performance.
- The preference is to reduce cost rather than inference time; Kinesis Firehose or DynamoDB would be more time-performant datastores.
- We agreed to run without Atlas next week and to assess later whether we need to rerun the test with Atlas.

rbavery commented 1 year ago

With letterboxing and the fully reproduced YOLOv5, we get average inference times of 9 seconds per image on SageMaker Serverless, which only supports CPU.

```
Initialization time (model loading): 8.54904127120971
Preprocess time (letterbox): 0.032587528228759766
Inference time (model running on image of a given size): 8.05414867401123
Postprocessing time (NMS): 0.02366042137145996
```
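For context on what the "Preprocess time (letterbox)" step is doing, here is a simplified sketch of YOLOv5-style letterboxing (aspect-preserving resize plus gray padding). It omits details of the actual YOLOv5 implementation, such as stride alignment and the scale-up/scale-down options:

```python
import cv2
import numpy as np

def letterbox(img: np.ndarray, new_shape: int = 1280,
              color=(114, 114, 114)) -> np.ndarray:
    """Resize to fit within new_shape x new_shape while preserving aspect
    ratio, then pad the borders with gray, YOLOv5-letterbox style."""
    h, w = img.shape[:2]
    r = min(new_shape / h, new_shape / w)           # uniform scale factor
    new_w, new_h = int(round(w * r)), int(round(h * r))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    dw, dh = new_shape - new_w, new_shape - new_h   # total padding needed
    top, bottom = dh // 2, dh - dh // 2
    left, right = dw // 2, dw - dw // 2
    return cv2.copyMakeBorder(resized, top, bottom, left, right,
                              cv2.BORDER_CONSTANT, value=color)
```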

When we ran the equivalent test last year, we were testing fixed resizing to 640x640 with a TorchScript model compiled for the CPU, and inference time was closer to 2.5 seconds per image: https://docs.google.com/spreadsheets/d/17t-zgKwWdVSArf7mgu4QJXOvtGVIlcUYTnwYEpNZQsU/edit#gid=0. That slowdown is roughly consistent with the resolution change: 1280x1280 has 4x the pixels of 640x640, and 8.05 s vs. 2.5 s is about a 3.2x increase.

We'll be exploring how to reduce inference time while preserving reproduced accuracy: https://github.com/tnc-ca-geo/animl-ml/issues/106

nathanielrindlaub commented 1 year ago

@rbavery deployed the ONNX MDv5 (PR here) to a SageMaker Serverless endpoint, and it looks like per-image inference is around 3.5-4 seconds.
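For anyone reproducing the per-image timing, invoking a SageMaker Serverless endpoint from Python looks roughly like the following. This is a sketch only: the endpoint name, content type, and response format below are hypothetical stand-ins, not the actual animl-ml deployment's contract.

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

with open("sample.jpg", "rb") as f:
    image_bytes = f.read()

response = runtime.invoke_endpoint(
    EndpointName="mdv5-serverless",     # hypothetical endpoint name
    ContentType="application/x-image",  # assumed content type
    Body=image_bytes,
)
detections = response["Body"].read()    # model-specific payload (e.g. JSON)
print(detections)
```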

The entire processing time for a test batch of 10,168 images was 11 hrs, 8 mins (3.9 seconds per image).

So 1,000 images take roughly an hour to process, and 100k would take about 4.5 days.

Not bad for now! Down the road we'll explore speeding this up by taking advantage of concurrent processing: running two separate Serverless endpoints for MegaDetector (one for real-time inference needs and one for batch) and ditching the FIFO queues for standard SQS queues (see the sketch below).
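A minimal sketch of what the batch side of that could look like: many workers polling a standard (non-FIFO) SQS queue concurrently, which standard queues allow without FIFO's ordering and throughput constraints. The queue URL and message shape here are hypothetical.

```python
import boto3

sqs = boto3.client("sqs")
# Hypothetical queue URL for the batch path.
QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/mdv5-batch"

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,   # receive in batches to cut per-call overhead
        WaitTimeSeconds=20,       # long polling to avoid busy-waiting
    )
    messages = resp.get("Messages", [])
    if not messages:
        break                     # queue drained (a real worker would keep polling)
    for msg in messages:
        image_key = msg["Body"]   # e.g., an S3 key pointing at one image
        # ... fetch the image and invoke the batch MDv5 endpoint here ...
        sqs.delete_message(QueueUrl=QUEUE_URL,
                           ReceiptHandle=msg["ReceiptHandle"])
```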

There are also endpoint- and model-level optimizations we could explore (#112).