harris-chris / Jot.jl

Streamlines the creation and management of AWS Lambda functions written in Julia
MIT License
41 stars 6 forks

Long execution times #37

Closed sklink closed 1 year ago

sklink commented 1 year ago

We're seeing initial execution times of 60s, while subsequent calls are 20ms. While we had thought this was a Lambda cold start issue, we've set up provisioned concurrency and confirmed with support that provisioned instances are being used:

"...you can see that the latency is equally long when it is provisioned concurrent invocation. Therefore ... the provisioned concurrency is correctly set."

They seem to suggest that there are two separate sections: Lambda handler and core logic.

"We recommend that you separate the code inside the lambda handler so that these long-time logic can be executed in the init phase."

Here's the document they referenced: https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html#function-code

I still don't understand why we would be seeing subsequent requests coming in at 20ms. Seems like something is being cached and reused. I would assume that having provisioned instances would already have the initialization code ready, but maybe that switch to core logic is required for that step. Any thoughts?

Some more context:

Again, the initial phase of the function does not exist during provisioned concurrency because the initial phase is set in advance. However, if the logic that takes a long time does not run during init duration, it cannot be affected by provisioned concurrency setting.

To sum up, you need to save time by separating the time-consuming initialization logic outside the Lambda handler.
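The separation support is describing can be sketched like this (a toy illustration only; the names are made up and this is not Jot's API). Anything at top level runs during the init phase, which provisioned concurrency executes ahead of time; only the handler body runs per request:

```julia
# Init phase: stand-in for slow, one-off setup work (loading models,
# opening connections, etc.). Runs once, before any request arrives.
const CACHE = Dict(i => i^2 for i in 1:100)

# Handler: fast per-request path that only reads the prepared state.
handler(event) = get(CACHE, event, nothing)

println(handler(7))   # prints 49
```

If the expensive setup lived inside `handler` instead, provisioned concurrency could not pre-pay for it, which matches what support is saying.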

harris-chris commented 1 year ago

Yes, this is quite confusing. The idea with Jot, which seemed like a good idea when I wrote it, is to try to spread the computation work intelligently between the three steps below:

  1. Image build time: pack as much of the work as possible into this stage (which only runs once, and entirely outside of AWS). This includes instantiating and (if package_compile is requested) compiling. Within the Dockerfile, an init.jl script gets run that does this work.
  2. Function instantiation time: when the lambda function is instantiated, the HTTP server that handles incoming function calls starts. The precompiled/cached data that Julia needs should already exist in the DEPOT_PATH, so this should be quick.
  3. Function call time: when a function call arrives, the actual responder function is called. As with the HTTP server, this should be cached/precompiled, so again this should be quick.
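Stage 3's speed depends on stages 1-2 having already compiled everything. The caching effect itself is easy to demonstrate in plain Julia (nothing here is Jot-specific):

```julia
# The first call to a method pays a JIT-compilation cost; subsequent calls
# with the same argument types reuse the compiled code. Jot's aim is to
# move that first-call cost into the image build.
respond(x) = sum(abs2, x)   # stand-in responder

t_first  = @elapsed respond(rand(10))   # includes compilation time
t_second = @elapsed respond(rand(10))   # compiled code is cached

println(t_first > t_second)
```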

When testing against a local docker image (entirely outside of AWS), function response times are much faster with package_compile set to true, so (outside of AWS) the logic above seems to be working.

It is hard to infer why AWS is taking so long in your situation. If provisioned concurrency is on, then only stage 3 of the stages above is being run, so it should be very quick.

Rather than trying to guess what's going on, I think it's probably easiest to add some good debug output to the code that gets called in stages 1-3 above, and then to add an option to run the function in this debug mode. I think I can do that in the next couple of days.

If I get some time today I'll check the times I'm getting running the image locally vs via Lambda, and with package_compile on vs off.

How often do you see that 60-second response time? Is it literally just the first time you run a newly-created function?

harris-chris commented 1 year ago

This is a bit harder than I'd presumed: the function log needs to be captured from CloudWatch, which is a little fiddly. This is more likely to take a week or two.

sklink commented 1 year ago

The long response times happen on every request, as long as they're spaced out. I'm not sure exactly how long they need to be spaced... subsequent responses (within the next minute or two) come through with the short response time.

We've set up a CRON to hit the function every minute which seems to be doing the trick to keep it alive for now.

Given those symptoms, my first guess was that it's not using provisioned concurrency. Since AWS says it is... I have to think the provisioned instances are being cycled, losing something cached within. Can't think of anything else that makes sense.

harris-chris commented 1 year ago

Like you say, it does sound as though provisioned concurrency isn't working as it should. The change I'm thinking of making is to the run_test function.

harris-chris commented 1 year ago

OK, I've added some features to allow better performance diagnostics. I've been looking at this a bit myself, and the time delay seems to come from Julia precompiling some functions at run-time. The next step is to think about how to prevent this - out of interest, do your functions tend to have any tests? One option would be to have Jot run the tests when building the docker image; this should help, at least as far as I understand the situation.
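Running the tests at build time would help because Julia compiles methods per concrete argument type and caches the result: any call (or explicit precompile request) made during the image build means the method is ready before the first real invocation. A toy sketch of the idea, not Jot's actual mechanism:

```julia
# Stand-in responder function
respond(x::Vector{Float64}) = sum(x) / length(x)

# Explicitly request compilation for the concrete signature we expect at
# run-time; in an image build this would happen inside the init script.
precompile(respond, (Vector{Float64},))

println(respond([1.0, 2.0, 3.0]))   # already compiled, so no first-call pause
```

Running the test suite at build time achieves the same thing implicitly, by actually calling the responder with realistic argument types.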

sklink commented 1 year ago

We do have basic tests for the functions mapped to Lambda, yes. There are some helper functions we've written within the package that do not have tests.

Would it help if I retry after adding tests for the helper functions as well?

sklink commented 1 year ago

@harris-chris, we're putting this higher priority for ourselves so I'm going to pick up the investigation next week. Is there a branch aside from main you have those performance diagnostics on?

Any suggestions you have about where to look or things to try would be helpful. Thanks!

harris-chris commented 1 year ago

Hello - I believe they're on the main branch at the moment (I'm away from my PC this weekend). I had a brief look at this (also reading through this: https://julialang.org/blog/2021/01/precompile_tutorial/) late last year. The diagnostics did show that a few functions I had expected to be compiled were not, so hopefully it's just a matter of improving the precompile requests in the code.

I can give this some time next week; as soon as I can, I'll share the process I was using to look at this, so at least we'll be starting on the same page.

sklink commented 1 year ago

That's really helpful. Thanks!

sklink commented 1 year ago

Since polling provisioned instances seems to take them from "warm" to "ready" (i.e. 1st request finishes compilation @ 60s and 2nd request is instant @ 20ms)... would it make sense to have the Lambda Docker container do a dry run of the function when it starts up?

I know this wouldn't be a true fix, but I'm thinking that this would ensure the provisioned instances are actually ready for requests as "provisioned instances" are supposed to be.

I was going to give that a shot but I can't figure out where I should add that in. If you think this might work, can you point me to the spot I need to add the call to the function?

harris-chris commented 1 year ago

If you have a look at the new performance-debugging branch, you should be able to see some scripts that I've created to get a better understanding of the performance issues. There are also new functions like show_observations (covered in the dev documentation set) which can be used to obtain greater detail. My initial thoughts are:

harris-chris commented 1 year ago

> Since polling provisioned instances seems to take them from "warm" to "ready" (i.e. 1st request finishes compilation @ 60s and 2nd request is instant @ 20ms)... would it make sense to have the Lambda Docker container do a dry run of the function when it starts up?
>
> I know this wouldn't be a true fix, but I'm thinking that this would ensure the provisioned instances are actually ready for requests as "provisioned instances" are supposed to be.
>
> I was going to give that a shot but I can't figure out where I should add that in. If you think this might work, can you point me to the spot I need to add the call to the function?

If I understand the situation correctly, there are two occasions when a container will start up:

  1. When being polled, if it's a provisioned instance, and
  2. When the lambda function is being called directly.

I can see how your suggestion helps with your use case, but most people will be in the second of these two categories, and having the function do a dry run on container start-up will slow things down for them.

If you'd like to go ahead and do it for your own fork, the place to put this would be within get_bootstrap_script, in src/Scripts.jl. It's a bit of a fiddly piece of code to edit because there are so many special character escapes. There's a test suite in ./test which you'll probably find useful if you're making any meaningful changes to the code.
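For reference, the dry run being discussed would amount to something like this (a sketch only; the real change would go into the script generated by get_bootstrap_script, and respond here is a stand-in for the actual responder):

```julia
respond(x) = string("echo: ", x)   # stand-in responder

# Warm-up: invoke the responder once at container start-up, so that
# JIT compilation happens before the first real request arrives.
respond("warmup")

# Subsequent (real) requests then hit already-compiled code:
println(respond("hello"))   # prints echo: hello
```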

harris-chris commented 1 year ago

Still working on this, haven't got that much clarity on the situation yet.

harris-chris commented 1 year ago

This has been quite tricky, but I think I now have a clear idea of how to get this working in a clean way. The system image that PackageCompiler uses doesn't need to be created within the Dockerfile build process, and attempting to create it there (which is what currently happens) is making debugging hard and generally causing confusion. I'm going to switch to creating the system image outside of the image build process, then include it in the docker image afterwards. I don't think this needs to cause any breaking change to the API. I'll try to get this up and running (if maybe not documented) in the next couple of weeks.
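The resulting Dockerfile fragment might look something like this (illustrative only; the actual file is generated by Jot, and the file names and paths here are assumptions):

```dockerfile
# Sysimage built on the host with PackageCompiler, outside the image build,
# then copied in as a ready-made artifact
COPY jot_sysimage.so /var/runtime/jot_sysimage.so

# Start Julia with the prebuilt sysimage so nothing recompiles at run-time
ENTRYPOINT ["julia", "-J/var/runtime/jot_sysimage.so", "/var/runtime/bootstrap.jl"]
```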

harris-chris commented 1 year ago

If you try the new commit on the main branch, it should be much faster. I've been doing some testing with timeouts on both PackageCompiler-ed and uncompiled lambda functions, and I've been getting from-cold responses lower than 500ms (down from 8000), and from-hot responses in 5 or 6ms. The time-out at which it goes from hot to cold seems to be strangely variable, sometimes about 5 minutes and sometimes up to three hours. I'm going to get some more results on that and then put a page in the README about it. The API has changed slightly: you will now need to also pass a FunctionTestData object to create_local_image when package_compile is true - this is just a simple data object that holds a test invocation argument and the expected response.

More documentation to follow on this.
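The description above amounts to something like this stand-in (the struct and field names here are guesses for illustration, not Jot's actual definition - check the docs for the real one):

```julia
# Stand-in mirroring the described FunctionTestData: an invocation
# argument plus the response it should produce.
struct FunctionTestDataSketch
    test_argument
    expected_response
end

td = FunctionTestDataSketch([1, 2, 3], [2, 4, 6])

# During the image build, the responder would be invoked with the test
# argument and the result compared against the expected response:
double_responder(xs) = 2 .* xs
println(double_responder(td.test_argument) == td.expected_response)   # prints true
```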

harris-chris commented 1 year ago

Closing this as it is resolved in v0.4.1, feel free to open this again if you're still having issues.

sklink commented 1 year ago

Thanks @harris-chris, missed the notifications here. We landed on a dedicated server for Julia as a way to get around the cold boot issues. Execution times were short enough that it should suffice for a while. I'll make a note to revisit this to transition back to Lambda. Thanks again!