Closed: sklink closed this 1 year ago
Yes, this is quite confusing. The idea with Jot, which seemed like a good idea when I wrote it, is to try to spread the computation work intelligently between the three steps below. If `package_compile` is requested, the compiling happens within the Dockerfile, where an init.jl script gets run that does this work. When testing against a local docker image (entirely outside of AWS), function response times are much faster with `package_compile` set to true, so (outside of AWS) the logic above seems to be working.
It is hard to infer why AWS is taking so long in your situation. If provisioned concurrency is on, then only stage 3 of the stages above is being run, so it should be very quick.
Rather than trying to guess what's going on, I think it's probably easiest to add some good debug output to the code that gets called in stages 1-3 above, and then to add an option to run the function in this debug mode. I think I can do that in the next couple of days.
If I get some time today I'll check the times I'm getting running the image locally vs via Lambda, and with `package_compile` on vs off.
How often do you see that 60-second response time? Is it literally just the first time you run a newly-created function?
On Tue, 22 Nov 2022, 01:30 Matt Doak, @.***> wrote:
We're seeing initial execution times of 60s, while subsequent calls are 20ms. While we had thought this was a Lambda cold start issue, we've set up provisioned concurrency and confirmed with support that provisioned instances are being used:
"...you can see that the latency is equally long when it is provisioned concurrent invocation. Therefore ... the provisioned concurrency is correctly set."
They seem to suggest that there are two separate sections: Lambda handler and core logic.
"We recommend that you separate the code inside the lambda handler so that these long-time logic can be executed in the init phase."
Here's the document they referenced:
https://docs.aws.amazon.com/lambda/latest/dg/best-practices.html#function-code
I still don't understand why we would be seeing subsequent requests coming in at 20ms. Seems like something is being cached and reused. I would assume that having provisioned instances would already have the initialization code ready, but maybe that switch to core logic is required for that step. Any thoughts?
Some more context:
"Again, the initial phase of the function does not exist during provisioned concurrency because the initial phase is set in advance. However, if the logic that takes a long time does not run during the init duration, it cannot be affected by the provisioned concurrency setting. To sum up, you need to save time by separating the time-consuming initialization logic outside the Lambda handler."
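The separation AWS support is recommending can be sketched like this (generic Julia, not Jot's actual generated code; all names here are illustrative):

```julia
# Expensive setup at top level: in Lambda this runs once per execution
# environment, during the init phase, which provisioned concurrency
# performs ahead of time.
const HEAVY_STATE = let
    # stands in for loading data or triggering Julia compilation
    Dict("ready" => true)
end

# Per-invocation work stays cheap; HEAVY_STATE already exists by the
# time the first event arrives.
function handler(event, context)
    Dict("ready" => HEAVY_STATE["ready"], "echo" => event)
end
```

The key point is that only code outside the handler body benefits from provisioned concurrency; anything inside the handler still runs on every invocation.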
— Reply to this email directly, view it on GitHub https://github.com/harris-chris/Jot.jl/issues/37, or unsubscribe https://github.com/notifications/unsubscribe-auth/ALDMVSHHHHAKNLCRHPR4MWLWJOPRFANCNFSM6AAAAAASG2QPMY . You are receiving this because you are subscribed to this thread.Message ID: @.***>
This is a bit harder than I'd presumed: the function log needs to be captured from CloudWatch, which is a tiny bit fiddly. This is more likely to take a week or two.
The long response times happen on every request, as long as they're spaced out. I'm not sure exactly how long they need to be spaced... subsequent responses (within the next minute or two) come through with the short response time.
We've set up a CRON to hit the function every minute which seems to be doing the trick to keep it alive for now.
Given those symptoms, my first guess was that it's not using provisioned concurrency. Since AWS says it is... I have to think the provisioned instances are being cycled, losing something cached within. Can't think of anything else that makes sense.
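For reference, the every-minute keep-warm ping described above can also be expressed declaratively as an EventBridge schedule. A hedged CloudFormation-style sketch (the resource and function names are made up; many people set this up via the console or CLI instead):

```yaml
Resources:
  KeepWarmSchedule:
    Type: AWS::Events::Rule
    Properties:
      # fire once a minute to keep execution environments warm
      ScheduleExpression: rate(1 minute)
      Targets:
        - Arn: !GetAtt JuliaLambda.Arn   # hypothetical function resource
          Id: keep-warm
```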
Like you say, it does sound as though provisioned concurrency isn't working as it should. The change I'm thinking of making is to the `run_test` function.
OK, I've added some features to allow better performance diagnostics. I've been looking at this a bit myself, and the time delay seems to come from Julia precompiling some functions at run-time. The next step is to think about how to prevent this - out of interest, do your functions tend to have any tests? One option would be to have Jot run the tests when building the docker image; this should help, at least so far as I understand the situation.
We do have basic tests for the functions mapped to Lambda, yes. There are some helper functions we've written within the package that do not have tests.
Would it help if I retry after adding tests for the helper functions as well?
@harris-chris, we're putting this higher priority for ourselves so I'm going to pick up the investigation next week. Is there a branch aside from main you have those performance diagnostics on?
Any suggestions you have about where to look or things to try would be helpful. Thanks!
Hello - I believe they're on the main branch at the moment (I'm away from my PC this weekend). I had a brief look at this (also reading through this: https://julialang.org/blog/2021/01/precompile_tutorial/) late last year. The diagnostics did show that a few functions I had expected to be compiled were not, so hopefully it's just a matter of improving the precompile requests in the code.
I can give this some time next week, as soon as I can I'll share the process I was using to look at this so at least we'll be starting on the same page.
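For anyone following along, the precompile mechanism from the linked tutorial looks roughly like this (generic Julia, not Jot's actual code; the package and handler names are made up):

```julia
module MyLambdaPkg

handler(event::Dict{String,Int}) = Dict("result" => get(event, "x", 0) + 1)

# Ask Julia to compile this exact method signature when the package is
# precompiled, instead of on the first live invocation. A signature that
# never gets a precompile request here is compiled lazily at run-time,
# which is the delay being discussed in this issue.
precompile(handler, (Dict{String,Int},))

end
```

Note that `precompile` only helps for the concrete argument types listed; a call with different types still compiles at run-time.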
That's really helpful. Thanks!
Since polling provisioned instances seems to take them from "warm" to "ready" (i.e. 1st request finishes compilation @ 60s and 2nd request is instant @ 20ms)... would it make sense to have the Lambda Docker container do a dry run of the function when it starts up?
I know this wouldn't be a true fix, but I'm thinking that this would ensure the provisioned instances are actually ready for requests as "provisioned instances" are supposed to be.
I was going to give that a shot but I can't figure out where I should add that in. If you think this might work, can you point me to the spot where I need to add the call to the function?
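The dry-run idea could look something like this inside the container's startup code (a minimal sketch, not Jot's actual bootstrap; the handler and warm-up payload are illustrative):

```julia
# The first call to a Julia function with a given argument type pays the
# JIT-compilation cost. Calling the handler once with a canned event at
# startup moves that cost into the init phase, before real requests land.
function handler(event, context)
    Dict("ok" => true, "echo" => event)
end

const WARMUP_EVENT = Dict("warmup" => true)  # illustrative payload

# Run before the container starts polling for real invocations.
handler(WARMUP_EVENT, nothing)
```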
If you have a look at the new performance-debugging branch, you should be able to see some scripts that I've created to get a better understanding of the performance issues. There are also new functions like `show_observations` (these are covered in the dev documentation set) which can be used to obtain greater detail. My initial thoughts are:
If I understand the situation correctly, there are two occasions on which a container will start up:
If you'd like to go ahead and do this on your own fork, the place to put it would be within `get_bootstrap_script`, in src/Scripts.jl. It's a bit of a fiddly piece of code to edit because there are so many special character escapes. There's a test suite in ./test which you'll probably find useful if you're making any meaningful changes to the code.
Still working on this, haven't got that much clarity on the situation yet.
This has been quite tricky, but I think I now have a clear idea of how to get this working in a clean way. The system image that PackageCompiler uses doesn't need to be created within the Dockerfile build process, and attempting to create it there (which is what currently happens) is making debugging hard and generally causing confusion. I'm going to switch to creating the system image outside of the image build process, then include it in the docker image afterwards. I don't think this needs to cause any breaking change to the API. I'll try to get this up and running (if maybe not documented) in the next couple of weeks.
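A sketch of that proposed flow, with hypothetical file names (PackageCompiler's `create_sysimage` function and Julia's `-J`/`--sysimage` flag are real; everything else is illustrative): build the sysimage on the host, e.g. `julia -e 'using PackageCompiler; create_sysimage([:MyLambdaPkg]; sysimage_path="sysimage.so")'`, then copy the result into the image:

```dockerfile
# Hypothetical Dockerfile fragment: the sysimage is built outside the
# docker build, then copied in and loaded at startup with -J, so the
# image build itself no longer runs PackageCompiler.
COPY sysimage.so /var/runtime/sysimage.so
ENTRYPOINT ["julia", "-J/var/runtime/sysimage.so", "startup.jl"]
```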
If you try the new commit on the main branch, it should be much faster - I've been doing some testing with timeouts on `PackageCompiler`-ed vs uncompiled lambda functions, and I've been getting from-cold responses lower than 500ms (down from 8000), and from-hot responses in 5 or 6ms. The time-out at which it goes from hot to cold seems to be strangely variable, sometimes about 5 minutes and sometimes up to three hours. I'm going to get some more results on that and then put a page in the README about it. The API has slightly changed: you will now need to also pass a FunctionTestData object to `create_local_image` when `package_compile` is `true` - this is just a simple data object that has a test invocation argument and the expected response. More documentation to follow on this.
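Presumably usage looks something like the following - this is a guess at the call shape, since the documentation isn't written yet; the argument order and keyword names are assumptions, so check the docstrings before relying on it:

```julia
# Hypothetical usage of the new API; FunctionTestData, create_local_image
# and package_compile come from the discussion above, but the exact
# argument names here are guessed.
test_data = FunctionTestData(
    Dict("x" => 1),        # test invocation argument
    Dict("result" => 2),   # expected response
)
create_local_image(responder; package_compile=true, function_test_data=test_data)
```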
Closing this as it is resolved in v0.4.1, feel free to open this again if you're still having issues.
Thanks @harris-chris, missed the notifications here. We landed on a dedicated server for Julia as a way to get around the cold boot issues. Execution times were short enough that it should suffice for a while. I'll make a note to revisit this to transition back to Lambda. Thanks again!