jeremydaly / lambda-warmer

A module to optimize AWS Lambda function cold starts
MIT License

Unclear meaning of delay parameter #59

Closed. why-el closed this issue 10 months ago

why-el commented 10 months ago

Hi there,

Great library. I am dealing with a lambda whose cold starts are long, and as a stop-gap I am using this library. I set it up with a concurrency of 4, with no other changes. EventBridge invokes the lambda every 5 minutes with the correct payload. However, I am only seeing 2 concurrent executions of the lambda (locally, so there are no other invocations besides what EventBridge triggers). I am confused as to how it is only 2, and not 4.

Following that, I am pretty sure I am misunderstanding the meaning of the delay parameter, which I have not changed. This is a really slow lambda; its cold start is around 11 seconds or so, even before reaching the short-circuit code. So logically it must be invoked four times in parallel, but the ConcurrentExecutions metric in CloudWatch never reports that. Any ideas?
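
For reference, my setup is essentially the stock pattern from the README; the short-circuit call and payload keys below are as documented there, and the handler body is just a placeholder:

```js
// handler.js: stock lambda-warmer pattern from the README
// The EventBridge rule fires every 5 minutes with a payload like:
//   { "warmer": true, "concurrency": 4 }
const warmer = require('lambda-warmer')

exports.handler = async (event) => {
  // short-circuit: on a warming ping, fan out to the configured
  // concurrency and return early before the slow handler path runs
  if (await warmer(event)) return 'warmed'

  // ...the real (and slow) work goes here...
  return 'handled'
}
```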

why-el commented 10 months ago

Any ideas here, folks? Not really sure who to tag, so I'll refrain from doing so. :)

jeremydaly commented 10 months ago

Hi @why-el, I recommend people DON'T USE this library anymore as there are better ways to mitigate cold starts.

That being said, the delay setting tells the Lambda function to block the event loop so that the other warming invocations are forced onto additional concurrent executions instead of reusing the same container. If your warmer payload is correct, then the function invokes itself. This is likely why you're seeing the 2 concurrent invocations: the EventBridge trigger works, and the function invokes itself once more. It should try to invoke itself the 2 additional times needed to reach 4, and if it isn't, that might be due to the long cold start you mention.

If it takes 11 seconds to cold start a new function before it even reaches the short circuit code, then that might be causing the issue.
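
To make that concrete, the warming branch in each invoked copy can be pictured as nothing more than holding its own invocation open for delay milliseconds; this is an illustrative sketch, not the library's actual source:

```js
// Illustrative sketch only, not lambda-warmer's actual implementation.
// In an invoked warming copy, the idea is simply to keep this invocation
// busy for `delay` ms so Lambda can't hand the same container to one of
// the sibling warming invokes.
async function holdContainerOpen(delay = 75) { // 75 ms is the documented default
  await new Promise((resolve) => setTimeout(resolve, delay))
}
```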

why-el commented 10 months ago

Right, this library is a short-term stop-gap while we work to undo the slow starts rampant in this lambda. It actually does its job fairly well. We are working toward provisioned concurrency, but some of the code is pretty old and we are running into difficulties (for instance, the lack of top-level await in pre-handler code).

If it takes 11 seconds to cold start a new function before it even reaches the short circuit code, then that might be causing the issue.

I am not understanding this part. The library does invoke itself and block the event loop, but if the setting is four, it should invoke itself 3 additional times, and therefore the concurrent executions observed should be at least 3. Is it correct to say that:

If concurrency is 4, then regardless of the cold start, it should warm up 4 lambdas? Plotting the concurrent executions shows only about 2.5, but that might not be this library's fault but rather a lack of granularity from AWS (it shows 2.5 even with a 1-second window).

Separately, I still don't understand how delay plays into this. I suspect delay and the cold start duration might be interlinked, but it's not clear how. Please correct me if I am wrong.

jeremydaly commented 10 months ago

I haven't used this library in a while, so it's only a guess that 11-second cold starts could have something to do with it. The delay blocks the event loop on newly started functions. The idea is that if you are trying to achieve a concurrency of 50, the warming function might take more than a few milliseconds to loop through and invoke all those other functions. If it takes 500 milliseconds to loop through and send an invoke to 50 functions, a default delay of 75 ms would mean that the first few warmed invocations would complete and their containers would be reused by the subsequent warming requests.
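
To sketch that arithmetic (illustrative only; the real fan-out and payload fields in lambda-warmer differ from this):

```js
// Illustrative fan-out sketch, not lambda-warmer's actual code.
// The point: dispatching the invokes takes time, while each warmed copy
// only stays busy for `delay` ms, so a short delay can let the earliest
// containers finish and be reused by the invokes sent near the end.
const { LambdaClient, InvokeCommand } = require('@aws-sdk/client-lambda')
const lambda = new LambdaClient({})

const concurrency = 50
const delay = 75 // ms each warmed copy keeps its invocation open

async function fanOut() {
  for (let i = 0; i < concurrency - 1; i++) {
    // if sending ~49 invokes takes ~500 ms in total, the copies invoked
    // first finished their 75 ms wait long ago, so their containers are
    // free to absorb the later invokes instead of new containers warming
    await lambda.send(new InvokeCommand({
      FunctionName: process.env.AWS_LAMBDA_FUNCTION_NAME, // invoke self
      InvocationType: 'Event', // fire-and-forget
      Payload: Buffer.from(JSON.stringify({ warmer: true, concurrency: 1 }))
    }))
  }
}
```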

The library does spit out some logs. What are those telling you?

why-el commented 10 months ago

Ah I see, so the delay is only there to give AWS time to actually invoke all the functions. And for 4 functions, that should be very fast, so 75 ms is plenty. 👍 It's interesting that this does not match up with the reported AWS ConcurrentExecutions metric, but that's because the fan-out is a loop, and a loop plus a network call per invoke involves enough milliseconds to lower that metric slightly.

And actually I already studied the logs and they confirm correct behavior in my local AWS environment. My initial confusion was that perhaps delay plays more of a role in production because of the different invocation patterns, but I think not. Thanks @jeremydaly, super helpful.

naorpeled commented 10 months ago

@why-el I'll be marking this issue as completed. Feel free to ping me if you want me to re-open it or if you have any additional questions.