milliHQ / terraform-aws-next-js

Terraform module for building and deploying Next.js apps to AWS. Supports SSR (Lambda), Static (S3) and API (Lambda) pages.
https://registry.terraform.io/modules/milliHQ/next-js/aws
Apache License 2.0

Add Support for Provisioned Concurrency #248

Open · curtis-trynow-io opened this issue 2 years ago

curtis-trynow-io commented 2 years ago

This is another thing that would be very nice to have available for latency-sensitive applications, or just applications whose users do not tolerate cold starts well. On the Terraform side of things, this is normally accomplished with:

  1. https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_function with publish set to true, so that each deployment publishes a new numbered version.
  2. (optional) https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_alias defining an alias that is updated on each deploy and gives the provisioned concurrency configuration a stable target.
  3. https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/lambda_provisioned_concurrency_config which puts the provisioned concurrency into effect, given a function name, a version or alias qualifier, and the number of warm copies of the code you want.

You also need to use the qualified ARN for API Gateway routing so that requests land on the version (or alias) that carries the provisioned concurrency setup.
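For illustration, a minimal Terraform sketch of that wiring (resource names and values here are hypothetical, not something this module provides today):

    resource "aws_lambda_function" "page" {
      function_name = "my-next-page"               # hypothetical name
      filename      = "lambda.zip"
      handler       = "index.handler"
      runtime       = "nodejs14.x"
      role          = aws_iam_role.lambda_exec.arn # assumes an existing execution role
      publish       = true                         # publish a new numbered version on every apply
    }

    # Optional: a stable alias that always tracks the latest published version.
    resource "aws_lambda_alias" "live" {
      name             = "live"
      function_name    = aws_lambda_function.page.function_name
      function_version = aws_lambda_function.page.version
    }

    # Keep N execution environments initialized for the alias.
    resource "aws_lambda_provisioned_concurrency_config" "page" {
      function_name                     = aws_lambda_function.page.function_name
      qualifier                         = aws_lambda_alias.live.name
      provisioned_concurrent_executions = 2
    }

API Gateway integrations would then point at aws_lambda_alias.live.invoke_arn (the qualified ARN) instead of the unqualified function ARN; otherwise requests keep landing on $LATEST.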

Thinking this through, I think this could be configured at the top level with one or two variables (potentially split between page and API Lambda functions): if the value is not null, define the PC config and put it in place. That should be the gist of it.

For reference, if I try to do this now, Terraform puts new code under $LATEST as expected and reverts the routes to point at $LATEST. This does not delete the PC config, but it renders it useless. I have not tried versioning and re-pointing the routes by hand after each deployment as a workaround.

ofhouse commented 2 years ago

Hi, I personally don't see the need for provisioned concurrency from a user perspective:

  1. Usage of provisioned concurrency would mostly benefit sites with low traffic.
    Sites with high traffic have a much higher chance of hitting a warm Lambda. So you would basically pay a very high price (provisioned concurrency is really expensive) to serve a low-traffic site.
    That's exactly the opposite of the selling point for serverless applications: You can use it for low traffic sites and don't have to pay the bills for an idling server.

  2. In general the average cold start time is between 200ms and 400ms for JavaScript.
    For a low-traffic site this is still very fast.

Don't get me wrong, I am not against implementing provisioned concurrency here, but I think provisioned concurrency was built with another use case in mind. For example, creating a CloudWatch event that triggers a route/Lambda every 15 minutes (or scheduled to be active during the site's low-traffic hours) would be a much cheaper option than provisioned concurrency.

curtis-trynow-io commented 2 years ago

So let me give some context, because I do not disagree that the average cold start time is acceptable. In our experimentation, that cold start time (200-400ms) is only achievable if you do not touch resources inside a VPC and the function runs purely in Lambda. With VPC overhead, we see cold starts creep up into the 2-3 second range, and this is across all runtimes, not just NodeJS. One of our primary reasons for deploying to AWS rather than Vercel is to access VPC resources without having to expose them publicly; Vercel's recommendation to expose things like DBs to all IPs and use password rotation is laughable.

If I manually hack in a provisioned concurrency (PC) config, these execution times drop to just code time as you would expect. What is unusual to me is that with NextJS, these cold start times are especially brutal. Here is a screenshot contrasting a cold start vs a normal run captured straight from CloudWatch:

[Screenshot: CloudWatch log excerpt contrasting a cold start and a warm run]

So 300ms vs 7500ms. The difference is staggering, and the latter is too painful for any user to tolerate. I assume some of that pain is NextJS doing server-side rendering on first load, but if I can keep that hot, then it is one-time pain rather than every-time pain.

Another thing I want to address is your first point, that PC mostly benefits low-traffic sites. Our finding has been that a small pool of PC functions for everything, coupled with an autoscaling policy, is basically essential, or else you get killed by chained cold starts. Lambda has to start a second copy of a function (a second container) whenever no idle paused container is available, so under a spiky workload early requests keep racking up ~2s cold start penalties, one after another. The cost is negligible compared to a terrible user experience.
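In case it is useful, Application Auto Scaling can manage the PC pool size from Terraform as well; a minimal sketch, assuming the hypothetical function and alias names from the wiring sketched earlier:

    resource "aws_appautoscaling_target" "pc" {
      min_capacity       = 2
      max_capacity       = 10
      resource_id        = "function:${aws_lambda_function.page.function_name}:${aws_lambda_alias.live.name}"
      scalable_dimension = "lambda:function:ProvisionedConcurrency"
      service_namespace  = "lambda"
    }

    resource "aws_appautoscaling_policy" "pc" {
      name               = "pc-target-tracking"
      policy_type        = "TargetTrackingScaling"
      resource_id        = aws_appautoscaling_target.pc.resource_id
      scalable_dimension = aws_appautoscaling_target.pc.scalable_dimension
      service_namespace  = aws_appautoscaling_target.pc.service_namespace

      target_tracking_scaling_policy_configuration {
        # scale out once 70% of the provisioned environments are in use
        target_value = 0.7
        predefined_metric_specification {
          predefined_metric_type = "LambdaProvisionedConcurrencyUtilization"
        }
      }
    }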

I hope that helps contextualize things.

curtis-trynow-io commented 2 years ago

Hey @ofhouse , I wound up doing some internal testing of this targeting localhost and avoiding Lambda entirely, after doing some other work in which I removed all VPC networking from our deployed code in a given environment (so only Lambda overhead remained), and still found that the start-up time was bad. I then compared that against a vanilla NextJS app, which basically pinned the fault on our own code.

With that in mind, though, I think that provisioned concurrency presents an immediate way to pay to "fix" this issue for those willing to pay, and it opens up a way to run NextJS code on AWS that Lambda supports but that is currently walled off from this module. What do you think?

ofhouse commented 2 years ago

Thanks for providing more context. 👍 Agree that for VPC integrations PC could provide a quick win.

However, I still think that invoking the Lambdas through scheduled CloudWatch events is the preferable way to keep them warm, since it's way cheaper than PC.

Unfortunately I currently don't have time to bring PC support to the module, but I would accept a PR for this if you have time to put something together.

curtis-trynow-io commented 2 years ago

I am fairly busy but am fine with taking this on and putting up a PR, since I am the only one asking for it to be supported 😅. Thank you for the prompt reply. I'll try to get something up in the next week or so.

JT-Bruch commented 2 years ago

@curtis-trynow-io were you able to get this in place? The reason I ask is that, as a startup, we need top-notch SEO, and these cold starts are killing our technical SEO scores. I have a Lambda that invokes our site every minute to keep one container alive; however, it still is not enough to keep things happy.

If not - I can put together a PR in order to get provisioned concurrency up for this module. Let me know.

khuezy commented 2 years ago

@JT-Bruch what does your EventBridge look like? It would be nice to have an opt-in parameter that automatically creates EventBridge triggers to keep the Lambdas warm (every 5 minutes).

I manually created a rule scheduled every 5 minutes to keep my site warm:

Function: [your-app]__NEXT_PAGE_LAMBDA_0
Configure input: Constant (JSON text)

{"requestContext": {"http": {"method": "GET"}}, "headers": {"x-nextjs-page": "/index"}}

NOTE: you may need to update your page, or add multiple rules for specific pages.
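For anyone managing this with Terraform instead of the console, a rough sketch of the same rule (the data source assumes the function from above has already been deployed):

    data "aws_lambda_function" "page_0" {
      function_name = "[your-app]__NEXT_PAGE_LAMBDA_0"
    }

    resource "aws_cloudwatch_event_rule" "keep_warm" {
      name                = "next-page-keep-warm"
      schedule_expression = "rate(5 minutes)"
    }

    resource "aws_cloudwatch_event_target" "keep_warm" {
      rule = aws_cloudwatch_event_rule.keep_warm.name
      arn  = data.aws_lambda_function.page_0.arn
      # the constant JSON input from above
      input = jsonencode({
        requestContext = { http = { method = "GET" } }
        headers        = { "x-nextjs-page" = "/index" }
      })
    }

    # EventBridge needs permission to invoke the function.
    resource "aws_lambda_permission" "keep_warm" {
      statement_id  = "AllowKeepWarmInvoke"
      action        = "lambda:InvokeFunction"
      function_name = data.aws_lambda_function.page_0.function_name
      principal     = "events.amazonaws.com"
      source_arn    = aws_cloudwatch_event_rule.keep_warm.arn
    }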

curtis-trynow-io commented 2 years ago

@JT-Bruch no, I never had time to put this in place. We went with a simpler engineering approach and compared our app against a vanilla NextJS app to see what else we could do. This led us to rip almost everything out of _app.tsx and, where appropriate, use dynamic imports instead: https://nextjs.org/docs/advanced-features/dynamic-import. Even after that, a cold start still runs about 6 seconds, but that is more acceptable for us since we are not focused on SEO.

We even wound up adding logs throughout our code to capture elapsed times, and saw that during a cold start our .env.local data gets loaded and then nothing happens for several seconds. As far as we can tell it is all NextJS initialization, as things are processed for the first time. Subsequent requests usually complete in roughly 200 milliseconds.

@khuezy that will only keep one Lambda container warm. If two requests ever overlap such that the single container is busy, the second one hits another cold start and a bad user experience. Provisioned concurrency lets you keep as many warm containers running as you are willing to pay for.

JT-Bruch commented 2 years ago

@curtis-trynow-io sorry, was on vacation; that makes total sense. I ran into the same issue even hosting the application on Vercel, which leads me to believe it's not an AWS issue.

@khuezy that is the issue I am running into: there happen to be enough concurrent requests to cause a second cold start when GTMetrix hits the site.

khuezy commented 2 years ago

Doesn't Vercel use AWS?

JT-Bruch commented 2 years ago

Looks like they do, which means the infrastructure is probably the same under the hood.

ofhouse commented 2 years ago

Yes, the SSR part currently runs on AWS Lambda. However, they are actively working on swapping it out in favour of Cloudflare Workers.

Vercel has built an internal service that invokes the Lambdas (while not executing any Next.js code) on a custom schedule directly through the AWS SDK to keep them warm. They don't use provisioned concurrency for this.

khuezy commented 2 years ago

@ofhouse What do you think about adding https://github.com/jeremydaly/lambda-warmer to the node-bridge normalizeAPIGatewayProxyEvent function?

ofhouse commented 2 years ago

Yeah, I think it would make sense to add a built-in solution to keep the Lambdas warm. 👍 Since the module aims to simplify the deployment, we should add an option that lets you enable this and customize how often the Lambdas are invoked via CloudWatch Events.
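Purely as a sketch of how such an option could look from the user's side (the variable names are illustrative; nothing here is implemented):

    module "tf_next" {
      source = "milliHQ/next-js/aws"

      # hypothetical opt-in keep-warm settings
      lambda_keep_warm          = true
      lambda_keep_warm_schedule = "rate(5 minutes)"
    }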