serverless-nextjs / serverless-next.js

⚡ Deploy your Next.js apps on AWS Lambda@Edge via Serverless Components

Zero-downtime deployments #2400

Open thijsdaniels opened 2 years ago

thijsdaniels commented 2 years ago

tl;dr

The lambdas are unreachable for a while during deployment because CloudFront keeps referring to the old function version, which no longer exists. Since function.latestVersion is not allowed for Lambda@Edge, I think the next best thing would be to retain the previous version until deployment is complete, and then clean up old function versions.

More Info

Is your feature request related to a problem? Please describe.

At the moment, using the latest versions of Builder from @sls-next/cdk-core and NextJSLambdaEdge from @sls-next/cdk-construct (I haven't tested the serverless component), all lambdas are unreachable for about 3 minutes while the CloudFront distribution is being updated. During this window, every response that is not already in CloudFront's edge cache returns a 503, until the distribution has finished updating.

I believe this is because the previous versions of the lambdas are deleted (as per currentVersion.deletionPolicy) while the CloudFront distribution still references them via functionVersion: this.defaultNextLambda.currentVersion. At the moment, CloudFormation first creates the new lambda version and deletes the previous one, and only then tells CloudFront to start using the new version. This means that for a short while, the distribution still refers to the ARN of the now-deleted previous version. Once the distribution has finished deploying, it refers to the new ARN and all responses work again.
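
To illustrate, this is roughly how an edge lambda ends up attached to the distribution. This is a simplified sketch using the standard CDK API rather than the construct's exact code; scope, assetsBucket, and defaultNextLambda are placeholders:

import { Distribution, LambdaEdgeEventType } from "aws-cdk-lib/aws-cloudfront";
import { S3Origin } from "aws-cdk-lib/aws-cloudfront-origins";

// Simplified sketch: the behavior pins a specific published version of the
// edge lambda. If that version is deleted before the distribution update has
// finished propagating, edge locations invoke a non-existent ARN and uncached
// requests fail with a 503.
const distribution = new Distribution(scope, "NextJSDistribution", {
  defaultBehavior: {
    origin: new S3Origin(assetsBucket), // placeholder origin
    edgeLambdas: [
      {
        eventType: LambdaEdgeEventType.ORIGIN_REQUEST,
        functionVersion: defaultNextLambda.currentVersion,
        includeBody: true,
      },
    ],
  },
});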

Describe alternatives you've considered

I have checked what happens when I change the lambda versions' deletionPolicy to RETAIN, and indeed I no longer experience the 503 errors in that case, even for pages that use getServerSideProps. We could enable lambda retention in our projects for now and clean up old versions manually. That wouldn't be ideal in my opinion, though, because every dev would need to remember to do it, or otherwise build their own automated cleanup solution.
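
For anyone who wants to try this workaround from their own stack, here is a minimal sketch, assuming nextApp is the NextJSLambdaEdge instance (the image and API lambdas are optional, hence the optional chaining):

import { RemovalPolicy } from "aws-cdk-lib";

// Workaround sketch: retain old versions so CloudFront never points at a
// deleted ARN during a deployment. Old versions accumulate until they are
// cleaned up manually.
nextApp.defaultNextLambda.currentVersion.applyRemovalPolicy(RemovalPolicy.RETAIN);
nextApp.nextImageLambda?.currentVersion.applyRemovalPolicy(RemovalPolicy.RETAIN);
nextApp.nextApiLambda?.currentVersion.applyRemovalPolicy(RemovalPolicy.RETAIN);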

I've also tried using functionVersion: this.defaultNextLambda.latestVersion, as I'm sure you have as well, only to be told off by AWS that this is not supported for Lambda@Edge (but why though 😢). I've also looked into retrieving the lambda by its "live" alias, but that would only work if the previous versions are retained, in which case the alias itself doesn't add anything.

Describe the solution you'd like

What I think would be a decent solution (unless there's an actual proper way to do this in AWS that I haven't found; please let me know if that's the case :D), is to first create the new version of the lambda without deleting the current version (i.e. DeletionPolicy.RETAIN), then update the CloudFront distribution, and then delete the previous version(s) of the lambda, all as part of the CDK/SLS deployment.

Is there a way of implementing this that doesn't feel like a workaround? (Wouldn't it be nice if AWS had a FunctionVersionDeletionPolicy.RETAIN_ONE or something 😏) And I'm also curious: does anyone else experience this behavior, or did I implement something incorrectly?

jlegreid commented 2 years ago

@thijsdaniels We are also having this issue, except in our case our lambdas have been unreachable for upwards of 15 minutes during a build. This generally only happens after we haven't pushed code for a few days, like after a weekend, and it only started happening about a month and a half ago for us.

Thanks for the deletionPolicy tip, we will try that for now so at least our visitors don't experience the 503 errors.

thijsdaniels commented 2 years ago

@jlegreid and anyone else trying to work around this:

If you're using CDK and you're looking for an automated solution, you could use a trigger to clean up the old lambda versions after deployment.

...
import * as path from "path";
import { Stack } from "aws-cdk-lib";
import { NextJSLambdaEdge } from "@sls-next/cdk-construct";
import { PolicyStatement } from "aws-cdk-lib/aws-iam";
import { Code, Runtime } from "aws-cdk-lib/aws-lambda";
import { TriggerFunction } from "aws-cdk-lib/triggers";

export class MyStack extends Stack {
  constructor(...) {
    super(...);

    const nextApp = new NextJSLambdaEdge(...);

    const pruneLambdas = new TriggerFunction(this, "PruneLambdas", {
      runtime: Runtime.NODEJS_14_X,
      handler: "index.handler",
      code: Code.fromAsset(path.join(__dirname, "./pruneLambdas")),
      environment: {
        // The image and API lambdas are optional, so fall back to an empty
        // string to keep the environment values defined.
        DEFAULT_LAMBDA_ARN: nextApp.defaultNextLambda.functionArn,
        IMAGE_LAMBDA_ARN: nextApp.nextImageLambda?.functionArn ?? "",
        API_LAMBDA_ARN: nextApp.nextApiLambda?.functionArn ?? "",
      },
      initialPolicy: [
        new PolicyStatement({
          actions: ["lambda:ListVersionsByFunction", "lambda:DeleteFunction"],
          resources: [
            nextApp.defaultNextLambda.functionArn,
            nextApp.nextImageLambda?.functionArn,
            nextApp.nextApiLambda?.functionArn,
          ].filter((arn): arn is string => !!arn),
        }),
      ],
    });

    // Run the prune function only after the rest of the app (including the
    // CloudFront distribution) has been deployed.
    pruneLambdas.executeAfter(nextApp);
  }
}

The ./pruneLambdas/index.js (no TS support afaik) could then use the aws-sdk to retrieve and delete all but the latest version of each lambda.

I haven't tested this yet, but I'll give it a go some time soon and post the results here for anyone interested, as I think this might be a decent solution for the @sls-next/cdk-construct as well.

thijsdaniels commented 2 years ago

Strangely, I can no longer reproduce the 503s. No matter what I do, and regardless of which currentVersion.removalPolicy I specify, AWS always retains the old versions of the lambdas. I'm not sure what causes this, or since when it has been the case.

As far as I'm concerned though, this feature request can be closed and instead I will open a new feature request specifically for cleaning up old lambdas.

If there are no objections, I'll close this issue in a couple of days.

jlegreid commented 2 years ago

@thijsdaniels Thank you for all your help with this. That's interesting that AWS is suddenly retaining all old versions; I wonder what could have changed? The retention policy in the sls-next package still seems to be set to DESTROY, so it's odd that AWS suddenly wouldn't respect it.

We only ran into the 503 issue after not having pushed code for a few days, so it was really hard to reproduce. It always seemed to coincide with the AssetDeploymentstaticPagesCustomResource step taking significantly longer than average, even when the number of static pages shouldn't have changed. I'm not sure if you were seeing the same thing, or if you were able to reproduce it more consistently before the Lambda retain change.

Have you been able to test the pruneLambdas trigger? I was going to implement something similar on our end, but if you are planning on opening a PR soon I may just wait. If you let me know when you open it, I can have my team express their support as well.

I feel uneasy about trusting that the sudden retention of lambdas was an intentional change, so I wonder if we should also open a PR with sls-next to permanently change the policy for defaultNextLambda to RETAIN (https://github.com/serverless-nextjs/serverless-next.js/blob/e7af51980079a59fbb9ecc3d35c878c62b919fc7/packages/serverless-components/nextjs-cdk-construct/src/index.ts#L156), in addition to the pruneLambdas PR. I can open the policy change PR.
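
For reference, a rough sketch of what that construct-side change could look like, assuming the construct keeps creating the lambdas with the standard aws-cdk-lib Function construct (scope, serverBuildDir, and the surrounding props here are placeholders, not the construct's actual configuration):

import { Code, Function as Lambda, Runtime } from "aws-cdk-lib/aws-lambda";
import { RemovalPolicy } from "aws-cdk-lib";

// Sketch: retain previously published versions instead of destroying them, so
// the distribution never references a deleted version ARN mid-deployment.
const defaultNextLambda = new Lambda(scope, "DefaultLambda", {
  runtime: Runtime.NODEJS_14_X,
  handler: "index.handler",
  code: Code.fromAsset(serverBuildDir), // placeholder for the construct's build output
  currentVersionOptions: {
    removalPolicy: RemovalPolicy.RETAIN,
  },
});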

thijsdaniels commented 2 years ago

@jlegreid It occurred to me that it probably makes sense that the old function versions don't get removed, despite the currentVersion.removalPolicy, when those functions are used as Lambda@Edge. After the new version is created, a DeleteFunction request is made to remove the previous version, but I think that request fails (apparently silently) because that version of the lambda is still replicated across the CloudFront edge locations at that time. I actually suspect that this has always been the case, and that the 503s in my case were caused by something else. The circumstances you're describing for seeing the 503s don't ring a bell to me, so I'm afraid I won't be much help with that.

Regardless, automatically removing the old versions seems like a nice feature to have, so I just finished testing the prune trigger, and it works well enough to use in production as far as I'm concerned. There are two things I don't really like about my implementation, but they aren't dealbreakers in my opinion:

  1. The pruning is always one or two lambda versions behind. Since the previous lambda version has only just been detached from the CloudFront behavior by the time the trigger runs, its deletion will initially fail, which I ignore with a try/catch. It will eventually be deleted during a later deployment, after CloudFront has had enough time to propagate the change to all edge locations.
  2. The prune trigger itself also gets updated on every deployment (though its old versions do get deleted automatically by the default removalPolicy, since it is not a replicated function), because I include the current version number of each lambda in its environment variables. I did it this way because I didn't want to derive the current version of each lambda from the versions returned by listVersionsByFunction, since that response is paginated per 50 versions, and because I didn't want to make assumptions about the format of the version strings.

Anyway, here are the relevant snippets:

// packages/serverless-components/nextjs-cdk-construct/src/index.ts

...
import { Function as Lambda, Runtime, Code } from "aws-cdk-lib/aws-lambda";
import { Distribution } from "aws-cdk-lib/aws-cloudfront";
import { RetentionDays } from "aws-cdk-lib/aws-logs";
import { TriggerFunction } from "aws-cdk-lib/triggers";
import { PolicyStatement } from "aws-cdk-lib/aws-iam";

class NextJSLambdaEdge extends Construct {
  ...
  public readonly pruneTrigger: TriggerFunction;

  public constructor(...) {
    ...

    this.pruneTrigger = this.createPruneTrigger(
      [
        this.defaultNextLambda,
        this.nextApiLambda,
        this.nextImageLambda
      ].filter<Lambda>((lambda): lambda is Lambda => !!lambda),
      this.distribution
    );
  }

  ...

  protected createPruneTrigger = (
    lambdas: Lambda[],
    distribution: Distribution,
  ): TriggerFunction => {
    const pruneTrigger = new TriggerFunction(this, "PruneTrigger", {
      runtime: Runtime.NODEJS_14_X,
      handler: "index.handler",
      code: Code.fromAsset(path.join(__dirname, "./functions/prune")),
      environment: {
        LAMBDAS: JSON.stringify(
          lambdas.map((lambda) => ({
            arn: lambda.functionArn,
            exclude: [lambda.currentVersion.version],
          })),
        ),
      },
      logRetention: RetentionDays.THREE_DAYS,
      initialPolicy: [
        new PolicyStatement({
          actions: ["lambda:ListVersionsByFunction"],
          resources: lambdas.map((lambda) => lambda.functionArn),
        }),
        new PolicyStatement({
          actions: ["lambda:DeleteFunction"],
          resources: lambdas.map((lambda) => `${lambda.functionArn}:*`),
        }),
      ],
    });

    pruneTrigger.executeAfter(distribution);

    return pruneTrigger;
  };
}

// packages/serverless-components/nextjs-cdk-construct/src/functions/prune/index.js

const Lambda = require("aws-sdk").Lambda;

const lambdas = JSON.parse(process.env.LAMBDAS ?? null) ?? [];
const client = new Lambda();

exports.handler = async () => {
  for (const lambda of lambdas) {
    const versions = (
      await client
        .listVersionsByFunction({
          FunctionName: lambda.arn,
        })
        .promise()
    ).Versions;

    const candidates = versions.filter(
      (version) => !["$LATEST", ...lambda.exclude].includes(version.Version),
    );

    for (const version of candidates) {
      try {
        await client
          .deleteFunction({
            FunctionName: version.FunctionName,
            Qualifier: version.Version,
          })
          .promise();
      } catch (error) {
        switch (error.code) {
          case "InvalidParameterValueException":
            /**
             * No-op: Presumably the lambda is still replicated on some of the
             * edge locations and therefore fails to delete. It will be deleted
             * after a future deployment.
             */
            console.log(error);
            break;
          default:
            throw error;
        }
      }
    }
  }
};

jlegreid commented 2 years ago

@thijsdaniels Nice, thanks for putting all that together. I'm still not sure of the actual source of our 503 issue either, but ever since changing the deletion policy to RETAIN it hasn't happened again, unless that was just a lucky coincidence with something on the AWS side of things. Are you going to open a PR with your prune trigger, or what's your next step from here?

gengoro commented 2 years ago

There are other causes of downtime besides the lambda versions' deletionPolicy.

If the bundle ID of the assets deployed to S3 changes, the previous assets are deleted, and requests return 404s until the new lambda version has been propagated to CloudFront. To avoid this, the prune option of BucketDeployment must be set to false.

https://github.com/serverless-nextjs/serverless-next.js/blob/master/packages/serverless-components/nextjs-cdk-construct/src/index.ts#L418-L454
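
A minimal sketch of a BucketDeployment with pruning disabled (scope, assetsBucket, and the asset path here are placeholders rather than the construct's actual values):

import * as path from "path";
import { BucketDeployment, Source } from "aws-cdk-lib/aws-s3-deployment";

// prune: false keeps assets from previous deployments in the bucket, so pages
// rendered by the old lambda version keep resolving their static assets until
// CloudFront has switched over to the new version.
new BucketDeployment(scope, "AssetDeploymentStaticPages", {
  destinationBucket: assetsBucket, // placeholder: the construct's assets bucket
  sources: [Source.asset(path.join(outputDir, "assets"))], // placeholder path
  prune: false,
});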

jlegreid commented 2 years ago

@gengoro Thanks for this tip. In your experience, could this also lead to a lambda returning a 503? I haven't seen any spikes in 404 errors during our deployments, only 503s, but it may still be related.