jetbridge / cdk-nextjs

Deploy a NextJS application using AWS CDK
https://constructs.dev/packages/cdk-nextjs-standalone
Apache License 2.0

403 error on static files during deploy #211

Closed anthonyroach closed 2 months ago

anthonyroach commented 7 months ago

version: 4.0.0-beta.25

If I try to access my nextjs site while a deploy is happening I'll often get 403 errors on all the static files.

There was a previous issue similar to this one, but I'm on a version that includes the fix for it and am still seeing the problem.

Here’s what I think might be going on…

The changed .js, .css, and .woff2 files are given unique, content-hashed file names for each nextjs build. So when you deploy a new nextjs build, it uploads all the new files and then deletes the old ones that no longer exist in the build.

But the S3 files are deployed separately from the server lambda, and the server lambda responds with HTML that references those .js, .css, and .woff2 files. So the server lambda may return HTML containing the new file names before the S3 files are deployed, or the old file names after the S3 files are deployed. Both cases result in 403 errors from S3 because the referenced files don't exist.
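To make the timing problem concrete, here is a toy model of the race (hypothetical names, not cdk-nextjs code): the bucket holds hashed asset files, and the server lambda serves HTML referencing whichever build it was deployed with.

```typescript
// Toy model: a Set stands in for the S3 bucket's object keys.
type Bucket = Set<string>;

function fetchAsset(bucket: Bucket, assetKey: string): number {
  // S3 typically answers 403 (AccessDenied) rather than 404 for missing keys
  return bucket.has(assetKey) ? 200 : 403;
}

// Before the deploy: old lambda + old assets, everything works.
const bucket: Bucket = new Set(['main-OLD123.js']);
const htmlAssetRef = 'main-OLD123.js'; // baked into the running lambda's HTML

console.log(fetchAsset(bucket, htmlAssetRef)); // 200

// Mid-deploy: new assets uploaded and old ones pruned, lambda not yet swapped.
bucket.add('main-NEW456.js');
bucket.delete('main-OLD123.js');

console.log(fetchAsset(bucket, htmlAssetRef)); // 403: old HTML, pruned asset
```

The mirror-image case (lambda swapped before the assets land) fails the same way, with the new HTML referencing `main-NEW456.js` before it exists.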

The only way I can see to fix this is to do two things:

1. Disable pruning, so old static files remain in the bucket while the old server lambda may still serve HTML referencing them.
2. Make the server lambda deployment depend on the static asset deployment, so the new files exist in S3 before any new lambda instance starts serving HTML that references them.

I'm not seeing any way to do the above without forking. The bucket prune setting isn't exposed and is hard-coded to true.

anthonyroach commented 7 months ago

Actually looks like prune can be set without a fork:

        nextjs: {
          nextjsStaticAssetsProps: {
            prune: false,
          },
        },

And the dependency can also be added like so:

    nextjs.serverFunction.lambdaFunction.node.addDependency(
      nextjs.staticAssets,
    );

I haven't tested the above yet.

anthonyroach commented 7 months ago

I tested the above configuration and it eliminated the 403 errors on deploy.

IMHO, the above settings should be the default because having your site break on deploy is not good.

bestickley commented 7 months ago

@anthonyroach, look at the below code from here:

    // must find old object keys before uploading new objects so we know which objects to prune
    const oldObjectKeys = await listOldObjectKeys({
      bucketName: props.destinationBucketName,
      keyPrefix: props.destinationKeyPrefix,
    });
    if (!props.zip) {
      debug('Uploading objects to: ' + props.destinationBucketName);
      await uploadObjects({
        bucket: props.destinationBucketName,
        keyPrefix: props.destinationKeyPrefix,
        filePaths,
        baseLocalDir: sourceDirPath,
        putConfig: props.putConfig,
      });
      if (props.prune) {
        debug('Emptying/pruning bucket: ' + props.destinationBucketName);
        await pruneBucket({
          bucketName: props.destinationBucketName,
          filePaths,
          baseLocalDir: sourceDirPath,
          keyPrefix: props.destinationKeyPrefix,
          oldObjectKeys,
        });
      }
    }

See how we're being careful to get the old object keys, upload the new objects, and only then prune the old objects? This order shouldn't result in any 404s. What am I missing?

Pruning is desired because you don't want a bunch of stale static files sitting in your S3 bucket, right?

anthonyroach commented 7 months ago

It's a race condition. If the static files get deployed before the server lambda, the old files are deleted while the old lambda is still serving HTML that references them. If the lambda deploys first, it starts serving HTML that references new static files that don't exist yet. Either way you get 404/403 errors.

If the static files need to be pruned, they need to be pruned after no more old server lambda instances are running, but the new static files need to be deployed before any new lambda instances start running the new code. The only way I can see to satisfy both conditions is to either not prune, or prune much later in some kind of janitor or lifecycle process outside of the deploy. And in either case the new lambda can't be deployed until the static files have been deployed.

bestickley commented 7 months ago

I didn't know the server lambda holds static references to files in S3. I thought the server lambda read static files from S3 at request time. Could there be any other reason for this behavior? Otherwise it sounds like prune: false is the best default.

anthonyroach commented 6 months ago

Yeah, the lambda just renders the HTML from the server code inside the lambda itself, and the HTML contains URLs pointing at the static files, which are then served out of S3. The server code inside the lambda is self-contained and doesn't reach out to the bucket directly, as far as I know.

ansgarS commented 5 months ago

We're also facing this issue quite often... even days after the last deployment.

Is it maybe possible that the index.html is cached for too long on the client? I guess that's something one could configure in the server code / distribution behavior?
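One way to probe the client-caching theory is to ask browsers to revalidate HTML on every navigation. A sketch, assuming the app can override response headers via the headers() option in its Next.js config (Next.js applies its own Cache-Control for some route types, so verify what actually reaches the browser; the route matcher below is only illustrative):

```typescript
// Hypothetical next.config contents: mark HTML responses as revalidate-always
// so a browser tab doesn't keep reusing a stale document that references
// pruned asset URLs. Hashed /_next/static assets can stay long-cacheable
// because their file names change on every content change.
const htmlCacheHeaders = [{ key: 'Cache-Control', value: 'no-cache' }];

const nextConfig = {
  async headers() {
    return [
      {
        source: '/:path*', // matches page routes; tune to exclude static assets
        headers: htmlCacheHeaders,
      },
    ];
  },
};

// module.exports = nextConfig; (in next.config.js)
```

This doesn't fix the deploy-window race itself, but it rules out long-lived cached HTML as a contributing cause.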

bestickley commented 5 months ago

@anthonyroach and @ansgarS, I understand the issue now. If a user fetches the website before a deployment and doesn't refresh after it, the HTML in the browser tab will reference old JS assets and result in 404s. I agree the default behavior should be prune: false.

Ideally, the NextjsBucketDeployment lambda function would prune files that are X days old, because I think it's safe to assume a user won't keep a browser tab open for more than X days. Maybe X = 30? We don't want static assets piling up, although the cost would likely be negligible; it's more for cleanup purposes.
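A sketch of the selection logic such a janitor could use (names are hypothetical, and the actual listing/deleting of objects via the AWS SDK is omitted):

```typescript
interface S3ObjectInfo {
  key: string;
  lastModified: Date;
}

// Pick objects that are both absent from the current build's manifest and
// older than maxAgeDays. Everything else is either live, or too fresh to
// prune safely because an open browser tab might still request it.
function findPrunableKeys(
  objects: S3ObjectInfo[],
  liveKeys: Set<string>,
  maxAgeDays: number,
  now: Date,
): string[] {
  const cutoffMs = now.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return objects
    .filter((o) => !liveKeys.has(o.key))
    .filter((o) => o.lastModified.getTime() < cutoffMs)
    .map((o) => o.key);
}
```

One caveat: if a deploy skips re-uploading unchanged files, lastModified can be much older than the moment an asset was last referenced, so tracking when a key stopped appearing in the build manifest would be a safer age signal.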

ansgarS commented 5 months ago

Interestingly, ...

Something else I noticed: these chunk load errors are always connected with some SVG (it's our icon lib).

[Screenshot from 2024-04-29 omitted]

bestickley commented 5 months ago

@ansgarS, when was your last non-prune deployment? Could someone have opened a browser tab before then and resulted in this error?

ansgarS commented 5 months ago

The last deployment with prune: true was on April 19th, and yes, our application is not used very frequently, so that could be possible. We'll wait another week and see if it still happens :)

anthonyroach commented 5 months ago

Browser caching certainly makes this error more likely to happen, but it can even happen with caching disabled if you hit the page right in the middle of the deploy. That happened to us quite a few times when our app was under heavy development and was being deployed very frequently throughout the day. Turning off pruning and adding the dependency between the server lambda and the bucket deployment has totally resolved the problem for us. Haven't had a single error since we rolled that out back in March.

ansgarS commented 4 months ago

Okay, I think we solved it with another fix: we were loading our icons lazily like this:

    import { lazy as _lazy } from 'react';

    function lazy(importFn: { (): Promise<typeof import('./assets/*/*.svg')> }) {
        return _lazy(async () => {
            const m = await importFn();
            return { default: m.ReactComponent };
        });
    }

    export const getIcon = (
        iconName: IconNameType
    ): React.LazyExoticComponent<React.FunctionComponent<React.SVGProps<SVGSVGElement>>> => {
        const [iconDir, iconFileName] = ICON_PATHS[iconName];

        return lazy(async () => import(`./assets/${iconDir}/${iconFileName}.svg`));
    };

Removing the lazy-loading also squashed the ChunkLoadError for us.
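For comparison, an eager version could look like the sketch below. The component names are hypothetical, and plain functions stand in for the bundler-provided SVG imports so the sketch is self-contained:

```typescript
// With static imports, every icon is compiled into the main bundle at build
// time, so rendering an icon never triggers a separate chunk request that
// could 404/403 after that chunk is pruned from S3.
type SvgProps = Record<string, unknown>;
type SvgComponent = (props: SvgProps) => null;

// Stand-ins for static imports like `import CheckIcon from './assets/actions/check.svg'`.
const CheckIcon: SvgComponent = () => null;
const CrossIcon: SvgComponent = () => null;

const ICONS = {
  check: CheckIcon,
  cross: CrossIcon,
} as const;

type IconNameType = keyof typeof ICONS;

const getIcon = (iconName: IconNameType): SvgComponent => ICONS[iconName];
```

The trade-off is a larger main bundle, which is presumably why the lazy version existed in the first place.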

ansgarS commented 4 months ago

Which means we probably had a different issue than you had 🥶

bestickley commented 2 months ago

While a better fix is described in my previous comment, for now I'll resolve this by making prune default to false.