sst / open-next

Open source Next.js serverless adapter
https://open-next.js.org
MIT License
3.75k stars 113 forks source link

Full support for ISR #57

Closed conico974 closed 1 year ago

conico974 commented 1 year ago

Hello,

As of now it seems that ISR is not working as expected. If the page is not found in the CDN (and cloudfront doesn't support stale-while-revalidate) it seems to work as if it was a SSR page.

I’ve found 2 possible solutions to this issue, but I’m not sure which one is the best :

I’d be happy to open a PR to solve this issue if I get some feedback from the maintainers or contributors.

ferdingler commented 1 year ago

The Next.js team has been working on what appears to be an extensible ISR cache that doesn't rely on disk, but rather on HTTP requests: https://github.com/vercel/next.js/blob/canary/packages/next/src/server/lib/incremental-cache/fetch-cache.ts#L28. It's still on experimental mode, and at the moment it appears to be Vercel-specific, which is not surprising 🙄 . But this seems like the long term solution for ISR and On-Demand Revalidation when self-hosting outside of Vercel. Basically write your own IncrementalCache implementation (similar to the link above) that writes files to S3.

dalhaan commented 1 year ago

I've just started playing around with open-next today to test its viability (as some clients prefer to own their architecture so using Vercel or Netlify isn't an option in some cases) and have also run into this same issue.

ISR with open-next seems to essentially be SSR but relies on CloudFront to provide the ISR capabilities by caching the page for the invalidation duration with the cache-control: s-maxage=<INVALIDATION_DURATION> header. This has unexpected outcomes as the HTML page and page data JSON will often be cached at different times, resulting in the initial page visit (served HTML) being different from client-side navigation (served page data JSON). This also results in different pages being served from different edge locations as it only gets cached at the specific edge location that served the page.

I've been thinking about a potential solution to this but I'm not sure about it's feasibility so keen for feedback. Next generates a .next/prerender-manifest.json file which contains the routes that are either SSG or ISR, along with their invalidation durations. This could be used to:

  1. copy SSG and ISR HTML & page data JSON files to the generated "assets" directory on open-next build (which will be uploaded to the S3 bucket)
  2. when the server-function lambda gets a request, it could reference the .next/prerender-manifest.json file and if the request is for a SSG or ISR route or page data, it could fetch and respond with the HTML or page data JSON file from S3.
  3. for ISR routes/page data, it could then compare the last modified date of the HTML or JSON file, reference the .next/prerender-manifest.json file to find the invalidation duration,
  4. if it has elapsed, build and replace the new HTML & page data JSON in the S3 bucket.

This functionality would also need to be bypassed if preview mode is enabled to have a SSR experience for every route.

My concerns with my proposed solution are:

Again, not sure if my solution is even viable/realistic but keen for feedback. fixed-open-next-functionality

ricoli commented 1 year ago

if a lambda cannot do extra work after it has responded (unsure of this as I don't have much experience). If it can't it could call another lambda where generating and saving html & page data for routes is its only job.

Yeah, unfortunately with async handlers (which is the case with open-next, and the recommended AWS approach), the lambda will terminate execution as soon as the response is returned. This means we can't initiate the revalidation of a page without awaiting it, as it likely won't finish by the time the stale response is returned. It's not much better with non-async handlers anyway, as the event loop is paused when the response is returned and is only resumed if and when the same lambda "executor" is subsequently reused, which might not happen in a timely manner, or at all! I'm wondering if open-next should have a lambda function dedicated to revalidating a page that the main lambda could asynchronously invoke, and it would be that lambda's job to build the new page version and save it to s3? Or perhaps the main lambda could send a message to an SQS queue that then triggers the revalidation lambda (which is what the abandoned serverless nextjs component does)...

dalhaan commented 1 year ago

Yeah that's along the lines of what I was thinking. The dedicated ISR Function could potentially use the NextServer (like what the Server Function uses) which could be given the request for the route/page data, which would generate fresh HTML & page data JSON files for the route, then upload them to S3. Yeah some sort of event deduping would be necessary, so multiple ISR Functions aren't triggered to build the same route, which could be achieved with some sort of deduping event queue or just some sort of cache or DB for marking whether a certain route is being generated or not.

ricoli commented 1 year ago

Yeah that's along the lines of what I was thinking. The dedicated ISR Function could potentially use the NextServer (like what the Server Function uses) which could be given the request for the route/page data, which would generate fresh HTML & page data JSON files for the route, then upload them to S3. Yeah some sort of event deduping would be necessary, so multiple ISR Functions aren't triggered to build the same route, which could be achieved with some sort of deduping event queue or just some sort of cache or DB for marking whether a certain route is being generated or not.

AWS SQS FIFO queues support content based de-duplication, which is likely why the serverless nextjs component chose FIFO queues... funnily enough I think it was disabled by default :laughing: but yeah, SQS FIFO queues can do that

dalhaan commented 1 year ago

Awesome, yeah that could be a good option.

dalhaan commented 1 year ago

This is along the lines of what I am thinking, although this could be improved if the preview mode & SSG/ISR route check could happen in a lambda@edge, so you don't have to worry about cold starts/distance from viewer to server function lambda when you just need to respond with a file from S3. The lambda@edge could also try to directly map a route request to its HTML file location in the S3 (/isr-route -> /.next/server/pages/isr-route.html) for quicker response times on a cache miss. activity-diagram

conico974 commented 1 year ago

I've been working on a different solution for this : #58 I've disabled minimalMode Basically what i do is that i created an Incremental Cache handler that i inject into next Server using experimental.incrementalCacheHandlerPath config. The cache handler reads and write the cached file (json html or rsc) directly to an S3 cache bucket Since we cannot continue the execution of the lambda once we have returned data from the cache, i check for x-nextjs-cache on the response from the server and if it's stale or miss, i just fire and forget a head request with x-prerender-revalidate header which revalidate the page data. The only issue remaining is that cloudfront could have different cache entry for the html and the page data, but i don't see a way to handle this issue.

conico974 commented 1 year ago

This is along the lines of what I am thinking, although this could be improved if the preview mode & SSG/ISR route check could happen in a lambda@edge, so you don't have to worry about cold starts/distance from viewer to server function lambda when you just need to respond with a file from S3. The lambda@edge could also try to directly map a route request to its HTML file location in the S3 (/isr-route -> /.next/server/pages/isr-route.html) for quicker response times on a cache miss. activity-diagram

Maybe i didn't get it, but how do you handle the case where your data are stale ? S3 would return the data but no revalidation would happen. One of the solution would be to have a lambda@edge in front of S3, but i think that it will add a lot of complexity and it could potentially be a pain to maintain in the future if next.js go a different direction.

dalhaan commented 1 year ago

When the server function gets the html or page data back from S3 and it is stale, it would trigger the hypothetical ISR function lambda, then respond with the stale data. The Server Function may also need to set cache-control: s-maxage=0 on the response, so CloudFront doesn’t cache the stale file again, then the next request for that file will be fresh and get cached.

dalhaan commented 1 year ago

Your solution sounds really interesting. Does the revalidation happen in the Server Function before it responds, or does it trigger a separate lambda?

conico974 commented 1 year ago

No the revalidation just fire a http HEAD request in preview mode to the NextServer lambda, but it doesn't wait for the response https://github.com/conico974/open-next/blob/cb55c784a4a6d8c688521b9a2806f391d5c6ee62/src/adapters/server-adapter.ts#L109. It also modify the cache-control header so that the next request use NextServer

dalhaan commented 1 year ago

I really like your solution as it uses the NextServer to handle the ISR, really reducing the amount of custom logic needed.

I've been testing out your fix and it has been working great, although I have noticed sometimes it can take up to ~9 seconds to get a response on a fresh page. It kept timing out when I first deployed it when trying to visit a SSG page but I got a response when I increased the server function timeout to 10 secs. I've also noticed it a couple of times on ISR routes when I haven't visited them for a while.

isr timeout
conico974 commented 1 year ago

It seems like a cold start issues. I was having a similar issues because the node_modules of my lambda contained a lot of dev dependencies like @swc/core. I added this to my next config :

experimental: {
    outputFileTracingIgnores: [
      '**swc+core**',
      '**esbuild**',
      '**sharp**',
      '**/node_modules/*webpack*/**',
    ],
  },

Cold starts is really an issue with open-next as of now

dalhaan commented 1 year ago

Ah great I will need to try that out. I think we could get some massive speed wins building upon your solution by using it as a dedicated ISR lambda alongside the current Open Next server function for cache misses, SSR and API routes.

We could use an edge lambda to determine if the requested route is static, ISR or dynamic and get the route's HTML and page data directly from the S3 cache. Then it could decide whether to trigger your ISR lambda to regenerate the page (if it is stale or a miss), and/or forward the request to the current Open Next server function (if it is a miss, SSR or API route) as the cold starts don't seem to be as bad in minimal mode.

That way we could get super fast responses for cached routes and the long cold starts don't need to be experienced by the user as the non-minimal mode lambda is only run as a background task for updating the cache.

conico974 commented 1 year ago

I've looked a little bit more into it and, for me at least, the Open next function cold start is as fast (or as slow) in minimal mode or not. I think that this issue is not only present in Open Next but also on Vercel https://github.com/vercel/vercel/discussions/7961

I've also tried implementing a lambda@edge function in front of the Open Next function and i'm really not convinced about it. It adds between 400-600 ms on cold start and if it's a miss SSR or API route, we still have the cold start from the Open Next function for those routes. So what i've done in #58 is that i intercept all the call to the Next.js request handler before even starting the Next server. Then, we check in the prerender manifest and in the cache for this page and either return the page if it's a hit or invoke the Next handler.

I'm not sure if this shoud be in this repo to be honest. Maybe adding this as a flag like --experimental-cache-interceptor could be an option, because this could break on every release of next.js. For the moment it's not working with the app directory for example.

khuezy commented 1 year ago

Great discussions. @conico974 can you try sst 2.4.0 which uses open next 0.8.0, which removes minimalMode? I heard AWS will launch swr in a couple weeks, I'm not sure if that'll resolve our ISR issues. As far on demand ISR, your design looks interesting @dalhaan. Note that w/ the latest version, we're no longer hitting middleware at VIEWER_REQUEST since we've removed minimalMode.

khuezy commented 1 year ago

The Next.js team has been working on what appears to be an extensible ISR cache that doesn't rely on disk, but rather on HTTP requests: https://github.com/vercel/next.js/blob/canary/packages/next/src/server/lib/incremental-cache/fetch-cache.ts#L28. It's still on experimental mode, and at the moment it appears to be Vercel-specific, which is not surprising 🙄 . But this seems like the long term solution for ISR and On-Demand Revalidation when self-hosting outside of Vercel. Basically write your own IncrementalCache implementation (similar to the link above) that writes files to S3.

This looks very interesting! CC: @fwang

khuezy commented 1 year ago

I really like your solution as it uses the NextServer to handle the ISR, really reducing the amount of custom logic needed.

I've been testing out your fix and it has been working great, although I have noticed sometimes it can take up to ~9 seconds to get a response on a fresh page. It kept timing out when I first deployed it when trying to visit a SSG page but I got a response when I increased the server function timeout to 10 secs. I've also noticed it a couple of times on ISR routes when I haven't visited them for a while.

isr timeout

This was b/c there were 2 lambdas, each causing the cold start. Each lambda takes 3-4 seconds in cold start b/c of the large package size. I have a PR to reduce the package which I'm using via patch and my cold starts are ~1s: https://github.com/serverless-stack/open-next/pull/40

conico974 commented 1 year ago

Great discussions. @conico974 can you try sst 2.4.0 which uses open next 0.8.0, which removes minimalMode? I heard AWS will launch swr in a couple weeks, I'm not sure if that'll resolve our ISR issues. As far on demand ISR, your design looks interesting @dalhaan. Note that w/ the latest version, we're no longer hitting middleware at VIEWER_REQUEST since we've removed minimalMode.

I was trying it right now. With open next 0.8.0 it's even worse since by default next.js tries to store the cache on disk and in memory. Since it's a lambda we can sometimes be in a state where a page has been revalidated on an instance of a lambda and not on another, and as soon as the lambda are terminated we lost the revalidation and go back to using the page has it was on build.

The support for swr from AWS won't help us without minimalMode since the revalidation is happening in the background in the lambda.

I just updated my PR to work on 0.8.0, maybe you could give it a try @khuezy : #58

khuezy commented 1 year ago

Nice work, I don't have a use case for ISR at the moment, but I'll try to play around w/ your PR in the coming weeks when I have a scenario.

juliocorzo commented 1 year ago

Seconding SQS FIFO queues. In serverless-next, a page with ISR forces the server to write to queue, and another Lambda is then executed to regenerate page based on the allotted ISR time. I have only validated this in Next.js 12 (as that's where serverless-next died, but happy to provide diagrams/explanation of how this was implemented if useful. I use Terraform and CloudFormation for managing infrastructure so would have to rework to work with open-next.