vercel / next.js

The React Framework
https://nextjs.org
MIT License

Next.js application running in a Google Cloud Run instance shows intermittent slow responses, from 30 seconds up to more than 1 minute #48539

Open vspathonis opened 1 year ago

vspathonis commented 1 year ago

Verify canary release

Provide environment information

Operating System:
      Platform: darwin
      Arch: arm64
      Version: Darwin Kernel Version 22.3.0: Mon Jan 30 20:38:37 PST 2023; root:xnu-8792.81.3~2/RELEASE_ARM64_T6000
    Binaries:
      Node: 16.18.1
      npm: 8.19.2
      Yarn: N/A
      pnpm: N/A
    Relevant packages:
      next: 12.3.0
      eslint-config-next: 12.3.0
      react: 18.2.0
      react-dom: 18.2.0

Which area(s) of Next.js are affected? (leave empty if unsure)

Data fetching (gS(S)P, getInitialProps), Middleware / Edge (API routes, runtime)

Link to the code that reproduces this issue

https://codesandbox.io/p/sandbox/eager-gagarin-9quspu?file=%2Fpages%2Fironsession.ts&selection=%5B%7B%22endColumn%22%3A20%2C%22endLineNumber%22%3A8%2C%22startColumn%22%3A20%2C%22startLineNumber%22%3A8%7D%5D

To Reproduce

  1. Deploy a Next.js application to Google Cloud Run.

  2. Send requests to the application and observe the response time.

Describe the Bug

I have a Next.js application running in a Google Cloud Run instance, and I'm experiencing intermittent slow responses. Sometimes the response time is around 30 seconds, but other times it can be more than 1 minute. This behaviour is not consistent and happens randomly. Right now, the only known fact is that while traffic is constant the instance responds very quickly, but whenever traffic drops off, the issue resurfaces.

The problem never happens when executing the same Docker image from a local workstation.

I have reached out to Google Cloud support, and they have informed me that the latencies are originating from the code execution.

Environment:

•   Next.js version: 12.3.0
•   Operating system:  node:16-alpine
•   Deployment platform: Google Cloud Run

Dockerfile:

# Install dependencies only when needed
FROM node:16-alpine AS deps

# Install Alpine packages
# To understand why libc6-compat might be needed, check:
# https://github.com/nodejs/docker-node/tree/b4117f9333da4138b03a546ec926ef50a31506c3#nodealpine
RUN apk add --no-cache libc6-compat

# Working directory
WORKDIR /app

# Install dependencies based on the preferred package manager
COPY package.json package-lock.json ./
RUN npm ci

################################################################################
# Rebuild the source code only when needed
FROM node:16-alpine AS builder

# Working directory
WORKDIR /app

# Copy from dependencies prepared files
COPY --from=deps /app/node_modules ./node_modules

# Copy application files
COPY . .

# Build the Next.js application
RUN npm run build

################################################################################
# Production image, copy all the files and run next
FROM node:16-alpine AS runner

# Image Build arguments
ARG GIT_HASH='NA'
ARG GIT_BRANCH='NA'

# Add system user
RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs

# Copy from builder prepared files
# Public files
COPY --from=builder /app/public /app/public
# Automatically leverage output traces to reduce image size
# https://nextjs.org/docs/advanced-features/output-file-tracing
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone /app/
COPY --from=builder --chown=nextjs:nodejs /app/.next/static /app/.next/static

# User to run all commands as
USER nextjs

# Working directory
WORKDIR /app

# Environment variables
ENV NODE_ENV production
ENV PORT 3000
# Next.js collects completely anonymous telemetry data about general usage.
# Learn more here: https://nextjs.org/telemetry.
# Note: 0 leaves telemetry enabled; set to 1 to opt out.
ENV NEXT_TELEMETRY_DISABLED 0
ENV BACKEND_URL https://api.example.com
ENV BROWSER none
ENV FAST_REFRESH true
# ... some other sensitive env vars here ...

# Exposed ports
EXPOSE 3000

# Container command
CMD ["node", "server.js"]
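For local testing, an image like the one above could be built and run along these lines (the tag and build args are placeholders, not from the original setup):

```shell
# Build the image, passing the git metadata the Dockerfile declares as ARGs,
# then run it with the same port the container EXPOSEs.
docker build -t nextjs-app:local \
  --build-arg GIT_HASH="$(git rev-parse HEAD)" \
  --build-arg GIT_BRANCH="$(git rev-parse --abbrev-ref HEAD)" \
  .
docker run --rm -p 3000:3000 nextjs-app:local
```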

Additional information: It's worth noting that I'm not doing any API requests in the backend of my Next.js application. I'm using Iron Session for authentication, as recommended in the Next.js documentation (link: https://nextjs.org/docs/authentication).

Any help or guidance on how to resolve this issue, debugging, or possible root causes would be greatly appreciated. Thank you!

Expected Behavior

The application should respond in a timely manner, with response times consistently under 10 seconds.

Which browser are you using? (if relevant)

No response

How are you deploying your application? (if relevant)

Google Cloud Run

Fredkiss3 commented 1 year ago

I think the latency comes from cold starts. Cloud Run is a serverless Docker platform, so when your app receives little traffic it gets shut down. The first request wakes the Docker instance, but it can take a little time before your app starts (Node has to start, then your Next.js server has to start). Every subsequent request that arrives within the window your app stays alive is faster, since everything is already started.

So it's normal that the app starts slowly when there is little to no traffic. On your local Docker workstation it is normal since I suppose you don't have cold starts there.

I think you can configure a minimum number of Docker instances in Cloud Run so that at least one instance of your app keeps running even when there is no traffic.
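If it helps, the minimum-instance count (and related startup settings) can be set from the CLI; a sketch with a placeholder service name and region — verify the flags against your installed gcloud version:

```shell
# Keep at least one warm instance, allocate CPU between requests, and
# boost CPU during instance startup. "my-nextjs-service" is a placeholder.
gcloud run services update my-nextjs-service \
  --region=europe-west1 \
  --min-instances=1 \
  --no-cpu-throttling \
  --cpu-boost
```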

MiltiadisKoutsokeras commented 1 year ago

Hi, I work with @vspathonis on the issue. Our deployment has a minimum of 50 Cloud Run instances waiting for traffic, and we have also disabled CPU throttling, which keeps CPU allocated for the whole lifetime of the container instance. We have been in communication with Google support and they have pointed out that code execution is at fault rather than their infrastructure. From our testing, the problem only surfaces when the Docker image is launched via Cloud Run and is NEVER reproduced on a local workstation or on other types of Cloud virtual machines.

What we would like is advice on how to debug the issue in order to pinpoint the root cause, as we cannot get more information from Google, which is our production deployment environment.

MiltiadisKoutsokeras commented 1 year ago

To help even further, it seems that our landing page consumes 30+ seconds to redirect to the login page, and another 30+ seconds to load the login page. Our corresponding back-end service (login/logout, data fetch, data push) responds within milliseconds for most requests, and our monitoring logs show up to 3 seconds of response time for the most processing-heavy API endpoints. So the API back-end is not responsible for the delays.

It seems the delays come mostly from the API Route processing, which mostly acts as a proxy: it performs the back-end requests, saves sensitive account data to the Iron Session encrypted cookie, and/or returns fetched JSON data to the UI. I am starting to suspect the delays are caused by the API Route processing (not the UI) getting very low CPU resources, even though the Cloud Run instance is already up and running and can serve static resources. Iron Session cookies involve encryption and decryption of session cookie data, so getting low CPU resources during this process may explain the phenomenon.

We just need good direction on how to debug and profile the performance of the Next application.

Fredkiss3 commented 1 year ago

I'm gonna sound stupid, but what about console.log or console.time (& console.timeEnd) to validate your hypothesis?

Since the issue only happens on GCP it is kinda hard to help you debug because not everyone can easily deploy to GCP.
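To make that concrete, a tiny wrapper along these lines could bracket the suspected-slow steps so the elapsed time shows up in the Cloud Run logs (the helper name and label are illustrative, not from the project's code):

```typescript
// Minimal timing helper sketching the console.time suggestion above.
function timed<T>(label: string, fn: () => T): T {
    console.time(label) // starts a named timer
    try {
        return fn()
    } finally {
        console.timeEnd(label) // logs "<label>: <elapsed> ms"
    }
}

// Hypothetical usage: wrap a suspected-slow step, e.g. session unsealing.
const result = timed("iron-session-unseal", () => 2 + 2)
```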

MiltiadisKoutsokeras commented 1 year ago

Thanks for the advice. We have added some extra logs in our application, and it seems that we get a 40-second delay between the initialization of the node server (CMD ["node", "server.js"]) at container startup and the first call of Server Side Rendering (getServerSideProps of index.tsx). See the attached screenshot from the Google Cloud logs.

MiltiadisKoutsokeras commented 1 year ago

It does not seem I can attach it directly, give me some moments to upload it to a network share.

MiltiadisKoutsokeras commented 1 year ago

Console logging messages taken from Google Cloud Run logs here.

From what we can see here, the server starts at 21:43:47.957 (message Listening on port 3000) and the getServerSideProps wrapper withSessionSSR is called from the index page at 21:44:27.202, nearly 40 seconds later. How can we find out what is going on between the node server being ready on port 3000 and the index page getServerSideProps?

Here are some snippets of our code to help you understand the workflow (I cannot disclose whole files due to company policy):

ironsession.ts

...
// Iron Session for cookie encryption https://github.com/vvo/iron-session
import type { IronSessionOptions } from "iron-session"
import { withIronSessionApiRoute, withIronSessionSsr } from "iron-session/next"
...
// Wrapper for applying Iron Session options in Server Side Rendering pages
export function withSessionSSR<P extends { [key: string]: unknown }>(
    handler: (
        context: GetServerSidePropsContext,
    ) => GetServerSidePropsResult<P> | Promise<GetServerSidePropsResult<P>>,
) {
    // eslint-disable-next-line no-console
    console.log(
        "[IRON SESSION] - withSessionSSR function triggered - TIME: ",
        new Date(),
    )
    return withIronSessionSsr(handler, ironSessionOptions)
}
...

index.tsx

...
import { withSessionSSR } from "../lib/ironsession"
...
const getServerSideProps: GetServerSideProps = withSessionSSR<IndexProps>(
    async ({
        req,
    }): Promise<
        | {
              props: IndexProps
          }
        | Redirectable
    > => {
        // eslint-disable-next-line no-console
        console.log("[INDEX] - Init server side - TIME: ", new Date())

        // Gets logged in user information from iron session here
        // ... 

        if (!logged_in) {
            return {
                redirect: {
                    permanent: false,
                    destination: "/login",
                },
            }
        }

        return {
            redirect: {
                permanent: true,
                destination: "/landing_page",
            },
        }
    },
)

export { getServerSideProps }

MiltiadisKoutsokeras commented 1 year ago

Running the same Docker image via Docker compose locally on my workstation showcases the following:

2023-04-19T08:49:48.111582942Z info  - Loaded env from /app/.env
2023-04-19T08:49:48.177201634Z Listening on port 3000
2023-04-19T08:49:52.657577486Z [IRON SESSION] - withSessionSSR function triggered - TIME:  2023-04-19T08:49:52.656Z
2023-04-19T08:49:52.663600095Z [INDEX] - Init server side - TIME:  2023-04-19T08:49:52.663Z
2023-04-19T08:49:52.683672238Z [INDEX] - Init server side - TIME:  2023-04-19T08:49:52.683Z

We see a roughly 4-second delay between the Listening on port 3000 and [IRON SESSION] ... messages, which is 10 times less than what we see on Google Cloud, but I still think it is way too much for application startup. How can we find out and profile what is going on between the node server becoming ready and the Next SSR function?

MiltiadisKoutsokeras commented 1 year ago

I have created a small curl script to measure website response latency; here is the behavior of the local container after startup:

Testing Website Response Time for :http://127.0.0.1:3000/

Lookup Time:        0.000021
Connect Time:       0.000096
AppCon Time:        0.000000
Redirect Time:      0.000000
Pre-transfer Time:  0.000116
Start-transfer Time:    5.105965

Total Time:     5.106050
Testing Website Response Time for :http://127.0.0.1:3000/

Lookup Time:        0.000032
Connect Time:       0.000134
AppCon Time:        0.000000
Redirect Time:      0.000000
Pre-transfer Time:  0.000167
Start-transfer Time:    0.008889

Total Time:     0.008973
Testing Website Response Time for :http://127.0.0.1:3000/

Lookup Time:        0.000031
Connect Time:       0.000133
AppCon Time:        0.000000
Redirect Time:      0.000000
Pre-transfer Time:  0.000163
Start-transfer Time:    0.002368

Total Time:     0.002498
Testing Website Response Time for :http://127.0.0.1:3000/

Lookup Time:        0.000032
Connect Time:       0.000138
AppCon Time:        0.000000
Redirect Time:      0.000000
Pre-transfer Time:  0.000172
Start-transfer Time:    0.002188

Total Time:     0.002249
Testing Website Response Time for :http://127.0.0.1:3000/

Lookup Time:        0.000032
Connect Time:       0.000117
AppCon Time:        0.000000
Redirect Time:      0.000000
Pre-transfer Time:  0.000150
Start-transfer Time:    0.002379

Total Time:     0.002438

The first call always lasts more than 5 seconds, and then the subsequent ones complete in milliseconds. This shows that the initial request will always have a much larger response delay than any other call, even on localhost. On Cloud Run this appears to be scaled up both in delay (10x more in some cases) and in frequency (the container seems to rerun node server startup every time traffic increases).
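For reference, the phase names in the output above correspond to curl's --write-out variables; a minimal sketch of such a measurement script might look like this (the default URL is a placeholder for a locally running server):

```shell
#!/bin/sh
# Prints per-phase curl timings for one request, matching the output above.
measure_url() {
    url="$1"
    echo "Testing Website Response Time for :${url}"
    curl -o /dev/null -s -w \
'Lookup Time:        %{time_namelookup}
Connect Time:       %{time_connect}
AppCon Time:        %{time_appconnect}
Redirect Time:      %{time_redirect}
Pre-transfer Time:  %{time_pretransfer}
Start-transfer Time:    %{time_starttransfer}

Total Time:     %{time_total}
' "${url}"
}

# Example: measure_url "http://127.0.0.1:3000/"
```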

MiltiadisKoutsokeras commented 1 year ago

UPDATE: We have raised the issue on Google Cloud support and it is escalated. I will notify on the conclusion.

The current status is the following:

  1. Cloud Run is an auto-scaling application container runtime, and it can launch multiple instances of the same Docker container to serve a user session. The only limitation is the configured min/max number of instances: min controls how many instances stay "hot" waiting for HTTP requests, and max is the upper limit.
  2. We have seen in the logs that a single HTTP index page request is not served by a single Cloud Run instance. The list of resources the index page needs to retrieve may be served by multiple Cloud Run instances. This is confirmed by the logs, where each request is mapped to an instance UUID.
  3. Splitting a single index page browser request across multiple instances has the side effect of paying the application startup delay MULTIPLE times, and things get even worse if the instances come from containers launched because of that request (application startup + container start delay).
  4. As shown above, a single local Docker container can have a delay of 5 seconds or more. In the Cloud Run environment this can be multiplied across the launched instances, raising the index page response delay to 1 minute or more.
  5. The delay between calling node server.js and executing the first line of application TypeScript code accounts for the majority of the total delay. I have tried to create a minimal Next app without any extra resources (images, CSS, etc.) and it can still take up to 2 seconds between the node server launch and control reaching TypeScript.
  6. A single instance seems to serve any request within milliseconds once the application has launched.

Any advice on how to bring down the delay between launching the Docker container and running the application code would be highly appreciated. It is a black box for us.

I am astonished that such a simple, small-scope application can be so slow in cloud environments. :(
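One way to look inside that black box, assuming Node 12+, is Node's built-in CPU profiler and module-load debug logging; a sketch (server.js and the output paths are from the Dockerfile above / placeholders):

```shell
# Write a .cpuprofile covering startup; open it in Chrome DevTools
# (Performance tab) to see where the time between "node server.js" and
# the first SSR call is spent.
node --cpu-prof --cpu-prof-dir=/tmp/profiles server.js

# Alternatively, log every module load to stderr to spot slow require() chains:
NODE_DEBUG=module node server.js 2> /tmp/module-load.log
```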

Fredkiss3 commented 1 year ago

@MiltiadisKoutsokeras can you share the repo (or the code) for your minimal reproduction with the corresponding Dockerfile ?

I want to test on my end.

MiltiadisKoutsokeras commented 1 year ago

@Fredkiss3 I will post it here next week as I am on the road now. Thank you.

MiltiadisKoutsokeras commented 1 year ago

@Fredkiss3 Here is the minimal project to reproduce the problem: https://github.com/MiltiadisKoutsokeras/nextjs_issue_48539

You can check the test application deployed in Google Cloud Run here.

BdanB commented 1 year ago

Hi, any updates on an issue?

vspathonis commented 1 year ago

Hi, unfortunately there is no update on the issue at this time.

The problem still persists.

MiltiadisKoutsokeras commented 1 year ago

For all interested parties, we had to migrate the Next.js application out of Google Cloud Run instances as a temporary measure.

One mitigation proposed by Google engineers is to migrate all application static content (Javascript, images, CSS) to a content delivery service to avoid serving them from the Next.js server. I do not know how this could help our use case, as the content is minimal compared to other media rich applications.

rubiagatra commented 1 year ago

Are you using https://firebase.google.com/docs/hosting/frameworks/nextjs? @MiltiadisKoutsokeras, what is your mitigation plan?

lucgagan commented 11 months ago

Even when the Cloud Run instance is configured to be always on (i.e. CPU is always allocated and the minimum number of instances is > 1), the response time is slow... https://ray.run/ is averaging 2 seconds to produce a response.

lucgagan commented 11 months ago

For what it is worth, I found that using larger-memory instances helps improve boot time.

After going from 1 GB to 8 GB, the start time dropped from 2 seconds to ~500 ms.
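For anyone wanting to try the same, resources can be raised on an existing service from the CLI; a sketch with a placeholder service name (verify flags against your gcloud version):

```shell
# Give each instance more memory (and CPU) to speed up boot.
# "my-nextjs-service" is a placeholder.
gcloud run services update my-nextjs-service \
  --memory=8Gi \
  --cpu=2
```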

lucgagan commented 11 months ago

Cross-posted to Reddit https://www.reddit.com/r/nextjs/comments/16nzuad/nextjs_start_time_is_extremely_slow_on_google/

lucgagan commented 11 months ago

After a ton of debugging, it turned out to be a disk IO heavy operation at the start of the service.

rubiagatra commented 11 months ago

*interesting

Fredkiss3 commented 11 months ago

After a ton of debugging, it turned out to be a disk IO heavy operation at the start of the service.

Please explain... Does next do a disk operation at the start of server ? (Like creating files or something like that ?)

lucgagan commented 11 months ago

After a ton of debugging, it turned out to be a disk IO heavy operation at the start of the service.

Please explain... Does next do a disk operation at the start of server ? (Like creating files or something like that ?)

It was specific to my website, because all my blog entries are in Markdown that needs to be read and parsed.
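A common mitigation for this kind of startup cost is a module-scope cache, so the disk-heavy read happens at most once per instance rather than on every request; a minimal sketch, with illustrative names that are not from ray.run's actual code:

```typescript
import { promises as fs } from "fs"

// Cache raw Markdown in module scope: the first request per instance pays
// the disk read, every later request is served from memory.
const markdownCache = new Map<string, string>()

export async function loadMarkdown(path: string): Promise<string> {
    const cached = markdownCache.get(path)
    if (cached !== undefined) {
        return cached
    }
    const raw = await fs.readFile(path, "utf8")
    markdownCache.set(path, raw)
    return raw
}
```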

MiltiadisKoutsokeras commented 1 month ago

Are you using https://firebase.google.com/docs/hosting/frameworks/nextjs? @MiltiadisKoutsokeras, what is your mitigation plan?

I am very sorry for missing your question. We mitigated the issue by hosting the application in a container running on a Cloud Virtual Machine behind a reverse proxy. We run multiple instances, so the startup delay is paid once at first startup and never on an end user's first visit. We will eventually migrate everything to a Kubernetes cluster, which will hopefully show the same behavior as the Virtual Machine. In serverless environments the delay is paid on every new instance start, and it is exaggerated when the start is from a cold container.