
Payload with getPayloadHMR doesn't close or spams db connections during Static Site Generation #7312

Open Dovchik opened 3 months ago

Dovchik commented 3 months ago

Link to reproduction

No response

Payload Version

3.0.0-beta.68

Node Version

v20.11.1

Next.js Version

next@15.0.0-rc.0

Describe the Bug

I'm using Postgres 12, and static generation with getStaticProps combined with the Local API via getPayloadHMR. With 32 statically generated pages I run into a problem where the last few pages are not generated right away: the generation worker terminates and restarts before finishing page generation. On the second screenshot you can see a peak of idle Postgres connections (max 100 for my system); after the peak drops, that's when the workers are terminated, restarted, and the build finishes. This only works after some optimizations I've made, because before that SSG was failing with "too many clients" errors from Postgres. Is this intended behavior, meaning I should scale connections as the number of my pages grows, or is something going wrong with how connections are managed in the Local API? (screenshots: build output and Postgres connection peak)

Reproduction Steps

When I have time I may try to create an example repo, but here's a rough reproduction:

  1. Get a Postgres db and set max_connections to 100 or less.
  2. Create a bunch of pages.
  3. Try to generate with next build.

Adapters and Plugins

db-postgres

jmikrut commented 3 months ago

Hey @Dovchik β€” let us check this out on our end and get back to you

Dovchik commented 3 months ago

Great, thanks @jmikrut This seems to be related: https://github.com/payloadcms/payload/issues/7254#issue-2420561994 I'd also encountered connection termination on Vercel, but have only seen it once, as I mostly rely on SSG, which seems to start breaking when you scale up the number of pages.

jmikrut commented 3 months ago

OK so here's some more info.

I just did a test with over a hundred pages to statically generate:

import type { Page as PageDoc } from 'payload-types'
import { getPayloadHMR } from '@payloadcms/next/utilities'
import config from '@payload-config'

const Page: React.FC<{ params: { id: string } }> = async ({ params }) => {
  const payload = await getPayloadHMR({ config })

  const page: PageDoc = await payload.findByID({
    collection: 'pages',
    id: params.id,
  })

  console.log(`page ${page.title} rendering`)

  return page.title
}

export default Page

export const generateStaticParams = async () => {
  const payload = await getPayloadHMR({ config })

  const query = await payload.find({
    collection: 'pages',
    limit: 200,
  })

  return query.docs.map((doc) => ({
    id: doc.id,
  }))
}

I've also added a quick console.log('INITIALIZED') to my Payload config's onInit which will log each time that Payload is freshly initialized (thus creating a new database connection).
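For anyone who wants to replicate that check, here is a minimal sketch of the kind of onInit log being described (the buildConfig import path assumes the 3.x beta; collections, db adapter, and the rest of the config are omitted):

import { buildConfig } from 'payload'

export default buildConfig({
  // ...collections, db adapter, secret, etc.
  onInit: async (payload) => {
    // Logs once per fresh Payload initialization, i.e. once per new database connection
    console.log('INITIALIZED')
  },
})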

At build time, I see all of the pages generated (each successfully console-logging the page title), but I only get 8 INITIALIZED logs.

Why is there more than one INITIALIZED message? Because Next.js tries to run different build steps in different Node.js threads so that work can be done in parallel. That means that things like generateStaticParams will be run in a separate thread, needing to initialize Payload again.

But the INITIALIZED messages don't scale linearly (or at all) with how many pages I build. I could build more, and the INITIALIZED count does not go up. That tells me that our internal Payload init memoization layer is working as intended, so something else must be happening with your config / setup.

Can you create a minimally reproducible example so that I can test further?

On a separate note, we're going to try and look into properly disconnecting the database once the process is terminated. It might just be that the connections are not being properly "cleaned up" and staying open, which might add up over time (especially in production). They should idle and timeout, but we can probably do something about that. Hang tight for that. In the meantime, I'd love to see if you are actually seeing a Payload initialization per static page you're generating. That would be helpful!
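Purely as an illustration of what that cleanup could look like (this is not an official API — it assumes the Postgres adapter exposes its underlying pg pool as payload.db.pool, as referenced later in this thread):

import { getPayloadHMR } from '@payloadcms/next/utilities'
import config from '@payload-config'

const payload = await getPayloadHMR({ config })

// Best-effort cleanup when the build worker exits: end the pg pool so idle
// clients are released instead of lingering until the server-side timeout.
process.on('beforeExit', () => {
  const pool = (payload.db as any).pool
  if (pool && typeof pool.end === 'function') {
    void pool.end()
  }
})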

Dovchik commented 3 months ago

@jmikrut I think I have found the culprit. I've created a repo where I got the same behavior I described before; here is the link: https://github.com/Dovchik/reproduce-connection-overflow/tree/main Everything was building smoothly, but then I added generateMetadata, which also fetches getPayloadHMR, and the error happened: https://github.com/Dovchik/reproduce-connection-overflow/blob/3c61431638384aa3dff5a75c2af243c2a607f0c7/src/app/%5Blocale%5D/pages/%5Bslug%5D/page.tsx#L11

jmikrut commented 3 months ago

@Dovchik nice, ok, we're on it

denolfe commented 3 months ago

Hey @Dovchik , thanks for the repo and helping get to the bottom of this.

I pulled it down and ran it. On build, it peaked at ~45 connections - still not great but not 100 like you are seeing. Started at 11, ended at 14.

What version of PG are you running?

We're looking into some connection optimization now.

Dovchik commented 3 months ago

Hey @denolfe , it's Postgres v12.18. Now the build is going well, peaking at 45 connections for a different project where I initially encountered the problem, and just 5 connections for the repo I posted 😕

For background, I'm running a Hyper-V Linux VM with 30 virtual threads, as my machine has 32 available. I just did an experiment where I reduced the virtual thread count available to the VM from 30 to 8 and the build came through okay. Now, setting it back to 30, the build is also okay ❓ 😕

Also, @jmikrut, for the repo I posted I get 27 INITIALIZED logs in async onInit(payload) with 30 threads, and just 7 when the thread count is 8 - which makes sense, meaning Node uses the threads to generate in parallel: more threads, more parallel processes, more connections.

(screenshots: INITIALIZED counts with 30 threads vs. 8 threads)

Anyway, my workaround for local builds is reducing the thread count, but that's far from ideal, as I need to restart the VM each time I need multi-thread performance for my other work.

Also, if I encounter the problem again I will try to reproduce it again; I'll let you know if I have any success.

Dovchik commented 3 months ago

Anyway, the bug persists: monitoring my Vercel deployments shows around 50 active connections, which are not dropped.

denolfe commented 3 months ago

Hey @Dovchik , how are you monitoring connections?

I went ahead and deployed the repo to Vercel and connected it up to Vercel Postgres. However, I was only able to see the connections top out at ~20. I was monitoring by running this query every second:

SELECT pid, usename, application_name, client_addr, backend_start
FROM pg_stat_activity;

Anything I might be missing that you're doing differently?

Dovchik commented 3 months ago

Hey @denolfe, I use the Dashboard in DBeaver for that. Sorry for the confusion, the Vercel connection count is for my other project, not the repo I posted. When I open a page the connection count starts to rise, but after some time it drops again; depending on what I do (go to the dashboard or other places), it may rise to around 50. If I try to deploy at that moment, I get a failing build as too many connections are floating around, and the site starts throwing 500s.

denolfe commented 3 months ago

Unfortunately, I'm still not able to recreate. I think a solid reproduction is needed at this point.

In the meantime, you could play with a few values that may help:

const nextConfig = {
  experimental: {
    workerThreads: false,
    cpus: 2,
  },
}
laurent-boundary commented 3 months ago

Hi, we are experiencing a similar issue with the DB connections increasing. I wanted to note that we don't use getPayloadHMR, and we're also not using version 3 but 2 with db-postgres:

"payload": "2.14.1",
"@payloadcms/db-postgres": "0.8.5",

With the adapter:

db: postgresAdapter({
    // Ref: https://payloadcms.com/docs/database/postgres#options
    pool: {
      connectionString: env.DATABASE_URI,
      max: env.POOL_MAX_CONNECTIONS, // Default 100 (we tried 1, not working)
      idleTimeoutMillis: 10000,
    },
  }),
denolfe commented 3 months ago

@laurent-boundary Is this during development, when building, or deployed? If deployed, what platform?

laurent-boundary commented 3 months ago

We have containerized Payload and Postgres with Docker for development; in the local environment we see 17 connections with one user. In prod we quickly see the number of connections ramp up to 100 (it seems the 100-connection limit works, but a limit of 1 does not).

Payload is running on node:20.12-alpine, we build it like so:

"build:payload": "yarn payload build",
"build:server": "tsc",

On prod we deploy to AWS this image on a EC2 Fargate.

DB wise, for dev we spin up this docker image postgres:16-alpine, on AWS we have an RDS with postgres.

Once we reach the limit the server sends a 429 Too Many Requests status or just becomes unresponsive.

All of that happens while running, not building (we use version 2 of Payload, not v3), which is why I'm not sure if it's fully related.

erdidoqan commented 3 months ago

We have a similar issue and it is critical enough that it prevents us from using Payload. We are using Payload v3, PostgreSQL, and Vercel. The issue occurs when we send about 20 parallel requests to Payload. We see "too many clients already" errors in the logs. You can reproduce the issue by sending different numbers of requests. Even a media collection page that sends many requests for images can produce this issue.

jmikrut commented 3 months ago

For everyone in this thread - we are continuing to look into this. One thing to try in the meantime is to see if the excess connections are caused by transactions.

Can anyone here try to disable transactions by passing transactionOptions: false to your postgresAdapter? We will also try this on our end today.
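For reference, a minimal sketch of where that option goes on the 3.x Postgres adapter (the connection string variable name is just an example):

import { postgresAdapter } from '@payloadcms/db-postgres'

export const db = postgresAdapter({
  pool: {
    connectionString: process.env.DATABASE_URI,
  },
  // Disables Payload's per-request transactions, which we suspect may be holding connections open
  transactionOptions: false,
})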

abid-salahi commented 3 months ago

I will try the transactionOptions: false on my site to see if it resolves anything. I've also been running into this issue on a static blog page setup where it creates 400+ connections after navigating to a statically generated blog page. The build process on Vercel does not give errors.

Note I'm using the standard MongoDB connection setup and fetching blog metadata from payload during the generateMetadata step.

laurent-boundary commented 3 months ago

@jmikrut transactionOptions does not exist for the Postgres adapter, it seems; at least TypeScript doesn't allow it. (screenshot)

jmikrut commented 3 months ago

@laurent-boundary you will likely need to update your @payloadcms/db-postgres version in order to get access to the transactionOptions property

laurent-boundary commented 3 months ago

We recently updated to the latest version, which still seems to be the latest; that was our first idea to fix the problem. (screenshots)

I just updated Payload to the latest 2.25.0 version as well. (screenshot)

While I still couldn't get TypeScript to accept it, I tried to force it with ts-ignore. (screenshots)

I then tried to refresh the page a few times and navigate through collections (not even adding anything to the DB) and still got a spike of connections (locally, with one user: myself). (screenshot)

jmikrut commented 3 months ago

Ah @laurent-boundary if TS says that property isn't there then it definitely isn't gonna work. I think I led you astray earlier - transactionOptions was released solely for the 3.x Postgres adapter.

All of that happen while running, not building, (we use version 2 of Payload not v3) that's why I'm not sure if it's fully related.

I think this is correct. It is likely not related to any v2 -> v3 changes and is likely more linked to transactions in general. Let us backport transactionOptions to 2.x so that you can test with that. Would be good to know if that fixes it for you!

laurent-boundary commented 3 months ago

Amazing, I'll try that when ready :) Thanks πŸ™

laurent-boundary commented 3 months ago

@jmikrut in the meantime I created a repro for Payload v2; I used npx create-payload-app@latest with the blank and postgres options. Multiple things to note:

denolfe commented 3 months ago

Hey @laurent-boundary , thank you for the reproduction repo.

Running your docker-compose file with Payload removed produces a baseline connection count of 6.

When adding payload back into the mix and clicking around, I was only able to bring the connection count up to 9. This makes me think this is a non-issue (for v2 at least) and just a misunderstanding of the baseline connection count.

That being said, the pool options of the db adapter can accept a max value if you also want to pass that. This is what it would look like. You'll need at least 2 from my testing.

  db: postgresAdapter({
    pool: {
      connectionString: process.env.DATABASE_URI,
      max: 2,
    },
  }),
laurent-boundary commented 3 months ago

In production on AWS we use Postgres through RDS (an AWS managed DB, not Docker); we spike at 100 connections, and I believe that's definitely linked to Payload. I'll check that and extend the repro until I see something similar. We tried max: 1 unsuccessfully, but we will try 2 and see.

laurent-boundary commented 3 months ago

I tried max: 2; it works for a bit, and then when trying to access the admin page we get infinite loading.

dmytro-lishchenko commented 3 months ago

any update here?

Seems like it's related: I use an S3 bucket. If I disable images in Chrome settings, the connection count stays at ~5 all the time. But once images are enabled, simply opening the website causes 300+ connections to MongoDB. On my page I fetch globals with a few objects and ~4-5 images, plus 2 collections with 6+8 images.

here is my config:

export default buildConfig({
  admin: {
    user: Users.slug
  },
  collections: [
    Users,
    Media,
    Category,
    Products,
    Orders,
  ],
  globals: [Settings],
  editor: lexicalEditor(),
  localization: {
    locales: ['en', 'uk'],
    defaultLocale: 'uk',
    fallback: true
  },
  i18n: {
    supportedLanguages: { en, uk },
    defaultLanguage: 'uk'
  },
  secret: process.env.PAYLOAD_SECRET || '',
  typescript: {
    outputFile: path.resolve(dirname, 'payload-types.ts')
  },
  db: mongooseAdapter({
    url: process.env.DATABASE_URI || '',
    connectOptions: {},
    transactionOptions: false
  }),
  maxDepth: 2,
  sharp,
  plugins: [
    s3Storage({
      collections: {
        media: true
      },
      bucket: process.env.S3_BUCKET,
      config: {
        credentials: {
          accessKeyId: process.env.S3_ACCESS_KEY_ID,
          secretAccessKey: process.env.S3_SECRET_ACCESS_KEY
        },
        region: process.env.S3_REGION
      }
    })
  ]
})

Update: I modified the connect options and it reduced the connection count from ~400 to ~200 on first load. Still far from good, but at least, I hope, it will prevent reaching the limit of 500 connections. Worth mentioning: my website started to load much faster.

connectOptions: {
      serverSelectionTimeoutMS: 5000,
      socketTimeoutMS: 15000,
      maxIdleTimeMS: 10000,
      connectTimeoutMS: 5000,
      minPoolSize: 1,
      maxPoolSize: 5,
      maxConnecting: 2,
      waitQueueTimeoutMS: 5000
    }
jmikrut commented 3 months ago

Hey @dmytro-lishchenko β€” this is interesting. So you have transactions disabled, but you're still seeing connection spikes?

Couple questions:

  1. Where are you deploying Payload?
  2. How much traffic does your site get?
  3. Are you statically rendering anything? Or are you dynamically rendering everything?

If you're on Vercel or a similar serverless environment, you should be using maxIdleTimeMS as you found, which will terminate unused connections after the specified period of inactivity. This is probably what is helping your issue. You could probably also reduce the maxPoolSize further down from 5 as well.
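As a purely illustrative sketch of that advice (using the mongooseAdapter from '@payloadcms/db-mongodb' and standard MongoDB Node driver pool options passed through connectOptions):

import { mongooseAdapter } from '@payloadcms/db-mongodb'

export const db = mongooseAdapter({
  url: process.env.DATABASE_URI || '',
  connectOptions: {
    maxIdleTimeMS: 10_000, // terminate unused connections after 10s of inactivity
    maxPoolSize: 2, // keep each serverless instance's pool small
  },
})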

jmikrut commented 3 months ago

OK so we've done a LOT of digging on this and have come up with the following learnings.

The number one most important thing to get to the bottom of is to find out if Payload is connecting to the database multiple times.

You should be able to determine this by adding a console.log to your config.onInit method, indicating "INITIALIZED" or similar. Everything that we've tried thus far shows that Payload is, as expected, only connecting to the database when it needs to and is already properly caching database connections. You might see that your process does indeed connect multiple times, but that's probably due to multiple CPU threads or Next.js inner-workings as I described here.

Outside of what Payload is already doing, I really don't think that there's much we can do further in order to cache the connection / ensure that it is re-used accordingly.

That being said, we still want to provide solutions to everyone here, and our solutions will depend on where and how you are hosting your project.

Vercel / Serverless hosts

On Vercel or any serverless host, with traditional database connections, it's relatively normal to see a bunch of connections that scale linearly based on how many serverless functions are started for your application. Each serverless function will spawn its own connection, and as your site has more and more lambda functions spun up to serve traffic, you'll see more database connections. This is what I think is the case with @dmytro-lishchenko and anyone else in this thread experiencing this issue while deploying to Vercel.

Some database vendors like Supabase and, I believe, Vercel Postgres do provide mechanisms to handle this, such as pgBouncer and similar.

What we plan to do for serverless hosts is to create serverless-friendly HTTP adapters, which will mitigate this by not opening a traditional database connection and solely fetching data from HTTP rather than going through TCP / similar.

Then, if you deploy on Vercel, you should use a serverless-friendly DB, or use one of our HTTP adapters.

Long-running servers still experiencing this problem

Some of you in this thread appear to be experiencing these connection counts increasing while deploying on traditional services like AWS Fargate (@laurent-boundary) and honestly this completely baffles me. I would LOVE to get to the bottom of this. We can't recreate any of these scenarios though and will need a solid reproduction in order to help here, as all our tests show a stable connection count which does not rise. Note that it's normal to have more than 1 connection in a long-running server but the connection number should not rise exponentially.

@laurent-boundary do you have a way of knowing how many Fargate containers are spun up to handle your traffic? If you have lots of Fargate containers, your issue may be similar to the Vercel / serverless information I just described above. Connections will scale linearly with containers and if you have many, you may need to convert to an HTTP adapter like I described above as well.

Tips to mitigate this

As @dmytro-lishchenko pointed out, you can largely limit the amount of connections that are created by fine-tuning your connection options (like mentioned here). I would suggest that anyone experiencing this tries to dial in their connection options to see if that helps put a lid on the connection count until we can further diagnose.

Conclusion

What we are most interested in at this point is a consistent reproduction of connection counts spiking on long running servers - i.e. not Vercel or serverless hosts because that is relatively expected. We will keep this issue open until we can release HTTP database adapters for serverless environments.

More from us shortly!

bm-spencer commented 3 months ago

@jmikrut Hi, I am in the same team with Laurent and I'll follow this issue.

First of all, thank you for the reply.

We only use a single container at the moment on AWS; our setup for this application is a single cluster with a single service running a single task, and we don't have much traffic in our web application, as we only had a single client in PROD using it when we experienced the issue. The highest number of concurrent users we had was fewer than 20, when everyone at the office was using it in STG.

We tweaked the connection configuration a few times, but every time it still reached the max pool size and connection count we had set; it only took more time to get to that point, but in the end it got there, and when that happens we get the behavior Laurent described where the application stops responding because the DB connections reached the max.

jmikrut commented 3 months ago

Hey @bm-spencer that sounds bad. Absolutely no clue what would be causing that.

Can you try to set this property on your pool config options to see if that resolves the issue?

idleTimeoutMillis: 30000

Going to continue looking into this. Are you deploying only Payload 2.0, without Next.js running on the same server? Or are you deploying Next.js as well?

Do you have anything crashing your Node service, which might cause it to restart and then re-connect, but not release the existing connection?

bm-spencer commented 3 months ago

@jmikrut Hey, we've already set that to idleTimeoutMillis: 10000.

This is our config.

  db: postgresAdapter({
    // Ref: https://payloadcms.com/docs/database/postgres#options
    pool: {
      connectionString: env.DATABASE_URI,
      max: env.POOL_MAX_CONNECTIONS, // 500; 
      idleTimeoutMillis: 10000,
    },
  }),

We set POOL_MAX_CONNECTIONS to 500 as lower values make the application crash earlier when they reach the max pool size and the connections are not closed in time.

Our server.js looks something like this.

// server.js
import express from 'express';
import payload from 'payload';

const app = express();

app.use(/* middlewares */)

const start = async (): Promise<void> => {
  await payload.init({
    secret: env.PAYLOAD_SECRET,
    express: app,
    onInit: () => {
      payload.logger.info(`Payload Admin URL: ${payload.getAdminURL()}`);
    }
  });

  app.listen(env.PORT);
};

start();

We are using

  // package.json
    "@payloadcms/bundler-webpack": "1.0.7",
    "@payloadcms/db-postgres": "0.8.5",
    "@payloadcms/plugin-sentry": "0.0.6",
    "@payloadcms/richtext-slate": "^1.0.0",
    "drizzle-kit": "0.22.7",
    "drizzle-orm": "0.31.2",
    "express": "^4.17.1",
    "payload": "2.14.1",

and we run the server with this command: yarn serve

  // package.json
  "serve": "cross-env PAYLOAD_CONFIG_PATH=dist/payload.config.js NODE_ENV=production node dist/server.js"

So I don't think it would restart the server if it crashes.

danielfrenda commented 3 months ago

Hey @jmikrut we're having similar issues in production to @bm-spencer, on a very similar setup both infrastructure-wise (ECS + AWS RDS) and version-wise. While trying to get to the bottom of our problems we added a log of the pool waiting size (payload.db.pool.waitingCount) to our healthcheck endpoint (which performs a simple count of user records). This sits at 0, but once or twice a day it inexplicably skyrockets to thousands before the checks fail and a new instance is deployed.

coagmano commented 3 months ago

Alongside seeing db connection spam (I work with @danielfrenda), we have also seen some issues with unclosed transactions / idle-in-transaction sessions that could be contributing to the pool freezing up and our instance becoming unresponsive.

(screenshot: idle-in-transaction sessions)

Could also be related to whatever is happening here?

danielfrenda commented 3 months ago

To add to the above, the default idle transaction timeout on RDS is 24 hrs. When this is reached and the transaction is terminated, the resulting error is not caught by Payload and will cause the process to exit. This is also the case if those transactions are killed manually.

greenlover1991 commented 2 months ago

(our situation is different from OP, apologies for hijacking)

We also encountered a memory leak issue when connecting to a MongoDB replica set, but not when connected to a non-replica-set deployment. What fixed our issue (as a workaround) was downgrading the self-hosted MongoDB server from v7.0.0 to v6.0.9.

Before we arrived at the fix:

  1. We tried disabling transactions by setting transactionOptions: false; it did not fix the memory leak, but it did fix the "too many open connections" issue.
  2. Our onInit logs were only printing once, so the DB adapter did not create multiple connections.
  3. We tried shortening the timeouts like above, but this only increased the rate of memory churn, so we reverted.
  4. Saw this compatibility table from Mongoose and noticed that @payloadcms/db-mongodb: 1.7.1 is using "mongoose": "6.12.3", which is only compatible up to MongoDB server 6.x.

Here's our current working setup:

DB Server: MongoDB v6.0.9

  // package.json
  "dependencies": {
    "@payloadcms/db-mongodb": "^1.7.1",
    "payload": "^2.25.0",
  }  
// payload.config.ts
export default buildConfig({
  // ...
  collections: [Users, Posts],
  db: mongooseAdapter({
    url: 'mongodb://host1.mydomain.com:60200,host2.mydomain.com:60200,host3.mydomain.com:60200/myDbName',
    connectOptions: {
      auth: {
        username: 'myUsername',
        password: 'myPassword',
      },
      authSource: 'myAuthSource',
      tls: true,
      tlsCertificateFile: process.env.TLS_CERTIFICATE_FILE,
      tlsCertificateKeyFile: process.env.TLS_CERTIFICATE_KEY_FILE,
      tlsCAFile: process.env.TLS_CA_FILE,
      replicaSet: process.env.REPLICA_SET_NAME,
    },
    transactionOptions: false,
  }),
});

Here's a graph of the memory usage (the circled region is when the MongoDB version downgrade was applied).

andershermansen commented 2 months ago

Payload is using mongoose 6.12.3

Mongoose 6.12.4's release notes mention upgrading the MongoDB driver to 4.17.2: https://github.com/Automattic/mongoose/blob/53d382b0f9df9590fcb1c1468de38dbeb56b7f03/CHANGELOG.md#6124--2023-12-27

MongoDB driver 4.17.2's release notes mention a connection leak issue: https://github.com/mongodb/node-mongodb-native/releases/tag/v4.17.2

Not sure if it's relevant, but wanted to mention it.

We also see high connection usage with our MongoDB instances at mongodb.com, but have not narrowed it down.

WilsonLe commented 2 months ago

Same problem here: the number of connections keeps increasing with every request and does not seem to go down. If I terminate them (manually or via an idle timeout), Payload exits with the following error.

error: terminating connection due to administrator command
    at u.parseErrorMessage (/.next/server/chunks/416.js:139:49786)
    at u.handlePacket (/.next/server/chunks/416.js:139:46444)
    at u.parse (/.next/server/chunks/416.js:139:45090)
    at Socket.<anonymous> (/.next/server/chunks/416.js:139:41858)
    at Socket.emit (node:events:519:28)
    at addChunk (node:internal/streams/readable:559:12)
    at readableAddChunkPushByteMode (node:internal/streams/readable:510:3)
    at Readable.push (node:internal/streams/readable:390:5)
    at TCP.onStreamRead (node:internal/stream_base_commons:191:23)
    at TCP.callbackTrampoline (node:internal/async_hooks:130:17) {
  length: 116,
  severity: 'FATAL',
  code: '57P01',
  detail: undefined,
  hint: undefined,
  position: undefined,
  internalPosition: undefined,
  internalQuery: undefined,
  where: undefined,
  schema: undefined,
  table: undefined,
  column: undefined,
  dataType: undefined,
  constraint: undefined,
  file: 'postgres.c',
  line: '3304',
  routine: 'ProcessInterrupts'
}
Error: Connection terminated unexpectedly
    at h.<anonymous> (/.next/server/chunks/416.js:203:15615)
    at Object.onceWrapper (node:events:633:28)
    at h.emit (node:events:519:28)
    at Socket.<anonymous> (/.next/server/chunks/416.js:203:27108)
    at Socket.emit (node:events:519:28)
    at TCP.<anonymous> (node:net:338:12)
    at TCP.callbackTrampoline (node:internal/async_hooks:130:17)

I'm running the program with pnpm build && pnpm start on a bare-metal server using pm2.

After a little digging I realized that I had a collection hook that creates its own transaction (separate from the request's transaction), and I forgot to commit it in all cases (there were premature returns in the hook, and I did not commit the transaction when it returned early).

const separateTransaction = async ({
  doc, // full document data
  payload,
}: {
  doc: AppUserLog;
  payload: Payload;
}) => {
  const req = {
    payload,
    transactionID: await payload.db.beginTransaction(),
  } as PayloadRequest;
  try {
    if (true) {
        // premature return without committing
        return;
    }
    if (
      req.transactionID &&
      typeof req.payload.db.commitTransaction === "function"
    ) {
      await req.payload.db.commitTransaction(req.transactionID);
    }
  } catch (e) {
    req.payload.logger.error(e);
    if (
      req.transactionID &&
      typeof req.payload.db.rollbackTransaction === "function"
    )
      await req.payload.db.rollbackTransaction(req.transactionID);
  }
};
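For anyone hitting the same thing, here is a sketch of how such a hook could be restructured so the manually-opened transaction is always settled, even on an early return (AppUserLog is a placeholder type here; the imports assume Payload 3.x):

import type { Payload, PayloadRequest } from 'payload'

type AppUserLog = Record<string, unknown> // placeholder for the generated doc type

const separateTransaction = async ({
  doc,
  payload,
}: {
  doc: AppUserLog
  payload: Payload
}) => {
  const transactionID = await payload.db.beginTransaction()
  const req = { payload, transactionID } as PayloadRequest
  let failed = false
  try {
    // ...do the hook's work with `req` and `doc` here; early returns are now safe...
  } catch (e) {
    failed = true
    payload.logger.error(e)
  } finally {
    // Always settle the transaction so the underlying connection is released
    if (transactionID) {
      if (failed && typeof payload.db.rollbackTransaction === 'function') {
        await payload.db.rollbackTransaction(transactionID)
      } else if (!failed && typeof payload.db.commitTransaction === 'function') {
        await payload.db.commitTransaction(transactionID)
      }
    }
  }
}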
tiago-meireles commented 2 months ago

@coagmano I'm dealing with the exact same issue you described above with the idle transactions and the unresponsive CMS. Did you ever find a way to prevent/mitigate this from happening?

coagmano commented 2 months ago

@tiago-meireles We worked around the issue by setting query timeouts in the config for the Postgres adapter and then handling the error when the connection is killed.

db: postgresAdapter({
    migrationDir: path.resolve(__dirname, 'migrations'),
    pool: {
      application_name: 'payload-cms',
      connectionString: process.env.PAYLOAD_POSTGRES_URI,
      idleTimeoutMillis: 20000,
      idle_in_transaction_session_timeout: 20000,
      query_timeout: 20000,
      // @ts-expect-error - lock is not defined in the types - however supported in pg library
      lock_timeout: 10000,
      max: 20,
    },
}),

The timeout that actually resolved this specific issue was idle_in_transaction_session_timeout.

But it was still throwing an unhandled error from within pg which would kill the server, so we also needed to add a global unhandled exception handler that didn't kill the process for these exceptions.

process.on('uncaughtException', (error) => {
  if (error.message && error.message.includes('Connection terminated unexpectedly')) {
    // Exception when Postgres terminates the connection due to idle_in_transaction_session_timeout
    console.error('PostgreSQL connection error:', error);
  } else if (error.message && error.message.includes('idle-in-transaction')) {
    // Exception when pg's idle_in_transaction_session_timeout expires
    console.error('PostgreSQL idle-in-transaction error:', error);
  } else {
    console.error('Uncaught Exception thrown', error);
    process.exit(1);
  }
});

We also did some optimization of db queries for RichText / LexicalHTML fields by skipping the HTML serialization hooks in GraphQL queries unless the HTML field was specifically requested. This allowed the sitemap to render in 280 - 360ms instead of 1.1 - 4.4+ seconds - requiring an order of magnitude fewer db connections. Here's a snippet from our util function for defining CollectionConfigs

const serializedHTMLField = lexicalHTML(field.name, { name: `${field.name}_html` });
// ...
        {
          ...serializedHTMLField,
          required: field.required,
          defaultValue: 'placeholder',
          validate: () => true,
          hooks: {
            afterRead: [
              async (args) => {
                // Prevent serializing HTML field in GraphQL API unless explicitly requested
                if (
                  args.req.payloadAPI === 'GraphQL' &&
                  !args.req.body?.query.includes(serializedHTMLField.name)
                ) {
                  return null;
                }
                // Directly call the afterRead hook to serialize the HTML field
                return serializedHTMLField.hooks?.afterRead?.[0]?.(args);
              },
            ],
          },
        } as TextField,

And finally, we now run 2 instances of payload on ECS with the minimum allowed health check duration, so if it does die the failover happens faster.


After a little digging I realized that I had a collection hook that creates its own transaction (separate from the request's transaction), and I forgot to commit it in all cases (there were premature returns in the hook, and I did not commit the transaction when it returned early).

I double-checked after reading this, and we don't access payload.db (except to log the size of the query queue) or do any separate transaction handling ourselves. So in our case it appears that the unclosed transactions are from Payload or Drizzle.

arime-pareto commented 1 month ago

Hi, we have the same issue in production (deployed on Vercel, latest beta, using a free MongoDB Atlas cluster). The connections in the screenshot below appear when I just load our landing page and browse a couple of other content pages. This obviously impacts site responsiveness negatively, not to mention hitting the connection limits in MongoDB Atlas.

(screenshot: MongoDB Atlas connection count)

Has any MongoDB user figured out a workaround for this issue? Or is there any news on the HTTP adapters mentioned in @jmikrut's post?

If we deploy the software in Google Cloud Run, we have no issues, but we'd like to use Vercel instead.

dmytro-lishchenko commented 1 week ago

any update?

bm-spencer commented 1 week ago

We found a potential issue with virtual field usage in our case. We have nested relationships that use virtual fields to look up the relations between documents. We implemented this based on this post: https://payloadcms.com/blog/learn-how-virtual-fields-can-help-solve-common-cms-challenges

Every time an API request is made, the virtual field lookups are triggered, which kicks off the recursive DB queries instantly and hits the max connection count. Once we disabled the virtual fields, the connection count seems stable.
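For context, a hypothetical sketch of the virtual-field pattern being described (the collection, field, and relation names are invented; the import path assumes Payload 2.x), showing why each document read fans out into extra DB queries:

import type { CollectionConfig } from 'payload/types'

export const Orders: CollectionConfig = {
  slug: 'orders',
  fields: [
    {
      name: 'customer',
      type: 'relationship',
      relationTo: 'customers',
    },
    {
      // "Virtual" field: computed on every read, never meant to be stored
      name: 'customerName',
      type: 'text',
      hooks: {
        afterRead: [
          async ({ req, siblingData }) => {
            const customerID =
              typeof siblingData?.customer === 'object'
                ? siblingData.customer?.id
                : siblingData?.customer
            if (!customerID) return null
            // One extra DB query per document read; nested virtual fields compound this
            const customer = await req.payload.findByID({
              collection: 'customers',
              id: customerID,
            })
            return (customer as { name?: string })?.name ?? null
          },
        ],
      },
    },
  ],
}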