supabase / supavisor

A cloud-native, multi-tenant Postgres connection pooler.
https://supabase.github.io/supavisor/
Apache License 2.0

Unclear "Terminating with reason" errors #344

Closed: haydn closed this issue 3 months ago

haydn commented 4 months ago

Bug report

Describe the bug

In the "Pooler" logs in the Supabase dashboard I'm seeing errors like this:

DbHandler: Terminating with reason {{:timeout, {:partisan_gen_statem, :call, [#PID<0.3225.2235>, {:db_call, #PID<48802.1851.3665>, <<83, 0, 0, 0, 4>>}, 15000]}}, {:partisan_gen_statem, :call, [#PID<48802.1851.3665>, {:client_call, <<49, 0, 0, 0, 4, 50, 0, 0, 0, 4, 84, 0, 0, 1, 98, 0, 11, 105, 100, 0, 0, 0, 115, 155, 0, 1, 0, 0, 11, 134, 0, 16, 255, 255, 255, 255, 0, 0, 95, 100, 101, ...>>, :continue}, 30000]}} when state was :busy

I don't understand what this is trying to tell me. Has something timed out?

In our app we're seeing this as "Connection terminated unexpectedly" errors being emitted by node-postgres.
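
For context, the client set-up on our side is roughly the sketch below (simplified, not our exact code; the env var name, function name, table name, and query are placeholders). The error shows up both on the pool's error event and as rejected in-flight queries:

import { Pool } from "pg";

// Placeholder: in our app this points at the Supavisor transaction-mode
// connection string from the dashboard (the ...pooler.supabase.com:6543 URL).
const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,
});

// Connections dropped while a client is sitting idle in the pool surface here...
pool.on("error", (err) => {
  console.error("idle client error:", err.message);
});

export async function loadPageData(userId: string) {
  // ...and queries that are in flight when the connection is terminated
  // reject with "Connection terminated unexpectedly".
  const { rows } = await pool.query(
    "SELECT * FROM some_table WHERE user_id = $1", // placeholder query
    [userId]
  );
  return rows;
}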

To Reproduce

I have no idea how to reproduce this reliably because I don't understand what the error is. I would love to be able to reproduce it.

Expected behavior

No error.

Failing that, a clearer error message and/or documentation explaining what this one means.

Screenshots

https://github.com/supabase/supavisor/issues/344#issuecomment-2157338747

System information

N/A

Additional context

I opened a support ticket about this 2 days ago, but I'm still waiting for a response from the "pooler team".

Response from the pooler team:

This log means that the client has initiated a disconnect, but the linked direct connection was not in an idle state.

Related discussion:

https://github.com/orgs/supabase/discussions/26879#discussion-6750486
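
If I'm reading that explanation right, it describes our side closing its socket while the linked database connection is still busy with a query. For what it's worth, our route handlers follow roughly the pattern below (a simplified sketch, not our real code; the query is a placeholder), so I'm not sure where an early disconnect would be coming from:

import { Pool } from "pg";
import type { NextApiRequest, NextApiResponse } from "next";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// In a serverless function, a query that's still in flight when the response
// is sent (or when the runtime freezes the function) can be cut off
// mid-operation, which presumably looks to the pooler like a client
// disconnect on a busy connection. So we always await the query and release
// the client before responding.
export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  const client = await pool.connect();
  try {
    const { rows } = await client.query("SELECT 1 AS ok"); // placeholder query
    res.status(200).json(rows);
  } finally {
    client.release();
  }
}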

haydn commented 4 months ago

I still don't know what the actual problem is, but I've been forced to bypass the pooler altogether and connect directly to the database. That has completely solved the error we were seeing in our app, which makes me think there's an underlying issue in Supavisor causing this.
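
For reference, the workaround on our side is just a connection-string swap, roughly like the sketch below (the env var names are placeholders, and the commented URLs show the general shape rather than our real values):

// Supavisor pooler (transaction mode), what we had been using:
//   postgresql://postgres.<project-ref>:<password>@aws-0-<region>.pooler.supabase.com:6543/postgres
// Direct connection, what we've switched to for now:
//   postgresql://postgres:<password>@db.<project-ref>.supabase.co:5432/postgres
const usePooler = process.env.USE_POOLER === "true";

export const connectionString = usePooler
  ? process.env.POOLER_DATABASE_URL
  : process.env.DIRECT_DATABASE_URL;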

dedlockdave commented 4 months ago

Yes, I am having the exact same issue, which started popping up two days ago with my Python app. I am using sqlalchemy = "2.0.30" and asyncpg = "0.29.0".

The connection is established and then dies with asyncpg.exceptions.ConnectionDoesNotExistError: connection was closed in the middle of operation.

The connection is only established successfully some of the time, which has made the problem quite difficult to track down.

haydn commented 4 months ago

@abc3 Have you had a chance to investigate this at all? We're still using a direct connection to Postgres at the moment (because it's more reliable than Supavisor), but our app is barely scraping by with that set-up because of the connection limit.

SkycladObserver commented 3 months ago

We're also experiencing a similar issue:

"ClientHandler: socket closed with reason {:timeout, {:partisan_gen_statem, :call, [#PID<50041.23646.8099>, {:db_call, #PID<0.1508.1393>, <<80, 0, 0, 3, 116, 0, 105, 110, 115, 101, 114, 116, 32, 105, 110, 116, 111, 32, 34, 109, 101, 115, 115, 97, 103, 101, 95, 103, 101, 110, 101, 114, 97, 116, 105, 111, 110, 34, 32, 40, ...>>}, 15000]}}, DbHandler {#PID<50041.23646.8099>, {:error, {:exit, {:timeout, {:partisan_gen_statem, :call, [#PID<50041.23646.8099>, :get_state_and_mode, 5000]}}}}}"

When clients try to connect, they get this error:

write CONNECTION_CLOSED aws-0-us-west-1.pooler.supabase.com:6543

We've already upgraded to the most expensive tier, optimized our db, and added a caching layer, and we still get these errors. We also deployed Supabase Grafana and looked at our stats, and most of the activity metrics showed plenty of headroom.

abc3 commented 3 months ago

@haydn @SkycladObserver I'm looking for a fix, sorry about that

sheannethedev commented 3 months ago

Hello @abc3, I'm also getting this write CONNECTION_CLOSED issue very frequently on my Python backend. My query is already well optimized (it uses an index scan). I've tried both the Python supabase client and my own psycopg connection pool.

This is disrupting our day-to-day operations. We're hoping this gets fixed ASAP. Thank you.

abc3 commented 3 months ago

By the way, could you provide a small repo with steps on how to reproduce the issue? Did you use session or transaction mode? Are there queries with prepared statements? That would be very helpful to me.

sheannethedev commented 3 months ago

Sure thing. I can provide a code snippet.

import os

from flask import Flask, jsonify, request
from supabase import create_client

application = Flask(__name__)

@application.route("/myendpoint", methods=["POST"])
def myendpoint():
    try:
        data = request.get_json()
        db_client = supabase_client()
    except Exception as e:
        return jsonify({"status": "Failed", "error": repr(e)})

    try:
        # do_something is our own function; it can run for 5-30s (see below).
        output = do_something(data)
    finally:
        # update_db runs a simple UPDATE ... SET query using db_client.
        update_db(data, db_client)

    return jsonify(output)

def supabase_client(url=None, key=None):
    if url is None:
        url = os.getenv("SUPABASE_URL")
    if key is None:
        key = os.getenv("SUPABASE_SERVICE_ROLE_KEY")

    return create_client(url, key)

The update_db function runs a simple UPDATE ... SET query that uses an index scan.

The function do_something can take from 5 to 30s to finish. The behavior is unpredictable. The connection closed can happen for short or long request durations.

I'm also using transaction mode. Our plan allows 500 direct connections and 120,000 Supavisor connections. On Grafana we only see under 70 connections in total across direct and Supavisor. Our CPU usage is also very low.

SkycladObserver commented 3 months ago

I can give a code snippet later, but it happens fairly often in transaction mode where inserts are quite frequent. In my case I'm using Vercel serverless with Drizzle ORM on top of the postgres.js driver: https://github.com/porsager/postgres. I've also turned off prepared statements.

SkycladObserver commented 3 months ago

@abc3 here's some sample code:

import postgres from "postgres";
import { drizzle } from "drizzle-orm/postgres-js";

// Local modules (paths illustrative): Drizzle schema and validated env vars.
import * as schema from "./schema";
import { env } from "./env";

// Prepared statements are disabled because we connect through the Supavisor
// pooler in transaction mode.
const queryClient = postgres(env.DATABASE_URL, {
  prepare: false,
});

const db = drizzle(queryClient, {
  schema,
  logger: process.env.NODE_ENV === "development",
});

async function sample() {
  // Insert the initial record first (its own statement, not a transaction).
  const record = await insertInitialRecord();

  try {
    // The external call can take anywhere from 3s to 10s.
    const externalData = await getDataFromExternalServer();

    await updateRecordWithExternalData(externalData);
  } catch (e) {
    // If the external server fails, mark the record as failed instead.
    await updateRecordStatusToFailed(record.id);
    throw e;
  }
}

Basically, I have a function that inserts an initial record into the db, gets data from an external server (which can take 3-10s), and then updates the db record with that data. If the external server returns an error, it updates the record's status to failed instead. Note that these inserts/updates aren't in the same transaction because I need to insert the initial record first.

Note: I don't know if this affects anything, but we're also using Supabase Realtime.

abc3 commented 3 months ago

thanks guys ❤️

abc3 commented 3 months ago

Can you please check your connections again?

SkycladObserver commented 3 months ago

@abc3 It's good now!!! Thank you, our savior 🙏 Very very fast response <3

SkycladObserver commented 3 months ago

@abc3 I'm curious, what was the problem? Was it a code or a configuration issue?

sheannethedev commented 3 months ago

Works on my end as well. Thank you @abc3! I'd like to know the cause of the issue too, if possible.

haydn commented 3 months ago

@abc3 Unfortunately, we're still seeing this issue intermittently (4 out of the 13 requests hit this issue):

[screenshot]

The errors in the Vercel logs look like this (it's a Next.js app, this is a node-postgres error):

Error: Client has encountered a connection error and is not queryable
at /var/task/node_modules/.pnpm/pg@8.7.1/node_modules/pg/lib/client.js:563:27
at process.processTicksAndRejections (node:internal/process/task_queues:77:11)

These all line-up with "Terminating with reason" errors in the Pooler logs:

[screenshot]

This issue only appears to be occurring on pages where there's a significant amount of data getting returned by the queries. For example, here's the same page for a different user where there's very little data being returned from the database:

[screenshot]
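
One way I'm thinking of confirming the payload-size theory (I haven't fully tried this yet) is to page the big queries through the pooler in smaller chunks and see whether the timeouts disappear. Something like the sketch below (the helper, table, and column names are made up):

import { Pool } from "pg";

// Hypothetical helper: pull rows in pages so that no single query holds the
// pooled connection while a large result set streams back.
async function fetchItemsPaged(pool: Pool, pageSize = 500) {
  const all: unknown[] = [];
  let offset = 0;
  for (;;) {
    const { rows } = await pool.query(
      "SELECT * FROM items ORDER BY id LIMIT $1 OFFSET $2",
      [pageSize, offset]
    );
    all.push(...rows);
    if (rows.length < pageSize) break;
    offset += pageSize;
  }
  return all;
}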

Again, bypassing Supavisor and connecting directly to Postgres eliminates this issue completely (but brings back the problem of running out of connections 😟).

By the way, could you provide a small repo with steps on how to reproduce the issue? Did you use session or transaction mode? Are there queries with prepared statements?

chasers commented 3 months ago

@haydn I think we may have handled this one. Let us know please!

Currently deployed only to Supavisor in ap-southeast-2

haydn commented 3 months ago

@chasers @abc3 You're my heroes! I'm not seeing any errors anymore:

[screenshot]

I'll monitor it for the rest of today and post back here this afternoon if it's still looking okay.

haydn commented 3 months ago

It's worked flawlessly all day. I've reconnected our production app to the pooler as well, and it also hasn't had any issues. Calling this fixed. Thanks for your help!

blechatellier commented 3 months ago

@chasers we started having issues about 24 hours ago in ap-southeast-2: lots of terminated connections and DbHandler exited errors. This seems related to that release.

chasers commented 3 months ago

@blechatellier can you submit a support ticket so we can get a project id please?

chasers commented 3 months ago

I can see the issue here; it's stemming from "expected SASL response, got message type 83" (83 is the ASCII code for "S").

chasers commented 3 months ago

@blechatellier fix should be rolling out shortly

abc3 commented 3 months ago

@blechatellier can you please check again?

blechatellier commented 3 months ago

@chasers @abc3 thanks, I will monitor for the next few hours but so far no errors since 06:22:38.82 AEST. What was causing the issue?

daniel-xyz commented 3 months ago

@chasers I'm seeing similar errors on aws-0-eu-central-1.pooler.supabase.com, this one is from half an hour ago:

[Screenshot 2024-06-14 at 10:38:53]

I'm not sure how serious it is, but I also see a lot of my queries failing without any specific error, and I'm still trying to figure out the root cause. If I use the direct connection, all my queries succeed. Could this be related? I just sent a support ticket as well.

Cali93 commented 3 months ago

I'm seeing the same issue as daniel-xyz. I get 500s without any specific error response.

abc3 commented 3 months ago

What was causing the issue?

@blechatellier the incoming data wasn't being handled properly when spinning up a new pool

abc3 commented 3 months ago

@daniel-xyz @Cali93 hey, did you submit a support ticket?

abc3 commented 3 months ago

@SkycladObserver @sheannethedev there was a misconfiguration on one of the cluster nodes in your region

daniel-xyz commented 3 months ago

@daniel-xyz @Cali93 hey, did you submit a support ticket?

Yes, just saw your mail. :) Will test if the errors occur in session mode as soon as I'm back to work.