hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0
31.17k stars 2.77k forks source link

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection. #9227

Open julianomcl opened 1 year ago

julianomcl commented 1 year ago

Version Information

Server Version: v2.10.0 CLI Version (for CLI related issue): v2.10.0

What is the current behaviour?

We have Hasura hosted on a Cloud Run on Google Cloud Platform connected with a Postgres and a SQL Server databases. From time to time, Hasura is losing connection with the SQL Server database and unable to reconnect again.

We are receiving this exception message:

[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection.

What is the expected behaviour?

We would like to Hasura reconnect with SQL Server automatically, so we don't have to release a new revision on Cloud SQL to recover the connection.

Please provide any traces or logs that could help here.

This is the log we have from Cloud Run:

{
  "insertId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "jsonPayload": {
    "level": "error",
    "detail": {
      "error": "database query error",
      "internal": {
        "exception": {
          "type": "unsuccessful_return_code",
          "message": "[Microsoft][ODBC Driver 17 for SQL Server]The connection is broken and recovery is not possible. The connection is marked by the client driver as unrecoverable. No attempt was made to restore the connection."
        },
        "query": "BEGIN TRANSACTION"
      },
      "path": "$",
      "code": "unexpected"
    },
    "timestamp": "2022-11-18T11:49:34.097+0000",
    "type": "event-trigger"
  },
  "resource": {
    "type": "cloud_run_revision",
    "labels": {
      "configuration_name": "hasura-prd",
      "service_name": "hasura-prd",
      "location": "southamerica-east1",
      "revision_name": "hasura-prd-xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
      "project_id": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    }
  },
  "timestamp": "2022-11-18T11:49:35.103839Z",
  "labels": {
    "instanceId": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
  },
  "logName": "xxxxxxxxxxxxxxxxxxxxxxxxxxxx",
  "receiveTimestamp": "2022-11-18T11:49:35.195698667Z"
}

Keywords

sql server connection broken

ajohnson1200 commented 1 year ago

Thanks for bringing this up. My guess here is that we're not doing proper connection pool validation (examples: https://confluence.atlassian.com/conf76/surviving-database-connection-closures-1018769693.html) and that behind the scenes GCP is doing maintenance / moving around DB's (as is their operational model).

Could you share what you have set for idle timeout and max connections for the DB in question in the Hasura Console --> Data tab?

julianomcl commented 1 year ago

@ajohnson1200 Thanks for the response. We have these values:

SamirTalwar commented 1 year ago

Hi @julianomcl.

I've been digging into this issue and unfortunately can't reproduce it. The idle timeout of 5 seconds that you have set should mean that any connections to the database are either dropped within 5 seconds, or used once and then discarded, as we drop any connections where we receive an error (for exactly this reason). As far as I can tell this has been the case since versions of Hasura pre-v2.10.

In addition, I would have thought that Google Cloud Run would be periodically restarting your containers anyway (the documentation suggests that there's a maximum container lifetime of 1 hour, but I might be wrong here). Are you using a mechanism which makes the Hasura container persist longer than this?

Can you tell me if you're you still seeing these issues? And have you tried upgrading Hasura to see if it makes the problem go away?

tadeumaia commented 1 year ago

Hello, I work with @julianomcl

Thanks for all the work you already did debugging this issue, but we are still having this issue.

Hasura is connected to two DB's one MSSQL and one POSTGRES, these problems only happen on the MSSQL database and locks all queries, after a container restart at cloud run the error is resolved and everything goes back to normal until we get hit by the error again.

Could you please provide us with some guidance on how to further debug this?

Here is some more information that could help:

Connection String Driver={ODBC Driver 17 for SQL Server};Server=cloud-sql-internal-hostname:1433;Database=ourdb;Uid=user;Pwd=pwd;

Hasura Settings HASURA_GRAPHQL_ENABLED_LOGTYPES | startup,http-log,query-log,websocket-log,webhook-log HASURA_GRAPHQL_DEV_MODE | false HASURA_GRAPHQL_METADATA_DATABASE_URL | Our postgressdb HASURA_GRAPHQL_ENABLE_CONSOLE | true HASURA_GRAPHQL_ENABLED_APIS | graphql, metadata

Cloud Run settings resources: limits: cpu: 2000m memory: 1G startupProbe: timeoutSeconds: 240 periodSeconds: 240 failureThreshold: 1 tcpSocket: port: 8080

lucasnad27 commented 1 year ago

I'm connecting to a MS SQL Server database, running in Azure and encountering the same issue :/

I'm running v2.27.0-cloud.1 Happy to provide more details if needed. I can update an ENV var, which I believe triggers a restart in the hasura cloud environment and everything goes back to normal.

lucasnad27 commented 1 year ago

I'm trying to understand better how vital this bug is to address. I'm preparing to put my Hasura Cloud instance into its first production environment and expect these connection errors to attempt to self-heal. I have this happen at least once every other week, leading to my entire site being down.

My only way around this issue is to go into the Hasura cloud dashboard, update an env var, triggering a restart of the server. I've hooked Hasura up to numerous Postgres services without any issues. Is anyone running SQL Server instances on top of Hasura cloud in a production capacity? Surely there are others with this problem.

I know these issues are hard to reproduce. Fwiw -- it happens when I'm actively developing. e.g., adding new tables, changing permissions, etc.,

SamirTalwar commented 1 year ago

Hey folks, after a lot of prodding, I have come no closer to reproducing this.

Could you please give me as much information as you can about your database setup? For example, the last test I ran was on a Microsoft Azure SQL Database, with the pricing tier "General Purpose - Serverless: Gen5, 1 vCore".

It'd also be helpful to get exact version information:

> SELECT @@version AS version;
Microsoft SQL Azure (RTM) - 12.0.2000.8 Jun 1 2023 13:36:49 Copyright (C) 2022 Microsoft Corporation
julianomcl commented 1 year ago

Hi, @SamirTalwar, thank you for your assistance with this matter.

I have executed the query and obtained the following result:

Microsoft SQL Server 2019 (RTM-CU19) (KB5023049) - 15.0.4298.1 (X64) Jan 27 2023 16:44:09 Copyright (C) 2019 Microsoft Corporation Standard Edition (64-bit) on Linux (Ubuntu 18.04.6 LTS) <X64>
lucasnad27 commented 1 year ago

Version info as requested:

Microsoft SQL Azure (RTM) - 12.0.2000.8 Jun 1 2023 13:36:49 Copyright (C) 2022 Microsoft Corporation

I'm using Azure's fully-managed SQL Server:

Let me know if I can provide any additional information to help you debug. If I discover any sort of pattern to these outages, I'll be sure to post here as well. Given that it doesn't happen very often, it might be awhile 😮‍💨

SamirTalwar commented 1 year ago

Hi folks, we're having trouble getting to the bottom of this. To find out if it's our own connection pooling mechanism or not, we have added a new toggle in v2.30.1. You can set your pool settings to {"enable": false} (using the CLI or API; the Console does not support this).

When you have access to v2.30.1, please give this a shot and let us know if it resolves the issue or not. Either way, it will be valuable information for tracking down the issue.

Thanks.

lucasnad27 commented 1 year ago

Happy to do so Samir. I haven't used the CLI much, but should be able to get this done by EOW. I'll let you know when I'm able to flip the bit and hopefully I can re-produce the issue in the coming weeks 🤞🏻

tadeumaia commented 1 year ago

Some update from us to help you debug this.

We migrated our Hasura deployment from Cloud Run to GKE, both were using GCP VPN to connect to the database. All hasura configuration stayed the same.

We have been 22 days without a problem now.

lucasnad27 commented 1 year ago

@SamirTalwar, perhaps I'm being dense, but I don't see an obvious way to update my pool settings via hasura CLI. I've configured my CLI to point to the Hasura cloud instance via HASURA_GRAPHQL_ENDPOINT & HASURA_GRAPHQL_ADMIN_SECRET env vars. I've looked at the help documentation but I need help finding the right command.

Hasura version: v2.30.1-cloud.1 CLI version: v2.17.0

SamirTalwar commented 1 year ago

@lucasnad27: Using the CLI, you'll need to export your metadata with hasura metadata export, find your pool_settings (typically in databases.yaml), and update them to say something like:

pool_settings:
  enable: false

Then re-apply them with hasura metadata apply.

I highly recommend upgrading your CLI to match the server version first.

lucasnad27 commented 1 year ago

@SamirTalwar thanks for the guidance. Very helpful. I've updated the setting -- found in databases.yaml as mentioned -- to disable pool settings.

I'm hoping my staging hasura instance has enough traffic to replicate the issue I'm seeing in production (happened again a few days ago). If I don't see any issues on the staging database over the coming days, I'll make the same change to my production server. I'll keep this thread updated with my progress & findings.

SamirTalwar commented 1 year ago

Great, thanks!

Assuming it makes it to your production instance and you're monitoring machine statistics, it would be great to know if there's major changes to CPU load, memory usage, or network traffic anywhere.

If you still have issues, please let me know.

lucasnad27 commented 1 year ago

Promoted metadata update to production a few minutes ago. I don't have a ton of traffic on that instance yet, but it should be enough to give us some data points. I'll circle back early next week on this thread.