Hasura HealthCheck is throwing status 500 on AWS Aurora Serverless (Postgres)

sven-codeculture commented 2 years ago

Version Information

Server Version: hasura/graphql-engine:v2.2.1.cli-migrations-v3 Docker Image

Environment

OSS

What is the expected behaviour?

When the DB scales automatically in Aurora Serverless, everything keeps working.

Keywords

"Aurora Serverless"

What is the current behaviour?

The health endpoint returns a status of 500. Users cannot fully use Hasura.

How to reproduce the issue?

1. Connect Hasura to Aurora Serverless in AWS.
2. Force a scale up/scale down in RDS Serverless to a bigger/smaller instance.
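To catch the failure while the scaling event is in progress, it may help to poll the health endpoint in a loop. The following is only a minimal sketch: the function names `check_health` and `watch`, the polling interval, and the number of checks are illustrative assumptions, not part of Hasura.

```python
import time
import urllib.error
import urllib.request


def check_health(url, timeout=5.0):
    """Return the HTTP status code of a GET to url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses (such as the 500 reported here) raise HTTPError.
        return exc.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return None


def watch(url, interval=2.0, checks=30, probe=check_health, sleep=time.sleep):
    """Poll url `checks` times; return (index, status) for every non-200 result."""
    failures = []
    for i in range(checks):
        status = probe(url)
        if status != 200:
            failures.append((i, status))
        if i < checks - 1:
            sleep(interval)
    return failures
```

Calling `watch("http://localhost:8080/healthz")` (a hypothetical address, replace it with your own endpoint) while forcing the scale up/down returns the list of checks that did not come back with 200.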

Please provide any traces or logs that could help here.

{"status":500,"http_version":"HTTP/1.1","url":"/healthz","ip":"10.0.154.222","method":"GET","content_encoding":null}}}
{"type":"http-log","timestamp":"2022-03-14T09:53:15.571+0000","level":"error","detail":{"operation":{"error":{"path":"$","error":"ERROR","code":"unexpected"},"request_id":"910e7dcd-f07a-4b10-9ac9-37288c39e1e5","response_size":48,"raw_query":""},"request_id":"910e7dcd-f07a-4b10-9ac9-37288c39e1e5","http_info":{"status":500,"http_version":"HTTP/1.1","url":"/healthz","ip":"10.0.154.222","method":"GET","content_encoding":null}}}
{"type":"scheduled-trigger","timestamp":"2022-03-14T09:53:24.668+0000","level":"error","detail":{"internal":{"statement":"BEGIN ISOLATION LEVEL REPEATABLE READ ","prepared":true,"error":{"exec_status":"FatalError","hint":null,"message":"current transaction is aborted, commands ignored until end of transaction block","status_code":"25P02","description":null},"arguments":[]},"path":"$","error":"postgres tx error","code":"postgres-error"}}
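For anyone trying to correlate these entries with scaling events, the failed health checks can be pulled out of the JSON logs with a small filter. This is a rough sketch that assumes one JSON object per line, with the field names seen in the log lines above; the function name `failed_health_checks` is made up for illustration.

```python
import json


def failed_health_checks(lines):
    """Yield (timestamp, request_id) for http-log entries where /healthz returned 500."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            # Skip truncated or non-JSON lines.
            continue
        if not isinstance(entry, dict) or entry.get("type") != "http-log":
            continue
        detail = entry.get("detail", {})
        info = detail.get("http_info", {})
        if info.get("url") == "/healthz" and info.get("status") == 500:
            yield entry.get("timestamp"), detail.get("request_id")
```

Passing an open log file (or any iterable of lines) to `failed_health_checks` gives the timestamps to compare against the Aurora scaling events in the RDS console.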

Any possible solutions?

Setting HASURA_GRAPHQL_SCHEMA_SYNC_POLL_INTERVAL to 0 made things a bit better: the issue appears less often, but it still occurs.

simonphughes commented 2 years ago

I have exactly the same issue, with the healthcheck throwing HTTP 500 and the same messages in the logs. I've been having this for a number of months now and have no resolution so far. Our Aurora Serverless scales up and down a lot during the day, and it certainly doesn't happen on every scaling event, but every time I see the issue there has been a scale up/down of the database. This is causing us production issues and I am considering moving away from the serverless solution, but managing instances and sizes causes a whole world of other pain, so I would dearly love to find a solution to this if there is one. For reference, I added "Postgres transactions aborted" #8330.

tirumaraiselvan commented 2 years ago

Is it possible to provide the Postgres logs?

simonphughes commented 2 years ago

I'm happy to privately share any log files if it helps to resolve this issue. As I said before, we don't get the HTTP 500 errors on every scaling event, but we do see lots of errors in the logs on every one. Example:

sven-codeculture commented 2 years ago

We migrated to Serverless v2 last week (which has just become GA). With v2 everything is working fine and the error no longer occurs.
