v2.44.0.cli-migrations-v3: No retry for postgres database connection, container frozen

Version Information

Server Version: CLI Version (for CLI related issue): v2.44.0.cli-migrations-v3

Environment

self-hosted

What is the current behaviour?

We are running hasura in our kubernetes cluster. We have a postgres DB (it's ephemeral) in a container deployed next to it. On startup, hasura is faster than the postgres DB. So I naturally see the log entry

{"error":"connection error","path":"$","code":"postgres-error","internal":"connection to server at \"localhost\" (127.0.0.1), port 5433 failed: Connection refused\n\tIs the server running on that host and accepting TCP/IP connections?\nconnection to server at \"localhost\" (::1), port 5433 failed: Cannot assign requested address\n\tIs the server running on that host and accepting TCP/IP connections?\n"}

The problem that I have is, that hasura does not retry the postgres connection. There is no additional logging until the hasura gets killed by HASURA_GRAPHQL_MIGRATIONS_SERVER_TIMEOUT. Once a new container is spun up by kubernetes, the postgres db is ready and everything works fine.

I don't remember that we had a similar issue before. So it might be a regression issue, because we've been using the same setup for several months now.

The central question here is: Is there a retry mechanism for the database connection of the temporary server that's being created by the cli-migrations-v3 image?. From what I can see, there is not. Even when running the (tests)[https://github.com/hasura/graphql-engine/tree/master/packaging/cli-migrations/v3/test] in the hasura repo, I can see the same behavior if the postgres DB is not already available when hasura starts up.

I am willing to contribute to the project to fix this, if necessary!

What is the expected behaviour?

The container retries the DB connection in a (optional: configurable) interval.

How to reproduce the issue?

Use the docker-compose file from the (test folder)[https://github.com/hasura/graphql-engine/blob/master/packaging/cli-migrations/v3/test/docker-compose.yaml].
Run docker-compose up
Check the logs of the hasura container and see that there is no retry for the database connection
See that the container gets killed once HASURA_GRAPHQL_MIGRATIONS_SERVER_TIMEOUT has been reached.

Screenshots or Screencast

Please provide any traces or logs that could help here.

In the (test)[https://github.com/hasura/graphql-engine/blob/master/packaging/cli-migrations/v3/test/test.sh] section of the image I can see that the postgres DB is spun up before the hasura instance. Maybe it's a coincident, but this might approve my suspicion that there is no repeated check of the DB connection in the hasura instance.

Any possible solutions/workarounds you're aware of?

Implement polling/retrying for the database connection.

Keywords

auto-migrate, cli-migrations-v3, database, postgres, metadata

hasura / graphql-engine