hasura / graphql-engine

Blazing fast, instant realtime GraphQL APIs on your DB with fine grained access control, also trigger webhooks on database events.
https://hasura.io
Apache License 2.0

Automatically retry remote schema introspection on failure #8396

Open · BenoitRanque opened this issue 2 years ago

BenoitRanque commented 2 years ago

Is your proposal related to a problem?

On startup, and when metadata is reloaded, Hasura performs introspection requests against remote schemas. This may fail for multiple reasons. A common failure mode is a CI/CD setup that updates Hasura metadata at the same time as it updates the third-party server that houses the remote schema. If that server is offline during that process and takes longer than Hasura to come online, Hasura won't be able to introspect the schema and will start with a metadata inconsistency.

This issue can then be resolved by manually reloading the remote schema from the console. This is of course not an ideal experience from a UX point of view.

Describe the solution you'd like

When remote schema introspection fails, Hasura should periodically retry. Ideally, the retry interval should be configurable.

Describe alternatives you've considered

Users may instead perform their own healthchecks against their remote schemas and trigger a Hasura metadata reload when the schema comes back online (see the sketch below). This, however, involves a lot of custom coding and is generally inconvenient.
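For illustration, here is a minimal sketch of that alternative, assuming Node 18+ (built-in fetch). The endpoint URLs, environment variable names, and the remote schema name `my_remote_schema` are placeholders, not part of this issue. It polls the remote schema and, once it responds, calls Hasura's `reload_remote_schema` metadata API:

```typescript
// Sketch only: poll a remote schema until it responds, then ask Hasura to re-introspect it.
// URLs, env var names, and the schema name below are illustrative placeholders.
const HASURA_URL = process.env.HASURA_URL ?? "http://localhost:8080";
const ADMIN_SECRET = process.env.HASURA_ADMIN_SECRET ?? "";
const REMOTE_SCHEMA_URL = process.env.REMOTE_SCHEMA_URL ?? "http://my-service:4000/graphql";

async function remoteSchemaIsUp(): Promise<boolean> {
  try {
    // Cheapest possible GraphQL request; succeeds once the remote server is answering.
    const res = await fetch(REMOTE_SCHEMA_URL, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: "{ __typename }" }),
    });
    return res.ok;
  } catch {
    return false;
  }
}

async function reloadRemoteSchema(name: string): Promise<void> {
  // Hasura metadata API: reload a single remote schema by name.
  const res = await fetch(`${HASURA_URL}/v1/metadata`, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "x-hasura-admin-secret": ADMIN_SECRET,
    },
    body: JSON.stringify({ type: "reload_remote_schema", args: { name } }),
  });
  if (!res.ok) throw new Error(`reload_remote_schema failed with status ${res.status}`);
}

async function main(): Promise<void> {
  // Retry every 5 seconds until the remote schema answers, then reload it in Hasura.
  while (!(await remoteSchemaIsUp())) {
    await new Promise((resolve) => setTimeout(resolve, 5_000));
  }
  await reloadRemoteSchema("my_remote_schema"); // placeholder name
  console.log("Remote schema reloaded");
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```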

BenoitRanque commented 2 years ago

CC @miguelff FYI

carlosbaraza commented 2 years ago

This is quite a pain point in our deployment CI/CD too. On many occasions the metadata becomes inconsistent because the container is periodically recycled and, unfortunately, the network sometimes fails. When it does, it breaks the schema, throwing errors in production and causing downtime.

Our current workaround is to trigger a metadata reload whenever the remote schema starts up again, but it is still kind of fragile.

We would like to have a mechanism to tell Hasura: trust me, the remote schema is what we tell you it is, without introspections. If the contract is broken, Hasura could throw errors instead. But breaking the schema at bootup with a simple introspection is quite fragile.

AThilenius commented 1 year ago

Hasura Team, could we get a bit of ❤️ here? This is one of 4 issues on this topic (#8396, #7064, #5126, #5117) without resolution, and it causes catastrophic failure in production, so it seems like a pretty high priority. We've been bitten by this in production a half dozen times now, with user downtime each time. Hasura as an API gateway HAS to be fault tolerant.

TL;DR version: Hasura boots before a remote schema host(s), resulting in 'inconsistent metadata' which isn't retried and takes all of Hasura (our API gateway) offline.

ajohnson1200 commented 1 year ago

Noted, looking to tackle this in this quarter (Nov --> Jan)

Varun-Choudhary commented 1 year ago

On a slightly related note, we have improved the console UX when a remote schema is inconsistent. We now show a badge at the top of each tab of a remote schema if it has any inconsistency.

[Screenshot (2022-11-28): console tabs showing the inconsistency badge]

zachequi commented 1 year ago

> Noted, looking to tackle this in this quarter (Nov --> Jan)

Happy new year! Just wanted to bump this thread as this actively affects us in production as well, is this still being worked on and slated for release sometime this quarter?

WonderPanda commented 1 year ago

This one would impact our team in a huge way too. I've been eagerly watching this issue, and we are primed to switch over to remote schemas as soon as the fault tolerance here is addressed.

tirumaraiselvan commented 1 year ago

Retrying remote schemas till they return a response will block startup and hence cause downtime anyways (if not in a rolling deploy environment). Secondly, if we retry it a few times with a timeout and it still fails, then we are back to the original problem. A better solution is to cache the earlier (or some) remote schema introspection in persistent storage and use that (this is a significant development effort, but something that we are actively considering).

A workaround for this is to consistently call the reload_remote_schema (docs) metadata API after any remote schema deployment either from your CI or somewhere else.

zachequi commented 1 year ago

> Retrying remote schemas till they return a response will block startup and hence cause downtime anyways (if not in a rolling deploy environment). Secondly, if we retry it a few times with a timeout and it still fails, then we are back to the original problem. A better solution is to cache the earlier (or some) remote schema introspection in persistent storage and use that (this is a significant development effort, but something that we are actively considering).
>
> A workaround for this is to consistently call the reload_remote_schema (docs) metadata API after any remote schema deployment either from your CI or somewhere else.

In a Kubernetes environment, which is likely common (and how we use it), this blocking behavior is actually not a problem; it's desirable. I'd be thrilled for Hasura to do that, and it would mean we could remove a bunch of hacks/horrible workarounds to ensure reliable bootup.

tirumaraiselvan commented 1 year ago

@zachequi Just to understand your expectations better: if you have N remote schemas with r retries and a timeout of t, then the time it could take to start your Hasura container is potentially N × r × t. So, say, for 5 remote schemas with 3 retries at 60s intervals, this is essentially 15 minutes. Are these kinds of numbers acceptable in general (note that once this is configured, you cannot revert it without Hasura being up again)?

Also, the second issue still persists: A Remote Schema still may not be up after the retries are over. What are your thoughts on tackling this scenario?

We are very keen to fix this issue, but a proper solution seems to be to persist the earlier remote schema introspection result and some way to refresh it with the new introspection (after the deployment is complete). We are working on such a solution, but it is a considerable effort.

WonderPanda commented 1 year ago

In our case we would definitely prefer to have the remote schema metadata stay stale (with the possibility that it could cause runtime errors) if it can't be initially refreshed. Ideally, Hasura would expose information through an API endpoint that there are stale and potentially out-of-date remote schema definitions, so we could act on those accordingly.

As it stands right now we're concerned about adopting Remote Schemas because we don't have the capacity to build custom deployment scripts that will keep trying in case of failure and can't afford to risk Production Downtime if there's a timing issue between when the remote schema is updated and when Hasura boots up.

zachequi commented 1 year ago

> @zachequi Just to understand your expectations better: if you have N remote schemas with r retries and a timeout of t, then the time it could take to start your Hasura container is potentially N × r × t. So, say, for 5 remote schemas with 3 retries at 60s intervals, this is essentially 15 minutes. Are these kinds of numbers acceptable in general (note that once this is configured, you cannot revert it without Hasura being up again)?

Yes, this is desirable. In our use case we're not waiting for a timeout; the request either succeeds or fails almost immediately (waiting for another Docker container to finish booting). A more realistic scenario for us is N schemas, a 5-10s interval, and a 0.1s timeout. So it would add 5-20 seconds at bootup, which is actually exactly the behavior we're looking for.

> Also, the second issue still persists: A Remote Schema still may not be up after the retries are over. What are your thoughts on tackling this scenario?

In the absence of the ability to use a persisted/cached remote schema, I would want Hasura to retry indefinitely. I can imagine this isn't desirable for all users though and might make a good configuration option.

> We are very keen to fix this issue, but a proper solution seems to be to persist the earlier remote schema introspection result and some way to refresh it with the new introspection (after the deployment is complete). We are working on such a solution, but it is a considerable effort.

For our use case, serving stale metadata is better than what it does now (booting with invalid metadata and thus missing APIs -> effectively guaranteeing a production outage), but not ideal. An old schema will still serve APIs but will be missing any updates, and thus may serve some broken APIs.

So there are three options for what to do when a remote schema isn't available:

  1. Boot without remote schemas. This guarantees a broken production experience as ALL remote schema calls fail
  2. Boot with a stale schema. This MAY break production, depending on the difference between stale and fresh schemas
  3. Don't boot / delay boot until we get a fresh schema -> guarantees correctness in production

A retry at bootup is, though, the only solution that guarantees correct behavior, so that a freshly booted Hasura instance has up-to-date metadata and serves the correct API to clients. I'd argue option #1, the current behavior, is unacceptable in all scenarios and should never happen.

AThilenius commented 1 year ago

@tirumaraiselvan

> Retrying remote schemas till they return a response will block startup and hence cause downtime anyways

That's the very nature of retry logic, yes. That is not the core issue though.

Whether or not the retry logic blocks Hasura calls (ones that don't require the remote schema), I'm personally unopinionated on. It's not on the Hasura team to make sure my custom backend boots and recovers from failure successfully and in a timely manner.

The problem is Hasura permanently entering a bad state until a human or monitoring system realizes our entire customer-facing website is down. I would be happy with something as simple as an infinite retry every 5 seconds when a remote schema is down.

zachequi commented 1 year ago

Thank you @tirumaraiselvan very much for the time on the phone yesterday. We discussed this issue in detail and came up with a different solution that, at least for our use case and likely others using Kubernetes, solves the problem. The discussed solution is to update the logic of the Hasura healthcheck URL (or provide the ability to configure the healthcheck's behavior) to incorporate metadata status and return a 500 if the initial metadata load is unhealthy.

For us, using Kubernetes, if Hasura fails to fetch metadata on bootup, it is likely that Hasura won the bootup race before the backend was available. A failing healthcheck would trigger a K8s probe (likely a startupProbe) to restart the pod after a few seconds and continue in that loop until our backend is available, all the while not receiving any production traffic. In essence, the responsibility for actually doing retries is shifted from Hasura to Kubernetes.
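For reference, a minimal sketch of what such a probe could look like, assuming the Hasura container serves on port 8080 and that the deployed graphql-engine version supports the strict=true query parameter on /healthz (which makes the endpoint return a failure status when metadata is inconsistent); check the docs for your version before relying on it:

```yaml
# Sketch only (not from this thread): a Kubernetes startupProbe that keeps the pod
# out of service, and eventually restarts it, while Hasura's metadata is inconsistent.
startupProbe:
  httpGet:
    # Assumption: /healthz?strict=true returns a failure status on inconsistent metadata
    # in the graphql-engine version being deployed.
    path: /healthz?strict=true
    port: 8080
  periodSeconds: 5        # probe every 5 seconds while waiting for the remote schema
  failureThreshold: 60    # after ~5 minutes of failures, the container is restarted
```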

I want to say again thank you to the Hasura team for being so patient/responsive to my bothering them about this issue!

vincentjames501 commented 1 year ago

This has been a pain point for us as well. I'm personally fine with Hasura starting up and serving requests for all non-broken remote schema paths and periodically retrying the refresh of the remote schema. It's the internet, after all, and requests are bound to periodically fail, or a service may be experiencing transient issues. Having to do manual intervention is not ideal.

JefferyHus commented 1 year ago

Same here, in a GCP environment; the workaround is to just kill the pod if the introspection fails, so it keeps retrying until the remote schema is ready. I wish we had a better and easier way to do this. I would prefer the solution proposed in a previous comment, where you cache the latest successful introspection and keep using it until the new one is up and running.

tirumaraiselvan commented 1 year ago

Hey folks, as I previously mentioned, using a previously cached/stored copy of introspection is being tracked here: https://github.com/hasura/graphql-engine/issues/9561