Stop all services if core settings upgrade fails

sisuresh commented 4 months ago

Resolves https://github.com/stellar/quickstart/issues/547

An example of the failure that this can result in can be seen here - https://github.com/sisuresh/docker-stellar-core-horizon/pull/1.

sisuresh commented 4 months ago

We have this issue on a larger stage that for this image when things go wrong, it rarely stops running. Sometimes this is okay, as it would be actually more disruptive if the failure of a single service that a dev might not even be using fails to start. But routinely these quiet failures cause us to miss a failure until some later point where we are debugging the problem indirectly.

I'm saying this because I spent yesterday debugging a failure in another part of the start script where it just kept on going, with a small error in the logs that was not easy to spot.

Instead of the change here, I'm wondering if we should set some different failure configs so that if anything goes wrong we hard exit the script, and so at the location of this check we'd do that instead of stopping services.

The change in this PR prevents the github actions from passing, so we'll never publish a broken image that does this type of error checking. Hard exiting would be ideal though, but I'm not sure how to do that. If that's possible, then we should go that route.

leighmcculloch commented 4 months ago

Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running.

sisuresh commented 4 months ago

Looks like there are still situations where the upgrade can fail, such as if the file needs updating due to an xdr change, and the image can keep running.

Argh yeah good point. This check requires the previous steps to pass. A "health" check at the end where we validate that the upgrade went through would be better. The fact that the errors are just swallowed by the service is annoying though.

sisuresh commented 4 months ago

I'm looking into doing something like https://serverfault.com/a/922943.

sisuresh commented 4 months ago

@leighmcculloch was the error you ran into in the start script or within a service? I made a change to just trap on ERR and kill supervisor, but it'll only trap in the upgrade_local function for now.

github-actions[bot] commented 3 months ago

This pull request is stale because it has been open for 30 days with no activity. It will be closed in 30 days unless the stale label is removed.

sisuresh commented 3 months ago

@leighmcculloch have you seen an action error like the one currently failing in this PR before? It looks like the build-stellar-core step under Testing just times out after 6 hours.

leighmcculloch commented 3 months ago

I've seen it a couple times, intermittent, not consistent. On other PRs rerunning the build resulted in a pass. I kicked it again.

stellar / quickstart

Stop all services if core settings upgrade fails #601