Is there any way to run a reliable Osmosis full node without constant manual intervention?

08d2 commented 1 year ago

I'm responsible for the availability of infrastructure that includes, among many other things, multiple Osmosis full nodes. I've written automation tooling to keep these nodes up-to-date, which assumes (1) the releases provided by this GitHub repository are authoritative; and (2) those releases abide by the semver semantics that they express.

Today, about an hour ago, my Osmosis mainnet full nodes got wedged and effectively died, reporting an app hash error. Those nodes were running v12.1.0 via Cosmovisor. It appears (at the moment, confirmation forthcoming) that the v12.3.0 upgrade released a few days ago was, contrary to the asserted semver version semantics, not actually backwards-compatible, and was in fact a necessary upgrade, beyond a certain height. (Tangentially, my Osmosis testnet nodes also died a couple days ago, claiming they needed a v13 upgrade, which is AFAICT not available.)

How could I have prevented this from happening? What should my automation tooling have done? More broadly, how can someone deploy an Osmosis full node that doesn't require constant manual intervention in order to be reliable?

ValarDragon commented 1 year ago

The mainnet state divergence yesterday was introduced in v12.2.0 (around a month ago); we tried our best to communicate that this was necessary as an emergency security upgrade on all available channels. It only caused a state divergence in mainnet yesterday. Folks should have been upgrading their full nodes across all of cosmos at the time.

It just so happens that no tx that was altered by the state machine difference from the security patches got executed prior to yesterday.

faddat commented 1 year ago

Hi @08d2 -- I think that this incident illustrates the importance of perpetually patching to the latest release -- possibly using automated tooling.
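As a rough illustration of what "patching to the latest release" could mean for automated tooling, here is a minimal sketch that reports whether a running version lags behind the newest published tag. The function name `is_behind` is hypothetical, and the sketch assumes plain `vX.Y.Z` tags and a `sort` that supports `-V` (version sort, as in GNU coreutils). It only compares two strings; fetching the tags is left out.

```shell
# is_behind RUNNING LATEST
# Exit status 0 when RUNNING is strictly older than LATEST, non-zero otherwise.
# Assumption: both arguments are plain vX.Y.Z tags (no -rc suffixes).
is_behind() {
  local running="$1" latest="$2"
  # Equal versions are not "behind".
  [ "$running" != "$latest" ] &&
    # After a version sort, the older tag comes out first.
    [ "$(printf '%s\n%s\n' "$running" "$latest" | sort -V | head -n 1)" = "$running" ]
}
```

A monitoring job could call `is_behind v12.1.0 v12.3.0` and raise an alert whenever it succeeds, regardless of which version component changed.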

Our team was one of the ones that went down, meaning that we didn't help to secure the network in this case, a serious failing on our part.

What we're doing to address this is improving cosmosia (https://github.com/notional-labs/cosmosia) to let us know about upgrades in a more timely fashion.

Concerning both validation and infrastructure provision, it's long been our practice to customize nodes like this:

git clone https://github.com/osmosis-labs/osmosis
cd osmosis
go mod edit -replace github.com/tendermint/tm-db=github.com/baabeetaa/tm-db@pebble
go mod tidy
go install -ldflags '-w -s -X github.com/cosmos/cosmos-sdk/types.DBBackend=pebbledb -X github.com/tendermint/tm-db.ForceSync=1' -tags pebbledb ./...

Please feel free to evaluate cosmosia for your own use. It is fully open source and serves tens of billions of RPC queries monthly.

There's a secondary, non-osmosis issue, too:

chain teams need to bump versions on the cosmos-sdk more frequently (but this isn't an osmo issue)

Ah and another thing -- pretty sure that our automated system for catching these bumps failed in this case. Don't rightly know why.

08d2 commented 1 year ago

@ValarDragon

> The mainnet state divergence yesterday was introduced in v12.2.0 (around a month ago); we tried our best to communicate that this was necessary as an emergency security upgrade on all available channels. It only caused a state divergence in mainnet yesterday. Folks should have been upgrading their full nodes across all of cosmos at the time.

So semver says v12.2.0 expresses MAJOR version 12, MINOR version 2, and PATCH version 0. It defines an increment of the MINOR version to signal adding functionality in a backwards compatible manner. Which I guess means that v12.2.0 is supposed to be backwards-compatible with v12.1.0 — and with v12.0.0, and v12.5.13, or v12.anything, really. You can absolutely break compatibility whenever you want, but if you're using semver then when you do that you're supposed to increment MAJOR.
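To make that reading of semver concrete, here is a small sketch that classifies the difference between two tags. The helper names `semver_field` and `bump_kind` are hypothetical, and the code assumes plain `vMAJOR.MINOR.PATCH` tags without pre-release suffixes.

```shell
# semver_field TAG N
# Prints the Nth dot-separated field of TAG after dropping a leading "v".
semver_field() {
  printf '%s\n' "${1#v}" | cut -d. -f"$2"
}

# bump_kind OLD NEW
# Prints the first semver component that differs between OLD and NEW:
# "major", "minor", "patch", or "none" when the tags are identical.
bump_kind() {
  if [ "$(semver_field "$1" 1)" != "$(semver_field "$2" 1)" ]; then
    echo major
  elif [ "$(semver_field "$1" 2)" != "$(semver_field "$2" 2)" ]; then
    echo minor
  elif [ "$(semver_field "$1" 3)" != "$(semver_field "$2" 3)" ]; then
    echo patch
  else
    echo none
  fi
}
```

Under strict semver, only a `major` result would signal a breaking change. In this thread, `bump_kind v12.1.0 v12.2.0` prints `minor`, yet v12.2.0 is the release said to have introduced the state divergence, so tooling that skips or delays `minor` bumps would not have been safe here.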

Is Osmosis not actually doing semver? That's no problem if so! I'm happy to change my mental model. Maybe add a disclaimer to the README, though? As I guess vX.Y.Z is a pretty unambiguous expression of semver to most people.

Assuming that's true, though, my original question remains. How can I write a program that will reliably keep an Osmosis full node up over time? Right now I'm basically monitoring the GitHub releases of this repo, filtering out -rcs, and then I guess now taking every new release regardless of MAJOR or MINOR or PATCH and shoving it into Cosmovisor under... what, I guess upgrades/vX.Y.Z/bin/osmosisd directly? Will this work, even with e.g. v12.1.0 followed by v12.2.0 and then v12.3.0, etc.? Is there any chance that this automation tooling would stick a binary in Cosmovisor too soon and cause the node to crash somehow? Are there any other risks I might not be aware of?
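For reference, the workflow described above can be sketched roughly as follows. All three function names are hypothetical. `DAEMON_HOME` and `DAEMON_NAME` are Cosmovisor's environment variables, and defaulting `DAEMON_NAME` to `osmosisd` is an assumption. The sketch only filters tags and formats the target path; it downloads and installs nothing. Note that Cosmovisor resolves `upgrades/<name>` by the name of the on-chain upgrade plan, so whether a release tag such as `v12.3.0` is a usable directory name is exactly the open question here.

```shell
# is_release_candidate TAG
# Exit status 0 when TAG carries an "-rc" suffix, e.g. v13.0.0-rc1.
is_release_candidate() {
  case "$1" in
    *-rc*) return 0 ;;
    *) return 1 ;;
  esac
}

# latest_stable
# Reads tags on stdin (one per line), drops -rc tags, prints the highest one.
# Assumes a sort with -V (version sort), as in GNU coreutils.
latest_stable() {
  grep -v -e '-rc' | sort -V | tail -n 1
}

# cosmovisor_upgrade_path NAME
# Prints where Cosmovisor would look for the binary of an upgrade called NAME.
# Fails with an error message if DAEMON_HOME is unset or empty.
cosmovisor_upgrade_path() {
  printf '%s/cosmovisor/upgrades/%s/bin/%s\n' \
    "${DAEMON_HOME:?DAEMON_HOME must be set}" "$1" "${DAEMON_NAME:-osmosisd}"
}
```

For example, with `DAEMON_HOME=/home/node/.osmosisd`, piping a tag list through `latest_stable` and passing the result to `cosmovisor_upgrade_path` yields the path where the tooling described above would place the new binary.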

08d2 commented 1 year ago

Friendly ping re: above questions :)

08d2 commented 1 year ago

Another friendly bump of this issue :)

osmosis-labs/osmosis issue #3465