penumbra-zone / penumbra

Penumbra is a fully private proof-of-stake network and decentralized exchange for the Cosmos ecosystem.
https://penumbra.zone
Apache License 2.0

ci: automated testing of migration compatibility #4323

Open conorsch opened 2 months ago

conorsch commented 2 months ago

Is your feature request related to a problem? Please describe. When preparing a chain upgrade, manual testing of upgrades is an arduous process.

Describe the solution you'd like We should have an integration test that runs a devnet based on the currently active testnet, using the most recently released tag, runs smoke tests against it to generate txs, then stops the network, runs the migration, restarts the network, and reruns the smoke tests.

Describe alternatives you've considered Alternatives are manual testing, which is both slow and error-prone. Longer-term we want a capable "sudo mode" for testing upgrades, which is tracked in #4265.

Additional context We'll need to be careful about endpoint compatibility: we cannot run an older version of pd and run the most recent smoke tests against it, because the view server implementations will not be compatible. We can sidestep this by running the smoke tests from the tagged release. Ideally, we'd be able to swap out the path to binaries within the tests via an env var, to make it a bit easier to test prior versions without rebuilding from source every time.
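A minimal sketch of the env-var idea, assuming the test harness looks up binaries through a small helper. The variable name `PENUMBRA_TEST_BIN_DIR` and the function `resolve_binary` are hypothetical and are not defined anywhere in the repo; the point is only to show how a prebuilt release directory could take precedence over whatever is on `PATH`.

```python
import os
import shutil
from pathlib import Path
from typing import Mapping, Optional

# Hypothetical variable name, chosen for illustration only.
BIN_DIR_VAR = "PENUMBRA_TEST_BIN_DIR"


def resolve_binary(name: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Return the path to `name`, preferring the override directory if set."""
    env = os.environ if env is None else env
    override = env.get(BIN_DIR_VAR)
    if override:
        # An override was requested: fail loudly rather than silently
        # falling back to a binary from a different version.
        candidate = Path(override) / name
        if not candidate.is_file():
            raise FileNotFoundError(f"{name} not found in {override}")
        return str(candidate)
    # No override: use whatever is on PATH (e.g. a fresh build).
    found = shutil.which(name)
    if found is None:
        raise FileNotFoundError(f"{name} not found on PATH")
    return found
```

With something like this in place, a test run could point the variable at a directory of binaries downloaded for the tagged release, and leave it unset to exercise the current build.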

conorsch commented 2 months ago

There's a spike on an overhaul of the smoke-test logic here https://github.com/penumbra-zone/penumbra/pull/4324 which provides a nice shape to extend into migration-testing.

hdevalence commented 2 months ago

> We'll need to be careful about endpoint compatibility: we cannot run an older version of pd and run the most recent smoke tests against it, because the view server implementations will not be compatible.

Can you elaborate on this a little? Isn't our expectation that clients should work across upgrade boundaries?

conorsch commented 2 months ago

> Can you elaborate on this a little? Isn't our expectation that clients should work across upgrade boundaries?

If you try to run a client from current main against a public testnet endpoint, you'll see an incompatibility message related to ongoing auction work:

❯ git rev-parse HEAD
7854a5fc561e2e3f514421c3ea97c80cea5a673e

❯ cargo run -q --release --bin pcli -- view sync
Error: proto response missing auction params

Those same dependencies carry over into the integration tests.

conorsch commented 2 months ago

Pushed a draft PR with a spike on local testing of migration logic, which can be promoted to a CI job once it's solid. Got surprised by a proto incompatibility error that may be spurious, so I'm going to run through the upgrade process manually to sanity-check that the scripting order is sound.

hdevalence commented 2 months ago

@conorsch https://github.com/penumbra-zone/penumbra/pull/4339 will simplify things a great deal

hdevalence commented 2 months ago

> > Can you elaborate on this a little? Isn't our expectation that clients should work across upgrade boundaries?
>
> If you try to run a client from current main against a public testnet endpoint, you'll see an incompatibility message related to ongoing auction work [...]
>
> Those same dependencies carry over into the integration tests.

Got it. I was assuming we would run the smoke test script from the original tag and then, post-upgrade, run the smoke test script from the new HEAD.
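The ordering discussed in this thread can be sketched as a simple plan that records which git ref drives each phase. The `Step` type and `migration_test_plan` function are hypothetical names, and the choice to run the migration and restart from the new ref is an assumption; the thread only states which ref the smoke tests should come from before and after the upgrade.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Step:
    action: str  # what the CI job does in this phase
    ref: str     # git ref whose binaries and scripts perform it


def migration_test_plan(old_tag: str, new_ref: str) -> List[Step]:
    """Return the phases of a migration test in execution order."""
    return [
        Step("start devnet", old_tag),
        Step("run smoke tests", old_tag),   # pre-upgrade: scripts from the tag
        Step("stop network", old_tag),
        Step("run migration", new_ref),     # assumption: new code migrates state
        Step("restart network", new_ref),
        Step("run smoke tests", new_ref),   # post-upgrade: scripts from HEAD
    ]


if __name__ == "__main__":
    # "old-tag" is a placeholder, not a real release tag.
    for step in migration_test_plan("old-tag", "HEAD"):
        print(f"{step.ref}: {step.action}")
```

Keeping the ref next to each action makes the compatibility constraint explicit: no phase pairs a client from one version with a node from another.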

conorsch commented 2 months ago

Made substantial progress on this front. There are notably two types of testing going on:

  1. closed-world, intra-CI testing of a single-validator devnet;
  2. open-world, publicly-accessible, cluster-hosted migration testing.

The former is potentially suitable for per-PR runs, although so far the runtime is quite long: roughly 20 minutes. We recently shaved a ton of per-PR CI runtime off with #4324, so it'd be a shame to push it back up again, but for assurance it'd be worth it. This type of testing is great for catching problems like #4430.

The latter case is more intensive, and isn't yet automated end to end. Given that its setup reuses the same architecture as the public testnet, it's able to catch more subtle bugs, like #4443. For now, I'll continue to use this setup as part of pre-release QA.