Open conorsch opened 1 month ago
Based on a research spike by @avahowell in collaboration with Astria, we tried setting compat_mode = '0.37' in the hermes config for the noble testnet. With that setting, Hermes was able to create new channels and read chain state while starting up, but it quickly lapsed back into failing to parse RPC messages from the noble testnet node. Debug logs:
Oct 23 17:52:05 hermes hermes[2249147]: 2024-10-23T17:52:05.570757Z DEBUG ThreadId(27) event_source.rpc{chain.id=grand-1}: incoming response status=200 OK body={"jsonrpc":"2.0","id":"62c2c7de-e025-4fdb-b615-bb7946bc25d8","result":{"height":"15720881","txs_results":null,"finalize_block_events":null,"validator_updates":null,"consensus_param_updates":{"block":{"max_bytes":"5242880","max_gas":"-1"},"evidence":{"max_age_num_blocks":"100000","max_age_duration":"172800000000000","max_bytes":"1048576"},"validator":{"pub_key_types":["ed25519"]}},"app_hash":"nKlinSRSovQLIAX/VprNAPdNEVmw+ePctUKF0nS4o4s="}}
Oct 23 17:52:05 hermes hermes[2249147]: 2024-10-23T17:52:05.570822Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events: RPC error: serde parse error: subtle encoding error: bad encoding at line 1 column 441, retrying in 1.5s... height=15720881
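One way to narrow down the parse error above is to check what sits at column 441 of the logged body, and how the nearby values are encoded. The sketch below is a rough diagnostic, not anything from the Hermes codebase: the helper names `context_at` and `classify_encoding` are hypothetical, and the idea that the parser expects hex where the node sends base64 is only a hypothesis suggested by the "subtle encoding error" wording.

```python
import base64
import json

# Response body copied verbatim from the DEBUG log line above.
BODY = '{"jsonrpc":"2.0","id":"62c2c7de-e025-4fdb-b615-bb7946bc25d8","result":{"height":"15720881","txs_results":null,"finalize_block_events":null,"validator_updates":null,"consensus_param_updates":{"block":{"max_bytes":"5242880","max_gas":"-1"},"evidence":{"max_age_num_blocks":"100000","max_age_duration":"172800000000000","max_bytes":"1048576"},"validator":{"pub_key_types":["ed25519"]}},"app_hash":"nKlinSRSovQLIAX/VprNAPdNEVmw+ePctUKF0nS4o4s="}}'


def context_at(body: str, column: int, width: int = 24) -> str:
    """Return the text leading up to (and including) a 1-indexed column."""
    return body[max(0, column - width):column]


def classify_encoding(value: str) -> str:
    """Report whether a string decodes as hex, as base64, or as neither."""
    try:
        bytes.fromhex(value)
        return "hex"
    except ValueError:
        pass
    try:
        base64.b64decode(value, validate=True)
        return "base64"
    except ValueError:  # binascii.Error is a subclass of ValueError
        return "unknown"


app_hash = json.loads(BODY)["result"]["app_hash"]

# Show what the parser had just consumed when it reported column 441.
print("text up to column 441:", context_at(BODY, 441))
# Show how the app_hash value is encoded and how many bytes it decodes to.
print("app_hash encoding:", classify_encoding(app_hash))
print("app_hash byte length:", len(base64.b64decode(app_hash)))
```

Column 441 lands on the brace that closes the result object, immediately after the app_hash value, and that value is base64 rather than hex. That is consistent with the failure being in or near app_hash, but it does not rule out other fields in the same object, so it should be treated as a lead to verify against the tendermint-rs parsing code.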
We're surprised because this setting did resolve testnet relaying for Astria, but it hasn't for us. Another possible resolution is bumping the version of tendermint-rs that we rely on, to include bug fixes like those in v0.38.10:
This release fixes a bug in v0.38.x that prevented ABCI responses from being correctly read when upgrading from v0.37.x or below. It also includes a few other bug fixes and performance improvements.
Unclear whether upgrading the tendermint-rs version would constitute a consensus-breaking change. At the very least, we should understand whether bumping the dep resolves the issue we're seeing.
Paired with @avahowell to investigate the hermes setup. Turns out that despite the logged error messages, hermes does still properly relay packets. The current penumbra testnet has a short unbonding period, which results in short-lived ibc clients (on the order of 20m or so currently). We confirmed that packets from penumbra-testnet-phobos-2 -> grand-1 are relayed successfully by hermes.
The error messages are unfortunate, but also present on the penumbra/osmosis testnet service, which also uses cometbft v0.38.x on the counterparty side. We should rebase Hermes on latest upstream main, but that work should be tracked separately. We're also investigating a plan to publish the Penumbra workspace crates to crates.io, to support upstreaming the Penumbra config into hermes.
Still unresolved is the grpc problem that originally motivated this ticket. But as for the potential of breakage when Noble v8 is released, it appears that hermes operators should add compat_mode = '0.37' to relevant chain configs (i.e., for any chain that's using cometbft v0.38.x), and then relaying will continue to work.
Describe the bug
An upcoming Noble chain upgrade to v8 is being prepared on the Noble testnet. For the Penumbra Labs testnet (https://testnet.plinfra.net), we've been running a version of Hermes that relays between penumbra-testnet-phobos-2 and grand-1. On or around 2024-10-17, we started observing breakage when communicating with the Noble testnet node endpoint run by Polkachu: noble-testnet-grpc.polkachu.com:21590
I confirmed out of band with Polkachu that this breakage corresponded to deployment of the https://github.com/noble-assets/noble/releases/tag/v8.0.0-rc.2 tag to the testnet endpoint.
We first discovered this breakage when testing the behavior of the diff in https://github.com/penumbra-zone/penumbra/pull/4878. Similar breakage is also evident in the hermes relayer that PL is running.
Example error messages
When running on the feature branch for #4878:
When viewing the logs for the hermes relayer instance between testnets:
Additional context
The gRPC endpoint is at least functional enough to return service descriptors:
We also know that the cometbft rpc is returning structured data:
That said, we still need to determine how the structure violates assumptions in the code, given the parse error.