breakage against Noble testnet v8.0.0-rc.2

conorsch commented 1 month ago

Describe the bug

An upcoming Noble chain upgrade to v8 is being prepared on the Noble testnet. For the Penumbra Labs testnet (https://testnet.plinfra.net), we've been running a version of Hermes that relays between penumbra-testnet-phobos-2 and grand-1. On or around 2024-10-17, we started observing breakage when communicating with the Noble testnet node endpoint run by Polkachu:

grpc: noble-testnet-grpc.polkachu.com:21590
rpc: https://noble-testnet-rpc.polkachu.com

I confirmed out of band with Polkachu that this breakage corresponded to deployment of the https://github.com/noble-assets/noble/releases/tag/v8.0.0-rc.2 tag to the testnet endpoint.

We first discovered this breakage when testing the behavior of the diff in https://github.com/penumbra-zone/penumbra/pull/4878. Similar breakage is also evident in the hermes relayer that PL is running.

Example error messages

When running on the feature branch for #4878:

❯ cargo run -q --release --bin pcli --  --home ~/.local/share/pcli view noble-address --channel channel-221 --noble-node http://noble-testnet-grpc.polkachu.com:21590/
Error: status: Internal, message: "failed to decode Protobuf message: TxResponse.raw_log: BroadcastTxResponse.tx_response: invalid string value: data is not UTF-8 encoded", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "15378766"} }

When viewing the logs for the hermes relayer instance between testnets:

Oct 17 23:17:22 hermes hermes[1971991]: 2024-10-17T23:17:22.462927Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events: RPC error: serde parse error: invalid utf-8 sequence of 1 bytes from index 0 at line 1 column 312, retrying in 1.5s... height=15322789
Oct 17 23:17:24 hermes hermes[1971991]: 2024-10-17T23:17:24.305214Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events: RPC error: serde parse error: invalid utf-8 sequence of 1 bytes from index 0 at line 1 column 312, retrying in 2s... height=15322789
Oct 17 23:17:26 hermes hermes[1971991]: 2024-10-17T23:17:26.646205Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events after 4 attempts: RPC error: serde parse error: invalid utf-8 sequence of 1 bytes from index 0 at line 1 column 312 height=15322789
Oct 17 23:37:58 hermes hermes[1971991]: 2024-10-17T23:37:58.192927Z ERROR ThreadId(23) spawn:chain{chain=grand-1}:client{client=07-tendermint-317}:connection{connection=connection-267}:channel{channel=channel-221}:worker.client.refresh{client=07-tendermint-317 src_chain=penumbra-testnet-phobos-2 dst_chain=grand-1}:foreign_client.refresh{client=penumbra-testnet-phobos-2->grand-1:07-tendermint-317}:foreign_client.validated_client_state{client=penumbra-testnet-phobos-2->grand-1:07-tendermint-317}: client state is not valid: latest height is outside of trusting period! latest_height=2-465929 network_timestamp=2024-10-17T23:37:51.810809503Z consensus_state_timestamp=2024-10-17T15:24:04.628116411Z elapsed=29627.182693092s
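As a sanity check on the last log line, the elapsed value is simply the difference between network_timestamp and consensus_state_timestamp. A minimal Rust sketch of that arithmetic follows. The helper names are hypothetical, the 20-minute trusting period is an assumption (the real value depends on how the client was created), and the strict comparison is an assumption about how expiry is decided, not a copy of the Hermes code.

```rust
use std::time::Duration;

/// Hypothetical helper: time of day as a Duration since midnight.
/// Both timestamps in the log fall on 2024-10-17, so the date can be ignored.
fn time_of_day(hours: u64, minutes: u64, seconds: u64, nanos: u32) -> Duration {
    Duration::new(hours * 3600 + minutes * 60 + seconds, nanos)
}

/// Hypothetical check: treat the client as expired once the time since its
/// latest consensus state exceeds the trusting period.
fn is_expired(elapsed: Duration, trusting_period: Duration) -> bool {
    elapsed > trusting_period
}

fn main() {
    // network_timestamp=2024-10-17T23:37:51.810809503Z
    let network = time_of_day(23, 37, 51, 810_809_503);
    // consensus_state_timestamp=2024-10-17T15:24:04.628116411Z
    let consensus = time_of_day(15, 24, 4, 628_116_411);
    let elapsed = network - consensus;

    // ASSUMPTION: a 20-minute trusting period, chosen only for illustration.
    let trusting_period = Duration::from_secs(20 * 60);

    println!("elapsed = {:?}", elapsed);
    println!("expired = {}", is_expired(elapsed, trusting_period));
}
```

The computed difference is a little over eight hours, which agrees with the elapsed=29627.182693092s reported in the log.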

Additional context

The gRPC endpoint is at least functional enough to return service descriptors:

❯ grpcurl -plaintext noble-testnet-grpc.polkachu.com:21590 list | grep noble.forwarding
noble.forwarding.v1.Query

We also know that the cometbft rpc is returning structured data:

❯ curl -s https://noble-testnet-rpc.polkachu.com/status | jq -r .result.node_info.version
0.38.12

That said, given the parse error, we still need to determine how the structure of the response violates assumptions in the code.
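For reference, the wording in the hermes logs matches what Rust's standard library reports when a byte slice is not valid UTF-8. A minimal sketch, assuming (this is not confirmed here) that the offending field carries raw bytes where a string is expected. The sample bytes are made up and read_as_text is a hypothetical helper:

```rust
use std::str;

/// Hypothetical helper: read bytes as UTF-8 text, returning the
/// standard library's error message on failure.
fn read_as_text(bytes: &[u8]) -> Result<&str, String> {
    str::from_utf8(bytes).map_err(|e| e.to_string())
}

fn main() {
    // Made-up payload: the byte 0xff can never appear in valid UTF-8.
    let raw: [u8; 3] = [0xff, b'o', b'k'];

    match read_as_text(&raw) {
        Ok(text) => println!("decoded: {text}"),
        Err(msg) => println!("decode failed: {msg}"),
    }

    // A lossy conversion does not fail; it replaces bad bytes with U+FFFD.
    println!("lossy: {}", String::from_utf8_lossy(&raw));
}
```

A strict decoder rejects the whole message on the first bad byte, which is consistent with both the pcli error and the hermes retries, while a lossy decoder would let the rest of the response through.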

conorsch commented 1 month ago

Based on a research spike by @avahowell in collaboration with Astria, we tried setting compat_mode = '0.37' in the hermes config for the noble testnet. With that setting, Hermes was able to create new channels and read chain state while starting up, but it quickly lapsed back into failing to parse rpc messages from the noble testnet node. Debug logs:

Oct 23 17:52:05 hermes hermes[2249147]: 2024-10-23T17:52:05.570757Z DEBUG ThreadId(27) event_source.rpc{chain.id=grand-1}: incoming response status=200 OK body={"jsonrpc":"2.0","id":"62c2c7de-e025-4fdb-b615-bb7946bc25d8","result":{"height":"15720881","txs_results":null,"finalize_block_events":null,"validator_updates":null,"consensus_param_updates":{"block":{"max_bytes":"5242880","max_gas":"-1"},"evidence":{"max_age_num_blocks":"100000","max_age_duration":"172800000000000","max_bytes":"1048576"},"validator":{"pub_key_types":["ed25519"]}},"app_hash":"nKlinSRSovQLIAX/VprNAPdNEVmw+ePctUKF0nS4o4s="}}
Oct 23 17:52:05 hermes hermes[2249147]: 2024-10-23T17:52:05.570822Z ERROR ThreadId(27) event_source.rpc{chain.id=grand-1}: failed to collect events: RPC error: serde parse error: subtle encoding error: bad encoding at line 1 column 441, retrying in 1.5s... height=15720881

We're surprised because this setting did resolve testnet relaying for Astria, but it hasn't for us. Another possible resolution is bumping the version of tendermint-rs that we rely on, to pick up bug fixes like the one described for v0.38.10:

This release fixes a bug in v0.38.x that prevented ABCI responses from being correctly read when upgrading from v0.37.x or below. It also includes a few other bug fixes and performance improvements.

Unclear whether upgrading the tendermint-rs version would constitute a consensus-breaking change. At the very least, we should understand whether bumping the dep resolves the issue we're seeing.
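One possible reading of the "bad encoding" error in the debug log, offered as a hypothesis rather than a confirmed diagnosis: column 441 of the logged body appears to fall at the end of the app_hash value, which looks like base64, so a parser that expects hex in that field would reject it. The sketch below uses a hypothetical strict hex decoder (decode_hex is not a tendermint-rs function) to show that the logged value cannot be read as hex:

```rust
/// Hypothetical strict hex decoder, standing in for whatever decoding the
/// parser applies to app_hash. Not the tendermint-rs implementation.
fn decode_hex(input: &str) -> Result<Vec<u8>, String> {
    if input.len() % 2 != 0 {
        return Err(format!("odd length: {}", input.len()));
    }
    input
        .as_bytes()
        .chunks(2)
        .map(|pair| {
            // Each pair of characters must be two hex digits.
            let high = (pair[0] as char).to_digit(16);
            let low = (pair[1] as char).to_digit(16);
            match (high, low) {
                (Some(h), Some(l)) => Ok((h * 16 + l) as u8),
                _ => Err(format!("bad encoding near {:?}", String::from_utf8_lossy(pair))),
            }
        })
        .collect()
}

fn main() {
    // app_hash value copied from the logged response body.
    let app_hash = "nKlinSRSovQLIAX/VprNAPdNEVmw+ePctUKF0nS4o4s=";
    println!("as hex: {:?}", decode_hex(app_hash));

    // A genuinely hex-encoded value decodes without trouble.
    println!("control: {:?}", decode_hex("9ca9"));
}
```

If this reading is right, the failure comes from a mismatch between the encoding the node emits and the one the selected compat mode expects, which would be worth checking before bumping dependencies.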

conorsch commented 1 month ago

Paired with @avahowell to investigate the hermes setup. Turns out that despite the logged error messages, hermes does still properly relay packets. The current penumbra testnet has a short unbonding period, which results in short-lived ibc clients (on the order of 20m or so currently). We confirmed that:

Hermes can create channels fine, and reports no errors
Testnet withdrawals from penumbra-testnet-phobos-2 -> grand-1 are relayed successfully by hermes
Hermes properly posts client update msgs to keep the channel open

The error messages are unfortunate, but also present on the penumbra/osmosis testnet service, which also uses cometbft v0.38.x on the counterparty side. We should rebase Hermes on latest upstream main, but that work should be tracked separately. We're also investigating a plan to publish the Penumbra workspace crates to crates.io, to support upstreaming the Penumbra config into hermes.

Still unresolved is the grpc problem that originally motivated this ticket. But as for the potential of breakage when Noble v8 is released, it appears that hermes operators should add compat_mode = '0.37' to relevant chain configs (i.e., for any chain that's using cometbft v0.38.x), and then relaying will continue to work.
