informalsystems / hermes

IBC Relayer in Rust
https://hermes.informal.systems
Apache License 2.0
440 stars 326 forks source link

`clear_interval` parameter is not working as expected #2445

Closed achilleas-kal closed 2 years ago

achilleas-kal commented 2 years ago

Summary of Bug

clear_interval is set to 30 and packets are not being cleared automatically, I have to manually clear packets with the hermes clear packets command every once in a while.

Version

hermes 0.13.0+2375a069


For Admin Use

adizere commented 2 years ago

Can you share your [mode.packets] section of your config.toml?

romac commented 2 years ago

Assuming the config is fine, we will need to see the logs in order to diagnose the issue.

achilleas-kal commented 2 years ago

[mode.packets] enabled = true

clear_interval = 30

clear_on_start = true

tx_confirmation = false

Sure, here's the config @adizere

Maybe tx_confirmation should be set to true for this feature to work properly?

adizere commented 2 years ago

Maybe tx_confirmation should be set to true for this feature to work properly?

Hmm, no. Periodic packet clearing should trigger every clear_interval blocks independently of tx_confirmation, but this is a good lead to investigate whether we have a bug.

Some situations where periodic packet clearing does not work is if:

Do you think any of the two cases above might be the root cause? Are you seeing any activity for that channel in Hermes logs or metrics?

achilleas-kal commented 2 years ago

the packet workers for the channel you want to see cleared did not start altogether (Hermes did not get any WS events associated with that channel since it started)

Well I've been making transfers while that packet was stuck so I'd say that's not the case.

the channels are filtered out via policy based packet filtering

The channel is not filtered for sure, generally if it was I wouldn't be able to relay packets at all plus the explicit clear command wouldn't work right? I've manually checked as well and the channel is allowed.

adizere commented 2 years ago

Thanks for the quick feedback and checking on those things! This means we need to dig in the code + try to recreate this somehow. If you have any insights on what might help reproducing this behavior would be great.

the explicit clear command wouldn't work right

The explicit clear CLI ignores the packet filtering specified in config.tom, it would work.

achilleas-kal commented 2 years ago

@adizere I can try to replicate a stuck packet on my end and dig into the logs, any insight on what I'm looking for would help.

adizere commented 2 years ago

thanks!

any insight on what I'm looking for would help.

Here are some things I would do if I had access to your setup:

  1. start Hermes

  2. do $ kill -USR1 $HERMES_PID

    • that should give the full details of which are the running workers in Hermes, example:
    • 2022-07-27T14:46:32.679956Z INFO ThreadId(38) dumping state (triggered by SIGUSR1) 2022-07-27T14:46:32.693740Z INFO ThreadId(1114) 2022-07-27T14:46:32.693880Z INFO ThreadId(1114) Chains: ibc-0, ibc-1 2022-07-27T14:46:32.693896Z INFO ThreadId(1114) Client workers: 2022-07-27T14:46:32.693908Z INFO ThreadId(1114) - client::ibc-0->ibc-1:07-tendermint-1 (id: 3) 2022-07-27T14:46:32.693920Z INFO ThreadId(1114) | misbehaviour: true, refresh: true 2022-07-27T14:46:32.693933Z INFO ThreadId(1114) - client::ibc-1->ibc-0:07-tendermint-1 (id: 1) 2022-07-27T14:46:32.693992Z INFO ThreadId(1114) | misbehaviour: true, refresh: true 2022-07-27T14:46:32.694011Z INFO ThreadId(1114) Packet workers: 2022-07-27T14:46:32.694025Z INFO ThreadId(1114) - packet::channel-3/transfer:ibc-0->ibc-1 (id: 5) 2022-07-27T14:46:32.694036Z INFO ThreadId(1114) - packet::channel-4/transfer:ibc-1->ibc-0 (id: 6) 2022-07-27T14:46:32.694048Z INFO ThreadId(1114) Wallet workers: 2022-07-27T14:46:32.694059Z INFO ThreadId(1114) - wallet::ibc-0 (id: 2) 2022-07-27T14:46:32.694070Z INFO ThreadId(1114) - wallet::ibc-1 (id: 4) 2022-07-27T14:46:32.694083Z INFO ThreadId(1114)

      • this allows to understand if Hermes is actually activating on the problematic channel
  3. Would be curious to know on what channel are packets stuck and which are the two networks at the ends of that channel

    • any logs showing activity on that channel would be of help
  4. The full log of clear packets might be useful

Generally, if you open up all logs from hermes start and point out the channel or packet sequence numbers to us we can likely find some interesting things.

achilleas-kal commented 2 years ago

Hey @adizere, apologies for the delay just managed to look into this.

  1. Stuck packet: https://gist.github.com/achilleas-kal/3255482d5dd4effad48714ba88d3b85d (between Persistence and Injective, channel-41 and channel-82 respectively).
  2. Full dump logs of the last few minutes: https://gist.github.com/achilleas-kal/fda02c9b36b8b9c6ec88da2bf666a6f5
  3. Interestingly the packet was cleared after killing the pod, I assume through clear_on_start which relates to your issue here: https://github.com/informalsystems/ibc-rs/issues/2462 so didn't manage to get the clear packets log
  4. I found something interesting in the full logs though:

2022-08-01T07:17:30.564645Z WARN ThreadId(41) refresh{client=07-tendermint-57 src_chain=injective-1 dst_chain=core-1}: task encountered ignorable error: error raised while trying to refresh client 07-tendermint-57: failed querying client state on dst chain: RPC error to endpoint https://tm.persistence.injective.network/: serde parse error: expected value at line 1 column 1

I got the serde parse error again but this time there is no reference to a data field as we show in: https://github.com/informalsystems/ibc-rs/issues/2444 which also makes sense because the RPC in question (from Persistence) has both version and data fields exposed on the ABCI query. Something that might be worth looking at here.

  1. Some logs from the workers(filtered upon hermes start) https://gist.github.com/achilleas-kal/c5646bde8c140b2509e83b989030c37c

Hope this helps, btw many of our relayers on Injective have mentioned issues with stuck packets whereby they had to take manual action with hermes clear packets across a number of chains, it's not an isolated issue.

romac commented 2 years ago

4. I found something interesting in the full logs though:

2022-08-01T07:17:30.564645Z WARN ThreadId(41) refresh{client=07-tendermint-57 src_chain=injective-1 dst_chain=core-1}: task encountered ignorable error: error raised while trying to refresh client 07-tendermint-57: failed querying client state on dst chain: RPC error to endpoint https://tm.persistence.injective.network/: serde parse error: expected value at line 1 column 1

I got the serde parse error again but this time there is no reference to a data field as we show in: #2444 which also makes sense because the RPC in question (from Persistence) has both version and data fields exposed on the ABCI query. Something that might be worth looking at here.

This is a different query than in #2444, but rather an ABCI query for the path clients/07-tendermint-57/clientState, issued in the client refresh worker. If you can figure out from the logs at which height the query was issued, then you can perhaps try to query for this manually?

romac commented 2 years ago

This seems to work as expected:

❯ curl https://tm.persistence.injective.network -d '{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "method": "abci_query",
  "params": {
    "path": "store/ibc/key",
    "data": "636C69656E74732F30372D74656E6465726D696E742D35372F636C69656E745374617465",
    "height": "0",
    "prove": false
  }
}'

Note 1: the value of params.data above is the hex-encoding of clients/07-tendermint-57/clientState

{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "result": {
    "response": {
      "code": 0,
      "log": "",
      "info": "",
      "index": "0",
      "key": "Y2xpZW50cy8wNy10ZW5kZXJtaW50LTU3L2NsaWVudFN0YXRl",
      "value": "CisvaWJjLmxpZ2h0Y2xpZW50cy50ZW5kZXJtaW50LnYxLkNsaWVudFN0YXRlEoMBCgtpbmplY3RpdmUtMRIECAEQAxoECIDqSSIECIDfbioCCCgyADoHCAEQjLG6BkIZCgkIARgBIAEqAQASDAoCAAEQIRgEIAwwAUIZCgkIARgBIAEqAQASDAoCAAEQIBgBIAEwAUoHdXBncmFkZUoQdXBncmFkZWRJQkNTdGF0ZVABWAE=",
      "proofOps": null,
      "height": "7284159",
      "codespace": ""
    }
  }
}

Same if we specify the height explicitly:

❯ curl https://tm.persistence.injective.network -d '{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "method": "abci_query",
  "params": {
    "path": "store/ibc/key",
    "data": "636C69656E74732F30372D74656E6465726D696E742D35372F636C69656E745374617465",
    "height": "7250081",
    "prove": false
  }
}'
{
  "jsonrpc": "2.0",
  "id": "d18e80f5-e2fc-4381-aa2c-996b97116722",
  "result": {
    "response": {
      "code": 0,
      "log": "",
      "info": "",
      "index": "0",
      "key": "Y2xpZW50cy8wNy10ZW5kZXJtaW50LTU3L2NsaWVudFN0YXRl",
      "value": "CisvaWJjLmxpZ2h0Y2xpZW50cy50ZW5kZXJtaW50LnYxLkNsaWVudFN0YXRlEoMBCgtpbmplY3RpdmUtMRIECAEQAxoECIDqSSIECIDfbioCCCgyADoHCAEQjLG6BkIZCgkIARgBIAEqAQASDAoCAAEQIRgEIAwwAUIZCgkIARgBIAEqAQASDAoCAAEQIBgBIAEwAUoHdXBncmFkZUoQdXBncmFkZWRJQkNTdGF0ZVABWAE=",
      "proofOps": null,
      "height": "7250081",
      "codespace": ""
    }
  }
}

Note 2: here I am using the height before the first NewBlock events found in the logs above: 7250082 - 1 = 7250081

romac commented 2 years ago

I also tried setting up the injective-1 chain (I just specified the RPC address as `), andhermes query client state` gives me the expected result:

❯ RUST_LOG=tendermint_rpc=debug hermes query client state --chain injective-1 --client 07-tendermint-57
2022-08-03T15:40:06.639469Z DEBUG ThreadId(01) Incoming response: {
  "jsonrpc": "2.0",
  "id": "7c28b0ef-f6ef-49f2-935e-a323ade953af",
  "result": {
    "node_info": {
      "protocol_version": {
        "p2p": "8",
        "block": "11",
        "app": "0"
      },
      "id": "e0e1dd9d329d60c43b642f77dbce7199068fb2fa",
      "listen_addr": "tcp://65.0.93.49:26656",
      "network": "core-1",
      "version": "v0.34.14",
      "channels": "40202122233038606100",
      "moniker": "node",
      "other": {
        "tx_index": "on",
        "rpc_address": "tcp://0.0.0.0:26657"
      }
    },
    "sync_info": {
      "latest_block_hash": "E056FE4E05BC7D1CD5A9FB4F186ABFEA64A326E7B9CC106FDD285474431CF3FD",
      "latest_app_hash": "636C3884307740CF094C3042C0F157B82A5F8B9BDDFB3A004BF709F308A6F103",
      "latest_block_height": "7284306",
      "latest_block_time": "2022-08-03T15:39:58.404070601Z",
      "earliest_block_hash": "D9673C1D0DF6ACADCA29501A7EBC0D3FE6FBC7116FF2774D583C6625F07371F5",
      "earliest_app_hash": "E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855",
      "earliest_block_height": "1",
      "earliest_block_time": "2021-03-30T12:08:46.515027422Z",
      "catching_up": false
    },
    "validator_info": {
      "address": "059BD993FB36CC110108520614C3569AB913F2B2",
      "pub_key": {
        "type": "tendermint/PubKeyEd25519",
        "value": "lcjeCEqOE4ysLdoT+BWxLLZAOCtwgW10LMaVqgzQCA0="
      },
      "voting_power": "0"
    }
  }
}
2022-08-03T15:40:07.259676Z DEBUG ThreadId(15) Incoming response: {
  "jsonrpc": "2.0",
  "id": "e546034f-d93b-4e14-a818-936405e51f18",
  "result": {
    "response": {
      "code": 0,
      "log": "",
      "info": "",
      "index": "0",
      "key": "Y2xpZW50cy8wNy10ZW5kZXJtaW50LTU3L2NsaWVudFN0YXRl",
      "value": "CisvaWJjLmxpZ2h0Y2xpZW50cy50ZW5kZXJtaW50LnYxLkNsaWVudFN0YXRlEoMBCgtpbmplY3RpdmUtMRIECAEQAxoECIDqSSIECIDfbioCCCgyADoHCAEQjLG6BkIZCgkIARgBIAEqAQASDAoCAAEQIRgEIAwwAUIZCgkIARgBIAEqAQASDAoCAAEQIBgBIAEwAUoHdXBncmFkZUoQdXBncmFkZWRJQkNTdGF0ZVABWAE=",
      "proofOps": null,
      "height": "7284306",
      "codespace": ""
    }
  }
}
SUCCESS Tendermint(
    ClientState {
        chain_id: ChainId {
            id: "injective-1",
            version: 1,
        },
        trust_level: TrustThreshold {
            numerator: 1,
            denominator: 3,
        },
        trusting_period: 1209600s,
        unbonding_period: 1814400s,
        max_clock_drift: 40s,
        latest_height: Height {
            revision: 1,
            height: 13539468,
        },
        proof_specs: ProofSpecs(
            [
                ProofSpec(
                    ProofSpec {
                        leaf_spec: Some(
                            LeafOp {
                                hash: Sha256,
                                prehash_key: NoHash,
                                prehash_value: Sha256,
                                length: VarProto,
                                prefix: [
                                    0,
                                ],
                            },
                        ),
                        inner_spec: Some(
                            InnerSpec {
                                child_order: [
                                    0,
                                    1,
                                ],
                                child_size: 33,
                                min_prefix_length: 4,
                                max_prefix_length: 12,
                                empty_child: [],
                                hash: Sha256,
                            },
                        ),
                        max_depth: 0,
                        min_depth: 0,
                    },
                ),
                ProofSpec(
                    ProofSpec {
                        leaf_spec: Some(
                            LeafOp {
                                hash: Sha256,
                                prehash_key: NoHash,
                                prehash_value: Sha256,
                                length: VarProto,
                                prefix: [
                                    0,
                                ],
                            },
                        ),
                        inner_spec: Some(
                            InnerSpec {
                                child_order: [
                                    0,
                                    1,
                                ],
                                child_size: 32,
                                min_prefix_length: 1,
                                max_prefix_length: 1,
                                empty_child: [],
                                hash: Sha256,
                            },
                        ),
                        max_depth: 0,
                        min_depth: 0,
                    },
                ),
            ],
        ),
        upgrade_path: [
            "upgrade",
            "upgradedIBCState",
        ],
        allow_update: AllowUpdate {
            after_expiry: true,
            after_misbehaviour: true,
        },
        frozen_height: None,
    },
)

Note: I set RUST_LOG=tendermint_rpc=debug in the command above to inspect the raw JSON response of the command.

adizere commented 2 years ago

Closing. Confirmed with Achilleas that after upgrading to v1 the problem is no longer observed.