informalsystems / hermes

IBC Relayer in Rust
https://hermes.informal.systems
Apache License 2.0
451 stars 331 forks source link

Packets not clearing #1918

Closed RohitAudit closed 2 years ago

RohitAudit commented 2 years ago

Summary of Bug

I am using hermes relayer for gravity-bridge and persistence chain. I brought some tokens from ethereum to gravity and was trying to relay them to the persistence chain. The relayer was stopped and 6000 transaction was sent. When the relayer was started again, all packets got stuck and nothing is getting transmitted. I restarted the relayer after an hour but it is still the same. Link to the transaction

Debug Logs Printed:

2022-02-24T16:31:49.931978Z  INFO ThreadId(01) Hermes has started
2022-02-24T16:31:57.392575Z DEBUG ThreadId(1529) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}:clear: packet seq. that still have commitments on gravity-bridge-3: 227, 228, 229, 230, 231, 232, 233, 234, 235, 236 (first 10 shown here; total=6029)
2022-02-24T16:31:57.392701Z DEBUG ThreadId(1529) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}:clear: recv packets to send out to core-1 of the ones with commitments on gravity-bridge-3: 227, 228, 229, 230, 231, 232, 233, 234, 235, 236 (first 10 shown here; total=6029)

Version

v0.11.0 and v0.12.0

Steps to Reproduce

Start a relayer between gravity and persistence and pool more than 6000 transactions and then start the relayer to clear empty packets.


For Admin Use

romac commented 2 years ago

You might be able to clear all the packets on a given channel using the clear packets command added in v0.12.0: https://hermes.informal.systems/commands/relaying/clear.html

Nonetheless, it is indeed not normal that Hermes is not clearing any packets on start.

As a sanity check, do you have clear_on_start = true in the config?

RohitAudit commented 2 years ago

Yes, the config is correct. Also, I did use clear packets command added in v0.12.0. Got the same output and no extra logs for an hour. Should I just leave it for few hours to see if it clears or not?

Can hermes clear the packets in batch selected by us?

RohitAudit commented 2 years ago

Update: left the relayer for almost 4-5 hours. Some of the transaction went through but are really slow. It took relayer more than 2 hours to do the first transaction. Some error logs I got:

Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.682903Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: send_tx: using 2162713 gas, fee Fee { amount: "0ugraviton", gas_limit: 4325426 } id=gravity-bridge-3
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.698368Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: broadcast_tx_sync: Response { code: Ok, data: Data([]), log: Log("[]"), hash: transaction::Hash(74DE7A2A6B5E948EF0C3A5CDBA639A7389542EAFA28C5348CACB4527F300EC0A) }
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.698614Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: sending 39 messages using account sequence 188
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.698986Z DEBUG ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}: max fee, for use in tx simulation: Fee { amount: "0ugraviton", gas_limit: 200000000 }
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.706055Z ERROR ThreadId(17) send_tx_check{id=L6pdf312c0}:send_tx{id=gravity-bridge-3}:estimate_gas: failed to simulate tx. propagating error to caller: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 0: timeout packet verification failed: clientID (07-tendermint-39), height (1-4963470): consensus state not found", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "993762"} }
Feb 25 04:45:25 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:25.714566Z ERROR ThreadId(1546) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}: will retry: schedule execution encountered error: failed with underlying error: gRPC call failed with status: status: Unknown, message: "failed to execute message; message index: 0: timeout packet verification failed: clientID (07-tendermint-39), height (1-4963470): consensus state not found", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "x-cosmos-block-height": "993762"} } retry_index=1
Feb 25 04:45:26 ip-172-117-101-44 hermes[177290]: 2022-02-25T04:45:26.935518Z DEBUG ThreadId(1546) packet_cmd{src_chain=gravity-bridge-3 src_port=transfer src_channel=channel-24 dst_chain=core-1}:clear: packet seq. that still have commitments on gravity-bridge-3: 261, 262, 263, 264, 265, 266, 267, 268, 269, 270 (first 10 shown here; total=5998)

Can you suggest some solution to this?

adizere commented 2 years ago

Wow, this is an interesting case, thanks for flagging @RohitAudit !

left the relayer for almost 4-5 hours. Some of the transaction went through but are really slow. It took relayer more than 2 hours to do the first transaction.

This is because of two reasons:

  1. Full node queries are slow, so assembling 6K txes (depending on the full node specs) can take a long time.
  2. Hermes has no pagination/batching support, so it tries to naively pull the packet data for all 6K transactions before relaying any single one of them.

One thing we can do is to add pagination support in Hermes (ref: #1816).

Even with pagination support, I'm not 100% about the error "consensus state not found" in your last comment. We're still looking into that. Can you tell us the SDK version that gravity-bridge-3 is running on?

RohitAudit commented 2 years ago

gravity-bridge-3 use cosmos-sdk v0.44.5 and ibc version v2.0.2. Also, if I connect the relayer with archival nodes, will it make the transaction faster??

adizere commented 2 years ago

gravity-bridge-3 use cosmos-sdk v0.44.5 and ibc version v2.0.2. Also, if I connect the relayer with archival nodes, will it make the transaction faster??

Thanks for the quick feedback!

WRT archival nodes: I think it may speed things up, yes (esp. if the nodes have also slightly higher-compute specs), but will ask some relayer operators if they have ideas of potential optimizations.

faddat commented 2 years ago

First of all, great work. Testing the network like this is the only way we improve much of anything.

Secondly, having an archive node can help. With that said, you will need very fast hard drives for that to work. Something that our team does when we encounter congestion is we run both the go relayer and Hermes.

They seem to compliment one another in practice in ways that we don’t fully understand yet but definitely have seen in action.

If you do that you will want to have two separate wallets. One for the go relayer and one for Hermes.

Concerning speed, my only consistent piece of advice is you want really fast drives.

clemensgg commented 2 years ago

in my experience querying archives is less effective because of the sheer db size involved; best would be to have a statesynced node that only retains a chainstate a bit longer than the packets are stuck.

@adizere in my opinion having pagination or batching support in such situations would greatly increase relayer production experience

faddat commented 2 years ago

You know you’re right and I probably should’ve mentioned that the archive node only really helps when you’re dealing with older packets

So one thing you probably do not want to do is state sync right now but instead, next time or you might not have the needed history

RohitAudit commented 2 years ago

I tried using two hermes relayer but both of them picks the same transaction and one fails. For now, I have started a go relayer as well. Lets see if it helps increase the speed. Thanks for the quick response.

clemensgg commented 2 years ago

the only thing that will significantly help this situation are dedicated endpoints for each redundant relayer (particularly on the src_chain)

faddat commented 2 years ago

yep, more endpoints will be more better

adizere commented 2 years ago

We'll be tracking support for query pagination in https://github.com/informalsystems/ibc-rs/issues/1816. Closing this as hopefully addressed.