informalsystems / hermes

IBC Relayer in Rust
https://hermes.informal.systems
Apache License 2.0

Something blocking Hermes from properly starting up and spinning up workers #4101

Closed freak12techno closed 3 weeks ago

freak12techno commented 1 month ago

Summary of Bug

Hermes fails to spin up workers (or rather gets stuck spinning them up), and it appears to do this sequentially: if it gets stuck on one chain, the workers for the chains after it never start, so Hermes does not do anything useful.

Example: here's my test config: https://gist.github.com/freak12techno/1a995d3822d5fee50e4c569298c6b8d6, and running Hermes with trace logging produces these results: https://gist.github.com/freak12techno/74f02dd94df0fd627312f0f62a90a37f. It seems to finish spinning up workers for bitsong-2b, cosmoshub-4 and jackal-1, then gets stuck spinning up workers for osmosis-1 (not sure why, that's a separate question, likely the node not working properly), and none of the chains that come after osmosis-1 in the config (sentinelhub-2 and neutron-1) are loaded properly.

A few times I have also seen bitsong-2b be the faulty one. In that case no chain gets its workers running, so Hermes effectively does nothing for any of the chains.

I have a feeling that the worker startup process is sequential, and that the fix would be to make it asynchronous, so that failing to load one chain doesn't block loading the others. @ljoss17 I remember you investigating the packet clearing routine blocking Hermes, which should be somewhat similar. Can you check this out?

Just to clarify: my main concern is not the failure of a single chain (like in my example, where Hermes fails to spin up the Osmosis worker), but rather that the failure of a single chain blocks Hermes from doing anything else.

Version

1.10.1

Steps to Reproduce

  1. Use my Hermes config
  2. Start the relayer
  3. Expect something like this in logs if Hermes somehow is stuck spinning up a worker.

Acceptance Criteria

Failing to spin up a worker for one chain should not prevent Hermes from working properly on the others.

ljoss17 commented 1 month ago

Hey @freak12techno, could you run the Hermes instance with the flag --debug=rpc and share the full logs?

romac commented 1 month ago

The scan is what can take a long time and/or fail, so we should indeed scan the chains in parallel, and gracefully handle any failures there. Once the scan is done, spawning the workers should be very fast and not a bottleneck at all.
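As a rough illustration of the idea (scan the chains in parallel and tolerate failures), here is a minimal sketch using only the Rust standard library. It is not Hermes's actual code: `ChainScan`, `scan_chain`, `scan_all`, the simulated osmosis-1 hang and the timeout value are all hypothetical, chosen only to mirror the situation described in this issue.

```rust
use std::collections::BTreeMap;
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical result of scanning one chain (not a real Hermes type).
#[derive(Debug, Clone, PartialEq)]
struct ChainScan {
    chain_id: String,
    channels: usize,
}

/// Hypothetical stand-in for the per-chain scan. Here "osmosis-1"
/// simulates a misbehaving node by hanging for a while before failing.
fn scan_chain(chain_id: &str) -> Result<ChainScan, String> {
    if chain_id == "osmosis-1" {
        thread::sleep(Duration::from_secs(2));
        return Err(format!("{chain_id}: RPC node did not respond"));
    }
    Ok(ChainScan { chain_id: chain_id.to_string(), channels: 1 })
}

/// Scan every chain on its own thread and wait at most `timeout` overall.
/// Returns the successful scans plus a map of chain id -> failure reason.
fn scan_all(
    chain_ids: &[&str],
    timeout: Duration,
) -> (Vec<ChainScan>, BTreeMap<String, String>) {
    let (tx, rx) = mpsc::channel();
    for id in chain_ids {
        let tx = tx.clone();
        let id = id.to_string();
        thread::spawn(move || {
            let result = scan_chain(&id);
            // The receiver may already be gone after a timeout; ignore that.
            let _ = tx.send((id, result));
        });
    }
    // Drop the original sender so the channel closes once all threads finish.
    drop(tx);

    // Assume every chain timed out until its result actually arrives.
    let mut failed: BTreeMap<String, String> = chain_ids
        .iter()
        .map(|id| (id.to_string(), "scan timed out".to_string()))
        .collect();
    let mut scanned = Vec::new();
    let deadline = Instant::now() + timeout;

    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok((id, Ok(scan))) => {
                failed.remove(&id);
                scanned.push(scan);
            }
            Ok((id, Err(reason))) => {
                failed.insert(id, reason);
            }
            // Either the deadline passed or every thread has reported.
            Err(_) => break,
        }
    }
    scanned.sort_by(|a, b| a.chain_id.cmp(&b.chain_id));
    (scanned, failed)
}

fn main() {
    let chains = ["bitsong-2b", "cosmoshub-4", "jackal-1", "osmosis-1", "neutron-1"];
    let (scanned, failed) = scan_all(&chains, Duration::from_millis(200));
    for scan in &scanned {
        println!("scanned: {:?}", scan);
    }
    for (chain_id, reason) in &failed {
        println!("skipped {chain_id}: {reason}");
    }
}
```

In this sketch a chain that hangs or errors only ends up in the `failed` map, and workers could still be spawned for everything in `scanned`. Whether a skipped chain should be retried later is a separate design question that the sketch does not address.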

freak12techno commented 1 month ago

@ljoss17 I started Hermes with --debug=rpc and ran it for about 10 minutes, here are the logs: https://gist.github.com/freak12techno/43d9b674388f35b000b7b22979424483. Apparently at least the wallet worker for cosmoshub-4 never started, as I don't see any metrics for the cosmoshub-4 wallet balance. I do see the balance metrics for the wallets on bitsong-2b, jackal-1 and sentinel, and the same goes for osmosis-1 (these chains are the last in the config).

@romac agreed. I also created another issue about scanning the chains in parallel, which should speed things up, but that is out of scope for this one.

freak12techno commented 5 days ago

@ljoss17 I seem to have the same issue again on 1.10.13. After restarting, it looks like all the chains are scanned, but Hermes isn't doing anything at all. I'm pretty sure it's because one of the nodes is misbehaving (likely the Osmosis one), but I have paths I can relay that don't involve Osmosis (for example DVPN <=> ATOM), and those aren't being relayed correctly either. I wonder if I should create another issue for that, or reopen this one.

Metrics (as you can see, after the restart it isn't submitting anything at all):

image

I'm using pretty much the same config as above, and here are my logs: https://gist.github.com/freak12techno/8ce404c507700d3ac73f483d5ca6d2db. Can you check this out?

romac commented 5 days ago

What happens if you comment out the osmosis chain and all channels tagged # Osmosis from your config?

romac commented 5 days ago

It's weird that Hermes is scanning all clients/connections/channels on all chains. Are you still using an allowlist for each chain?

freak12techno commented 5 days ago

@romac sorry, I forgot that I had disabled the chain policy. Here's the up-to-date config: https://gist.github.com/freak12techno/3b8f3672521e77e0ff35e464a8dcdd21. Let me know if that helps or if you need something else.

My concern here is that it did finish scanning chains, but then something weird started to happen,