Closed freak12techno closed 3 months ago
Hey @freak12techno, could you run the Hermes instance with the flag --debug=rpc and share the full logs?
The scan is what can take a long time and/or fail, so we should indeed scan the chains in parallel, and gracefully handle any failures there. Once the scan is done, spawning the workers should be very fast and not a bottleneck at all.
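To illustrate the idea of scanning chains in parallel and handling failures gracefully, here is a minimal sketch using only the Rust standard library. It is not Hermes code: `scan_chain` and `scan_all` are hypothetical names, the failing osmosis-1 node is simulated, and the returned count is a placeholder for whatever a real scan would produce.

```rust
use std::collections::BTreeMap;
use std::thread;

// Hypothetical stand-in for a per-chain scan; NOT Hermes' actual API.
fn scan_chain(chain_id: &str) -> Result<usize, String> {
    // Assumption for the sketch: the osmosis-1 node is misbehaving.
    if chain_id == "osmosis-1" {
        Err(format!("{chain_id}: RPC node did not respond"))
    } else {
        // Placeholder for "number of items found by the scan".
        Ok(chain_id.len())
    }
}

// Scan every chain on its own thread and keep one outcome per chain,
// so a failed (or panicking) scan is recorded instead of aborting the rest.
fn scan_all(chain_ids: &[&str]) -> BTreeMap<String, Result<usize, String>> {
    thread::scope(|s| {
        // Spawn all scans first so they run concurrently.
        let handles: Vec<_> = chain_ids
            .iter()
            .map(|&id| (id, s.spawn(move || scan_chain(id))))
            .collect();

        // Then collect each result; a panic is turned into an error entry.
        handles
            .into_iter()
            .map(|(id, handle)| {
                let outcome = handle
                    .join()
                    .unwrap_or_else(|_| Err(format!("{id}: scan panicked")));
                (id.to_string(), outcome)
            })
            .collect()
    })
}
```

With this shape, a caller can iterate over the returned map, log the `Err` entries, and carry on with the chains whose scan returned `Ok`.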
@ljoss17 so I started Hermes with --debug=rpc and ran it for about 10 minutes; here are the logs: https://gist.github.com/freak12techno/43d9b674388f35b000b7b22979424483. Apparently at least the wallet worker for cosmoshub-4 never started, as I don't see any metrics for the cosmoshub-4 wallet balance. I do see the balance metrics for wallets on bitsong-2b, jackal-1, sentinel, and osmosis-1 (these chains are the last in the config).
@romac agreed. I also created another issue on scanning the chains in parallel, which should speed it up, but that is out of scope for this one.
@ljoss17 I seem to have the same issue again on 1.10.13: after restarting, all the chains seem to be scanned, but somehow Hermes isn't doing anything at all. I'm pretty sure it's because one of the nodes is misbehaving (likely the Osmosis one), but I have paths I can relay that do not involve Osmosis (for example DVPN <=> ATOM), and those aren't being relayed correctly either. I wonder if I should create another issue for that or reopen this one.
Metrics (as you can see, after the restart it isn't submitting anything at all):
Using pretty much the same config as above, and here are my logs: https://gist.github.com/freak12techno/8ce404c507700d3ac73f483d5ca6d2db. Can you check this out?
What happens if you comment out the osmosis chain and all channels tagged # Osmosis from your config?
It's weird that Hermes is scanning all clients/connections/channels on all chains. Are you still using an allowlist for each chain?
@romac sorry, I forgot that I had disabled the chain policy. Here's the up-to-date config: https://gist.github.com/freak12techno/3b8f3672521e77e0ff35e464a8dcdd21. Let me know if that helps or if you need something else.
My concern here is that it did finish scanning chains, but then something weird started to happen,
Summary of Bug
Somehow Hermes fails to spin up workers (or rather gets stuck spinning them up), and apparently it does so sequentially. As a result, if it fails to spin up workers for the first chain, the workers for the other chains never start, so Hermes doesn't do anything useful.
Example: Here's my test config: https://gist.github.com/freak12techno/1a995d3822d5fee50e4c569298c6b8d6, and running Hermes with trace logging produces these results: https://gist.github.com/freak12techno/74f02dd94df0fd627312f0f62a90a37f. It seems to finish spinning up workers for bitsong-2b, cosmoshub-4, and jackal-1, then gets stuck spinning up workers for osmosis-1 (not sure why, that's a separate matter, likely the node not working properly). All of the chains that come after osmosis-1 in the config (so sentinelhub-2 and neutron-1) are not loaded properly.
I have also seen a few cases where bitsong-2b is the faulty one, so none of the chains have their workers running and Hermes effectively does nothing for any chain.
I have a feeling that the worker spin-up process is sequential, and that the fix would be to make it asynchronous, so that a failure to load one chain doesn't prevent the others from loading. @ljoss17 I remember you investigating the packet clearing routine blocking Hermes, which should be somewhat similar. Can you check this out?
Just to clarify: my main concern is not a failure of a single chain (like in my example, Hermes failing to spin up Osmosis worker), but rather a failure of a single chain blocking Hermes from doing anything else.
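As a rough illustration of what a non-sequential spin-up could look like, the sketch below starts the setup for each chain on its own thread and waits for results only up to a deadline, so a chain that hangs does not hold back the rest. This is an assumption-laden example built on the Rust standard library, not Hermes' implementation: `spawn_workers` and `start_all` are hypothetical names, and the hanging osmosis-1 node is simulated with a sleep.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

// Hypothetical per-chain worker setup; a stand-in, NOT Hermes' real spawn logic.
fn spawn_workers(chain_id: &str) -> Result<(), String> {
    if chain_id == "osmosis-1" {
        // Assumption for the sketch: this node hangs well past the timeout.
        thread::sleep(Duration::from_secs(2));
    }
    Ok(())
}

// Start setup for every chain concurrently and return the chains that became
// ready before `timeout` elapsed. Slow or failing chains are simply left out.
fn start_all(chain_ids: &[&str], timeout: Duration) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    for &id in chain_ids {
        let tx = tx.clone();
        let id = id.to_string();
        thread::spawn(move || {
            let result = spawn_workers(&id);
            // The receiver may already be gone if we timed out; ignore that.
            let _ = tx.send((id, result));
        });
    }
    // Drop the original sender so the channel closes once all threads finish.
    drop(tx);

    let deadline = Instant::now() + timeout;
    let mut ready = Vec::new();
    loop {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match rx.recv_timeout(remaining) {
            Ok((id, Ok(()))) => ready.push(id),
            Ok((id, Err(e))) => eprintln!("{id}: worker setup failed: {e}"),
            // Either the deadline passed or every thread has reported.
            Err(_) => break,
        }
    }
    ready.sort();
    ready
}
```

Note that the sketch does not cancel the hung thread; it just stops waiting for it. A real fix would also need to decide whether to retry the stuck chain later or report it as unhealthy.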
Version
1.10.1
Steps to Reproduce
Acceptance Criteria
Failing to spin up a worker for one chain should not prevent Hermes from working properly on the others.