Closed by ebma 4 months ago
@pendulum-chain/product this is a bug that seems to occur more frequently when running the client in a cluster. We should try to investigate and fix this soon.
@b-yap I checked the logs of a successful restart on a Pendulum vault, and it looks like this:
Jun 20 09:00:13.449 INFO stellar_relay_lib::connection::connector::message_reader: poll_messages_from_stellar(): started.
Jun 20 09:00:13.463 INFO stellar_relay_lib::connection::connector::message_handler: process_stellar_message(): Hello message processed successfully
Jun 20 09:00:15.459 INFO vault::system: Done processing open requests
Jun 20 09:00:22.957 INFO vault::system: Starting all services...
Jun 20 10:00:17.112 ERROR jsonrpsee_core::client::async_client: [backend]: Networking or low-level protocol error: WebSocket connection error: connection closed
Jun 20 10:00:17.114 INFO vault::system: try_shutdown_wallet(): stop the resubmission scheduler
Jun 20 10:00:17.114 WARN service: Disconnected: RuntimeError: Channel closed unexpectedly
Jun 20 10:00:17.114 ERROR service: Waiting for 2 tasks to shut down...
Jun 20 10:00:17.494 INFO vault::oracle::agent: start_oracle_agent(): disconnect overlay...
Jun 20 10:00:17.494 INFO stellar_relay_lib::overlay: stop(): closing connection to overlay network
Jun 20 10:00:17.494 INFO stellar_relay_lib::overlay: stop(): closing connection to overlay network
Jun 20 10:00:17.596 INFO stellar_relay_lib::connection::connector::message_reader: poll_messages_from_stellar(): closing receiver during disconnection
Jun 20 10:00:18.115 INFO service: All tasks successfully shut down
Jun 20 10:00:18.115 INFO service: Restarting in 30 seconds
Comparing it to the logs of the incidents, it seems like this line is missing:
Jun 20 10:00:17.494 INFO stellar_relay_lib::overlay: stop(): closing connection to overlay network
Maybe this is the task that is unable to shut down.
@ebma The wait for x tasks to shut down only covers tasks that call ShutdownSender's fn subscribe().
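To make that point concrete, here is a minimal std-only sketch of a "count only the subscribers" mechanism. `Shutdown`, `Subscription`, `pending_tasks` and `demo_counts` are hypothetical names for illustration; this is not the real ShutdownSender implementation, only a model of the behaviour described above: a task that never subscribed is never waited on, and a subscribed task that never finishes keeps the count above zero.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a shutdown sender; NOT the real `ShutdownSender`.
struct Shutdown {
    token: Arc<()>,
}

/// Held by a task for as long as it is running; dropping it marks the task done.
struct Subscription {
    _token: Arc<()>,
}

impl Shutdown {
    fn new() -> Self {
        Shutdown { token: Arc::new(()) }
    }

    /// Registers the caller as a task that must finish before a restart.
    fn subscribe(&self) -> Subscription {
        Subscription { _token: Arc::clone(&self.token) }
    }

    /// Number of subscriptions still alive, i.e. tasks still being waited on.
    /// The sender itself holds one reference, hence the `- 1`.
    fn pending_tasks(&self) -> usize {
        Arc::strong_count(&self.token) - 1
    }
}

/// Two tasks subscribe, then finish one after the other.
fn demo_counts() -> (usize, usize, usize) {
    let shutdown = Shutdown::new();
    let a = shutdown.subscribe();
    let b = shutdown.subscribe();
    let before = shutdown.pending_tasks();
    drop(a); // first task finished
    let after_one = shutdown.pending_tasks();
    drop(b); // second task finished
    let after_all = shutdown.pending_tasks();
    (before, after_one, after_all)
}
```

In this model, a loop that holds its `Subscription` and never exits would keep `pending_tasks()` at 1 forever, which would match the "waits indefinitely" symptom.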
Yes, I think it's related to the stellar_relay_lib; specifically
https://github.com/pendulum-chain/spacewalk/blob/53f81a2dac91d523eaf7178d01a402a0fa536d55/clients/vault/src/oracle/agent.rs#L87-L135
I am trying to rewrite the code to handle this better. This was difficult from the start. See the attempt to stop this loop with 2 ShutdownSender subscriptions: one at line 87 and the on_shutdown(... at line 130.
Hmm you are right, that's likely the problem 🤔
Context
For some non-recoverable errors, the vault client tries to restart. Before restarting, it will wait for pending tasks to shut down. It seems like not all tasks are receiving the shutdown signal, or maybe they do but are still stuck. This causes the vault client to wait indefinitely (not even the periodic restart will work here).
The following incidents happened on lower-spec machines, so maybe they are more likely to occur when the clients don't have many resources available.
TODO
Try to find the tasks that are not successfully shut down.
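One possible way to approach this, sketched with std threads and channels rather than the vault's actual async runtime: give every task a name, have it send that name as an acknowledgement when it has shut down, and bound the wait with a timeout so the names still pending can be logged. Everything here (`wait_for_shutdown`, `demo`, the task names) is hypothetical and only meant to illustrate the idea.

```rust
use std::collections::BTreeSet;
use std::sync::mpsc;
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: waits until every expected task has acknowledged
/// shutdown or `timeout` elapses, and returns the names still pending.
fn wait_for_shutdown(
    acks: &mpsc::Receiver<String>,
    expected: &[&str],
    timeout: Duration,
) -> Vec<String> {
    let mut pending: BTreeSet<String> = expected.iter().map(|s| s.to_string()).collect();
    let deadline = Instant::now() + timeout;
    while !pending.is_empty() {
        let remaining = deadline.saturating_duration_since(Instant::now());
        match acks.recv_timeout(remaining) {
            Ok(name) => {
                pending.remove(&name);
            }
            // Timed out, or every sender is gone: stop waiting either way.
            Err(_) => break,
        }
    }
    pending.into_iter().collect()
}

/// Demo: two tasks acknowledge, one (standing in for a stuck loop) never does.
fn demo() -> Vec<String> {
    let (ack_tx, ack_rx) = mpsc::channel::<String>();
    for name in ["wallet", "issue_listener"] {
        let tx = ack_tx.clone();
        thread::spawn(move || {
            // A well-behaved task: finishes its work, then acknowledges.
            thread::sleep(Duration::from_millis(10));
            let _ = tx.send(name.to_string());
        });
    }
    // A task that holds its sender but never acknowledges.
    let stuck_tx = ack_tx.clone();
    thread::spawn(move || {
        let _keep_alive = stuck_tx;
        thread::sleep(Duration::from_secs(5));
    });
    drop(ack_tx);
    wait_for_shutdown(
        &ack_rx,
        &["wallet", "issue_listener", "oracle_agent"],
        Duration::from_millis(500),
    )
}
```

Calling `demo()` returns the names that never acknowledged; logging that list before the restart would show which task to look at, and the timeout keeps the restart from blocking forever.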
Incident 1
Incident 2
Incident 3