pendulum-chain / spacewalk

Apache License 2.0
34 stars 7 forks source link

Vault client indefinitely waiting for other tasks to shut down #538

Closed ebma closed 2 months ago

ebma commented 2 months ago

Context

For some non-recoverable errors, the vault client tries to restart. Before restarting, it will wait for pending tasks to shut down. It seems like not all tasks are receiving the shutdown signal, or maybe they do but are still stuck. This causes the vault client to wait indefinitely (not even the periodic restart will work here).

The following incidents happened on lower-spec machines, so maybe they are more likely to occur when the clients don't have much resources available.

TODO

Try to find the tasks that are not successfully shut down.

Incident 1

root@ip-172-31-88-54 ~]# kubectl logs spacewalk-amplitude-tzs-05-65ccff6fdd-m5h9n  -n spacewalk -f
[2024-06-12T10:00:25Z INFO  runner] Connected to the parachain
[2024-06-12T10:00:25Z INFO  runner::runner] Downloading vault-amplitude at: "./vault-amplitude"
[2024-06-12T10:00:25Z INFO  runner::runner] Fetching executable from https://github.com/pendulum-chain/spacewalk/releases/download/v1.0.8/vault-amplitude
[2024-06-12T10:00:25Z INFO  runner::runner] Client started, with pid 14
Jun 12 10:00:25.886  INFO vault: Starting Prometheus exporter at http://0.0.0.0:9616
Jun 12 10:00:25.886  INFO Server::run{addr=0.0.0.0:9616}: warp::server: listening on http://0.0.0.0:9616
Jun 12 10:00:25.889  INFO vault::process: Creating PID file at: /tmp/amplitude_5H1NjwFazxmz1sLJU3Hv3K2HmkYPrEd6h3jZRYkPS4wUcjhN.pid
Jun 12 10:00:25.895  INFO service: Version: 1.0.8
Jun 12 10:00:25.895  INFO service: Vault uses Substrate account with ID: 6mpGcpNdmKttMkQG42VGhq1fU2gofz4uXA3XeoPuNtrnqZLM
Jun 12 10:00:25.895  INFO runtime::conn: Connecting to the spacewalk-parachain...    
Jun 12 10:00:26.079  INFO runtime::conn: Connected!    
Jun 12 10:00:26.417  INFO runtime::rpc: spec_name="amplitude"    
Jun 12 10:00:26.417  INFO runtime::rpc: spec_version=17    
Jun 12 10:00:26.417  INFO runtime::rpc: transaction_version=13    
Jun 12 10:00:26.419  INFO wallet::cache: Caching stellar transactions at .//GD7Y5F3NSZWJZ553PZ5ACAQGDQ4BYKUXYIBIRDECPNO2VBDHUBYCWKLU_true/txs
Jun 12 10:00:37.079  INFO vault::system: Got new block at height 3994978
Jun 12 10:00:37.079  INFO vault::system: Starting client service...
Jun 12 10:00:37.133  INFO vault::system: Not registering public key -- already registered
Jun 12 10:00:37.186  INFO vault::system: [6mpGcpNdmKttMkQG42VGhq1fU2gofz4uXA3XeoPuNtrnqZLM[XCM(0)->{ code: TZS, issuer: GA2MSSZKJOU6RNL3EJKH3S5TB5CDYTFQFWRYFGUJVIN5I6AOIRTLUHTO }]] Not registering vault -- already registered
Jun 12 10:00:37.385  INFO vault::system: Adding vault with ID: VaultId { account_id: daa314a1d9cf7438138521306850ab8f8f20720212358de332bf2ab4f56dfb03 (5H1NjwFa...), currencies: VaultCurrencyPair { collateral: XCM(0), wrapped: { code: TZS, issuer: GA2MSSZKJOU6RNL3EJKH3S5TB5CDYTFQFWRYFGUJVIN5I6AOIRTLUHTO } } }
Jun 12 10:00:37.385  INFO vault::oracle::agent: start_oracle_agent(): Starting connection to Stellar overlay network...
Jun 12 10:00:37.387  INFO stellar_relay_lib::config: connection_info(): Connecting to Stellar overlay network using public key: GD7Y5F3NSZWJZ553PZ5ACAQGDQ4BYKUXYIBIRDECPNO2VBDHUBYCWKLU
Jun 12 10:00:37.387  INFO stellar_relay_lib::overlay: connect(): connecting to ConnectionInfo { address: "85.190.254.217", port: 11625, secret_key: "****", auth_cert_expiration: 0, receive_tx_messages: false, receive_scp_messages: true, remote_called_us: false, timeout_in_seconds: 10 }
Jun 12 10:00:37.489  INFO stellar_relay_lib::connection::connector::message_reader: poll_messages_from_stellar(): started.
Jun 12 10:00:37.601  INFO stellar_relay_lib::connection::connector::message_handler: process_stellar_message(): Hello message processed successfully
Jun 12 10:00:38.291  INFO vault::requests::execution: Processing valid Stellar payment for open Redeem request #0x9d41…26da: 
Jun 12 10:00:38.291  INFO vault::system: Done processing open requests
Jun 12 10:00:38.291  WARN vault::oracle::collector::proof_builder: build_proof(): Couldn't build proof for slot 52076066 due to missing envelopes
Jun 12 10:00:38.372 ERROR vault::oracle::collector::proof_builder: get_envelopes_from_horizon_archive(): Could not get SCPArchive for slot 52076066 from Horizon Archive: StdIoError(Custom { kind: InvalidInput, error: "invalid gzip header" })
Jun 12 10:00:40.929  INFO vault::oracle::collector::proof_builder: get_envelopes_from_horizon_archive(): Adding 23 archived SCP envelopes for slot 52076066 to envelopes map. 23 are externalized
Jun 12 10:00:48.292  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Fetching TxSet for slot 52076066 from horizon archive
Jun 12 10:00:48.292  WARN vault::oracle::collector::proof_builder: get_txset(): Proof Building for slot 52076066: no txset found
Jun 12 10:00:48.687  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Adding archived tx set for slot 52076066
Jun 12 10:00:55.518  INFO vault::system: Starting all services...
Jun 12 10:00:57.165  INFO wallet::horizon::horizon: Adding transaction Ok("14d9ef19ce9f4f457601cb923e8c023942a82c643af10ca26864b2bd5e3d05d9") with slot 52075927 to the ledger_env_map
Jun 12 10:00:57.165  INFO wallet::horizon::horizon: Adding transaction Ok("400e0f32db24aaf517853a019c161d06354dccd7fdb2f2b3a67d575b97cfa098") with slot 52083412 to the ledger_env_map
Jun 12 10:00:58.294  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 52076066
Jun 12 10:01:00.520  WARN vault::oracle::collector::proof_builder: build_proof(): Couldn't build proof for slot 52075927 due to missing envelopes
Jun 12 10:01:00.520  WARN vault::oracle::collector::proof_builder: build_proof(): Couldn't build proof for slot 52083412 due to missing envelopes
Jun 12 10:01:00.583 ERROR vault::oracle::collector::proof_builder: get_envelopes_from_horizon_archive(): Could not get SCPArchive for slot 52075927 from Horizon Archive: StdIoError(Custom { kind: InvalidInput, error: "invalid gzip header" })
Jun 12 10:01:00.618  INFO vault::oracle::collector::proof_builder: get_envelopes_from_horizon_archive(): Adding 22 archived SCP envelopes for slot 52083412 to envelopes map. 22 are externalized
Jun 12 10:01:01.071  INFO vault::oracle::collector::proof_builder: get_envelopes_from_horizon_archive(): Adding 22 archived SCP envelopes for slot 52075927 to envelopes map. 22 are externalized
Jun 12 10:01:10.521  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Fetching TxSet for slot 52075927 from horizon archive
Jun 12 10:01:10.521  WARN vault::oracle::collector::proof_builder: get_txset(): Proof Building for slot 52075927: no txset found
Jun 12 10:01:10.521  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Fetching TxSet for slot 52083412 from horizon archive
Jun 12 10:01:10.521  WARN vault::oracle::collector::proof_builder: get_txset(): Proof Building for slot 52083412: no txset found
Jun 12 10:01:10.765  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Adding archived tx set for slot 52083412
Jun 12 10:01:10.980  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Adding archived tx set for slot 52075927
Jun 12 10:01:20.524  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 52083412
Jun 12 10:01:20.526  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 52075927
Jun 12 10:01:29.200  INFO vault::requests::structs: Successfully executed Redeem request #0x9d41…26da
Jun 12 10:01:29.200  INFO vault::requests::execution: Successfully executed Redeem request #0x9d41…26da
Jun 12 10:02:06.613  INFO vault::issue: Successfully executed Issue #0x970c5c199364cc68e3b18525fcdf1d46a171a6292f6acb7cd5b9f1bd558898de for slot 52075927
Jun 12 10:06:20.526  WARN runtime::rpc: Timeout on transaction submission - restart required    
Jun 12 10:06:20.530  INFO vault::system: try_shutdown_wallet(): stop the resubmission scheduler
Jun 12 10:06:20.530  WARN service: Disconnected
Jun 12 10:06:20.530 ERROR service: Waiting for 1 tasks to shut down...
Jun 12 10:06:21.532 ERROR service: Waiting for 1 tasks to shut down...

Incident 2

[root@ip-172-31-88-54 ~]# kubectl logs spacewalk-amplitude-usdc-01-5547c4d656-6qp2r  -n spacewalk -f
[2024-06-12T09:30:31Z INFO  runner] Connected to the parachain
[2024-06-12T09:30:31Z INFO  runner::runner] Downloading vault-amplitude at: "./vault-amplitude"
[2024-06-12T09:30:31Z INFO  runner::runner] Fetching executable from https://github.com/pendulum-chain/spacewalk/releases/download/v1.0.8/vault-amplitude
Jun 12 09:30:32.114  INFO vault: Starting Prometheus exporter at http://0.0.0.0:9616
Jun 12 09:30:32.115  INFO Server::run{addr=0.0.0.0:9616}: warp::server: listening on http://0.0.0.0:9616
[2024-06-12T09:30:32Z INFO  runner::runner] Client started, with pid 13
Jun 12 09:30:32.118  INFO vault::process: Creating PID file at: /tmp/amplitude_5GnBJ2pg26hAqR4jb2xfub26LHPVCzAFS5JnggAc8g3mZiDP.pid
Jun 12 09:30:32.120  INFO service: Version: 1.0.8
Jun 12 09:30:32.120  INFO service: Vault uses Substrate account with ID: 6mb5AuwinTp5BJ8hB2A2a71U2ZXu2jc4GBckuvp85Vy5nb1s
Jun 12 09:30:32.120  INFO runtime::conn: Connecting to the spacewalk-parachain...    
Jun 12 09:30:32.256  INFO runtime::conn: Connected!    
Jun 12 09:30:32.502  INFO runtime::rpc: spec_name="amplitude"    
Jun 12 09:30:32.502  INFO runtime::rpc: spec_version=17    
Jun 12 09:30:32.502  INFO runtime::rpc: transaction_version=13    
Jun 12 09:30:32.505  INFO wallet::cache: Caching stellar transactions at .//GCODYRWTVB5EXKLAU2COTFWFI5AU7FAQ7BWJVRWP4XBCVDB6AMNFUZGR_true/txs
Jun 12 09:30:36.813  INFO vault::system: Got new block at height 3994881
Jun 12 09:30:36.813  INFO vault::system: Starting client service...
Jun 12 09:30:36.872  INFO vault::system: Not registering public key -- already registered
Jun 12 09:30:36.935  INFO vault::system: [6mb5AuwinTp5BJ8hB2A2a71U2ZXu2jc4GBckuvp85Vy5nb1s[XCM(0)->{ code: USDC, issuer: GA5ZSEJYB37JRC5AVCIA5MOP4RHTM335X2KGX3IHOJAPP5RE34K4KZVN }]] Not registering vault -- already registered
Jun 12 09:30:37.150  INFO vault::system: Adding vault with ID: VaultId { account_id: d0925b8a6233c50f35eab7236336b4a90a4ec0435bb45d0e525918a62fbb4034 (5GnBJ2pg...), currencies: VaultCurrencyPair { collateral: XCM(0), wrapped: { code: USDC, issuer: GA5ZSEJYB37JRC5AVCIA5MOP4RHTM335X2KGX3IHOJAPP5RE34K4KZVN } } }
Jun 12 09:30:37.151  INFO vault::oracle::agent: start_oracle_agent(): Starting connection to Stellar overlay network...
Jun 12 09:30:37.153  INFO stellar_relay_lib::config: connection_info(): Connecting to Stellar overlay network using public key: GCODYRWTVB5EXKLAU2COTFWFI5AU7FAQ7BWJVRWP4XBCVDB6AMNFUZGR
Jun 12 09:30:37.153  INFO stellar_relay_lib::overlay: connect(): connecting to ConnectionInfo { address: "85.190.254.217", port: 11625, secret_key: "****", auth_cert_expiration: 0, receive_tx_messages: false, receive_scp_messages: true, remote_called_us: false, timeout_in_seconds: 10 }
Jun 12 09:30:37.257  INFO stellar_relay_lib::connection::connector::message_reader: poll_messages_from_stellar(): started.
Jun 12 09:30:37.361  INFO stellar_relay_lib::connection::connector::message_handler: process_stellar_message(): Hello message processed successfully
Jun 12 09:30:50.305  INFO vault::system: Done processing open requests
Jun 12 09:30:51.069  INFO vault::requests::execution: Failed to process open Redeem request #0x53b2d346a606f97b3e63f3fd11c5f21d01119a49418998c019974279bcd1fcf0 due to error: Deadline has expired
Jun 12 09:30:57.504  INFO vault::system: Starting all services...
Jun 12 09:31:02.856  INFO wallet::horizon::horizon: Adding transaction Ok("7da4fbd8fce4aa15f3e565ac89a15455bce82b5ccf0c3089a493ffe766490879") with slot 52083415 to the ledger_env_map
Jun 12 09:31:07.517  WARN vault::oracle::collector::proof_builder: build_proof(): Couldn't build proof for slot 52083415 due to missing envelopes
Jun 12 09:31:07.845  INFO vault::oracle::collector::proof_builder: get_envelopes_from_horizon_archive(): Adding 22 archived SCP envelopes for slot 52083415 to envelopes map. 22 are externalized
Jun 12 09:31:17.518  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Fetching TxSet for slot 52083415 from horizon archive
Jun 12 09:31:17.518  WARN vault::oracle::collector::proof_builder: get_txset(): Proof Building for slot 52083415: no txset found
Jun 12 09:31:18.067  INFO vault::oracle::collector::proof_builder: get_txset_from_horizon_archive(): Adding archived tx set for slot 52083415
Jun 12 09:31:27.523  INFO vault::oracle::agent: get_proof(): Successfully build proof for slot 52083415
Jun 12 09:32:18.989  INFO vault::issue: Successfully executed Issue #0x4bb291364e0b11623248085998035d34874464a97ea62f29f63f1df44c886540 for slot 52083415
Jun 12 10:00:18.022 ERROR jsonrpsee_core::client::async_client: [backend]: Networking or low-level protocol error: WebSocket connection error: connection closed
Jun 12 10:00:18.022  INFO vault::system: try_shutdown_wallet(): stop the resubmission scheduler
Jun 12 10:00:18.023  WARN service: Disconnected: RuntimeError: Channel closed unexpectedly
Jun 12 10:00:18.024 ERROR service: Waiting for 1 tasks to shut down...
Jun 12 10:00:19.025 ERROR service: Waiting for 1 tasks to shut down...
Jun 12 10:00:20.026 ERROR service: Waiting for 1 tasks to shut down...
Jun 12 10:00:21.028 ERROR service: Waiting for 1 tasks to shut down...

Incident 3

Jun 12 12:00:18.214 DEBUG hyper::client::pool: pooling idle connection for ("https", horizon.stellar.org)
Jun 12 12:00:18.214 DEBUG wallet::horizon::horizon: accessing url: "https://horizon.stellar.org/accounts/GBTK4OUWFW5FKTTT26HFCNIVNQDISLRZNUKP7GSAPLZVV25V7RWTBBQX/transactions?cursor=223716604407029760&limit=200&order=asc"
Jun 12 12:00:18.214 DEBUG hyper::client::pool: reuse idle connection for ("https", horizon.stellar.org)
Jun 12 12:00:18.214 DEBUG hyper::proto::h1::io: flushed 182 bytes
Jun 12 12:00:18.220 DEBUG hyper::proto::h1::io: parsed 7 headers
Jun 12 12:00:18.224 DEBUG hyper::proto::h1::conn: incoming body is content-length (661 bytes)
Jun 12 12:00:18.224 DEBUG hyper::proto::h1::conn: incoming body completed
Jun 12 12:00:18.224 DEBUG hyper::client::pool: pooling idle connection for ("https", horizon.stellar.org)
Jun 12 12:00:22.496 ERROR jsonrpsee_core::client::async_client: [backend]: Networking or low-level protocol error: WebSocket connection error: connection closed
Jun 12 12:00:22.496 DEBUG rustls::common_state: Sending warning alert CloseNotify    
[2024-06-12T12:00:22Z DEBUG rustls::common_state] Sending warning alert CloseNotify
Jun 12 12:00:22.501 DEBUG vault::oracle::agent: stop(): Shutting down OracleAgent...
Jun 12 12:00:22.505  INFO vault::system: try_shutdown_wallet(): stop the resubmission scheduler
Jun 12 12:00:22.505  WARN service: Disconnected: RuntimeError: Channel closed unexpectedly
Jun 12 12:00:22.505 ERROR service: Waiting for 2 tasks to shut down...
Jun 12 12:00:22.505 DEBUG wallet::resubmissions: start_periodic_resubmission_of_transactions_from_cache(): scheduler stopped.
Jun 12 12:00:22.504 DEBUG vault::oracle::agent: start_oracle_agent(): sending signal to shutdown overlay connection...
Jun 12 12:00:23.509 ERROR service: Waiting for 1 tasks to shut down...
[2024-06-12T12:00:23Z WARN  runner::runner] Error reading chain storage for release: SubxtError: Rpc error: RPC error: The background task been terminated because: Networking or low-level protocol error: WebSocket connection error: connection closed; restart required. Retrying...
Jun 12 12:00:24.512 ERROR service: Waiting for 1 tasks to shut down...
Jun 12 12:00:25.518 ERROR service: Waiting for 1 tasks to shut down...
Jun 12 12:00:37.556 ERROR service: Waiting for 1 tasks to shut down...
Jun 12 12:00:52.636 ERROR service: Waiting for 1 tasks to shut down...
ebma commented 2 months ago

@pendulum-chain/product this is a bug that seems to occur more frequently when running the client in a cluster. We should try to investigate and fix this soon.

ebma commented 2 months ago

@b-yap I checked the logs of a successful restart on a Pendulum vault and it looks like this

Jun 20 09:00:13.449  INFO stellar_relay_lib::connection::connector::message_reader: poll_messages_from_stellar(): started.
Jun 20 09:00:13.463  INFO stellar_relay_lib::connection::connector::message_handler: process_stellar_message(): Hello message processed successfully
Jun 20 09:00:15.459  INFO vault::system: Done processing open requests
Jun 20 09:00:22.957  INFO vault::system: Starting all services...
Jun 20 10:00:17.112 ERROR jsonrpsee_core::client::async_client: [backend]: Networking or low-level protocol error: WebSocket connection error: connection closed
Jun 20 10:00:17.114  INFO vault::system: try_shutdown_wallet(): stop the resubmission scheduler
Jun 20 10:00:17.114  WARN service: Disconnected: RuntimeError: Channel closed unexpectedly
Jun 20 10:00:17.114 ERROR service: Waiting for 2 tasks to shut down...
Jun 20 10:00:17.494  INFO vault::oracle::agent: start_oracle_agent(): disconnect overlay...
Jun 20 10:00:17.494  INFO stellar_relay_lib::overlay: stop(): closing connection to overlay network
Jun 20 10:00:17.494  INFO stellar_relay_lib::overlay: stop(): closing connection to overlay network
Jun 20 10:00:17.596  INFO stellar_relay_lib::connection::connector::message_reader: poll_messages_from_stellar(): closing receiver during disconnection
Jun 20 10:00:18.115  INFO service: All tasks successfully shut down
Jun 20 10:00:18.115  INFO service: Restarting in 30 seconds

and comparing it to the logs of the incidents it seems like this is missing

Jun 20 10:00:17.494  INFO stellar_relay_lib::overlay: stop(): closing connection to overlay network

Maybe this is the task unable to shut down.

b-yap commented 2 months ago

@ebma The waiting of x tasks to shutdown is only for those that are calling ShutdownSender's fn subscribe(). Yes, I think it's related to the stellar_relay_lib; specifically https://github.com/pendulum-chain/spacewalk/blob/53f81a2dac91d523eaf7178d01a402a0fa536d55/clients/vault/src/oracle/agent.rs#L87-L135

I am trying to rewrite the code, on how to better handle this. This was difficult from the start. See the attempt of stopping this loop with 2 ShutdownSender subscription at line 87 and the on_shutdown(... at line 130.

ebma commented 2 months ago

Hmm you are right, that's likely the problem 🤔

b-yap commented 2 months ago

Fixed with https://github.com/pendulum-chain/spacewalk/pull/539