pendulum-chain / spacewalk


Allow Tokio Console support #517

Closed b-yap closed 3 months ago

b-yap commented 4 months ago

This does not completely close #512, but it helps with tracking.

If we can see which tasks have been running the longest and which have been idle for a long time, it could help determine where and when a vault gets stuck.

The tokio-console is described as:

a debugging and profiling tool for asynchronous Rust applications, which collects and displays in-depth diagnostic data on the asynchronous tasks, resources, and operations in an application.

Using tokio-console, I found a zombie task. When the vault restarts, it shuts down all running tasks except one: https://github.com/pendulum-chain/spacewalk/blob/e7a672ba1e21c98a70df30a6ee458317951dd597/clients/wallet/src/resubmissions.rs#L40-L47
Aside from this, I found that reading from the stream can also get stuck: https://github.com/pendulum-chain/spacewalk/blob/e7a672ba1e21c98a70df30a6ee458317951dd597/clients/stellar-relay-lib/src/connection/connector/message_reader.rs#L82 The read requires a timeout to trigger reconnection.
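
To illustrate the timeout idea, here is a minimal sketch that uses only the standard library instead of the tokio `timeout` used in this PR. The helper `read_with_timeout`, the `ReadOutcome` enum, and the 1-second value of `READ_TIMEOUT_IN_SECS` are assumptions made for illustration; they are not code from the repository.

```rust
use std::io::{self, Read};
use std::net::TcpStream;
use std::time::Duration;

/// Assumed value for illustration only; the real constant may differ.
const READ_TIMEOUT_IN_SECS: u64 = 1;

/// What a single bounded read attempt produced (hypothetical type).
#[derive(Debug, PartialEq)]
enum ReadOutcome {
    /// `n` bytes were read into the buffer.
    Data(usize),
    /// The peer closed the connection.
    Closed,
    /// Nothing arrived within the timeout; the caller may reconnect.
    TimedOut,
}

/// Reads once from `stream`, giving up after `timeout` instead of
/// blocking forever. `timeout` must be non-zero and `buf` non-empty.
fn read_with_timeout(
    stream: &mut TcpStream,
    buf: &mut [u8],
    timeout: Duration,
) -> io::Result<ReadOutcome> {
    stream.set_read_timeout(Some(timeout))?;
    match stream.read(buf) {
        Ok(0) => Ok(ReadOutcome::Closed),
        Ok(n) => Ok(ReadOutcome::Data(n)),
        // Unix reports an expired read timeout as WouldBlock,
        // Windows as TimedOut.
        Err(e)
            if matches!(
                e.kind(),
                io::ErrorKind::WouldBlock | io::ErrorKind::TimedOut
            ) =>
        {
            Ok(ReadOutcome::TimedOut)
        }
        Err(e) => Err(e),
    }
}
```

A caller would treat `TimedOut` as the cue to drop the connection and reconnect, rather than waiting on the read indefinitely.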


How to begin the review:

I have added comments to help with the review.

  1. all affected toml files:
    • tracing feature of tokio is necessary for tokio-console
    • console-subscriber dependency is necessary for tokio-console
  2. clients/wallet/src/stellar_wallet.rs:
    • I needed a channel to stop the task, hence the new field in StellarWallet:
          /// a sender to 'stop' a scheduled resubmission task
          pub(crate) resubmission_end_signal: Option<mpsc::Sender<()>>,
  3. clients/wallet/src/resubmissions.rs:
    • Added a fn stop_periodic_resubmission_of_transactions(&mut self)
    • Updates inside fn start_periodic_resubmission_of_transactions_from_cache(...):
      • create a new channel and store the sender in StellarWallet's resubmission_end_signal field.
      • inside the spawned task, break the loop once the receiver receives a shutdown signal
          loop {
              // a shutdown message was sent. Stop the loop.
              if let Some(_) = receiver.recv().await {
                  break;
              }
              ...
  4. clients/vault/src/system.rs:

    • calls the fn stop_periodic_resubmission_of_transactions(...) on the wallet field of VaultService:
          /// shuts down the resubmission task running in the background
          async fn shutdown_wallet(&self) {
              ...
              let mut wallet = self.stellar_wallet.write().await;
              wallet.stop_periodic_resubmission_of_transactions().await;
              ...
          }
    • calls the fn shutdown_wallet(...) when the service stops:
          async fn start(&mut self) -> Result<(), ServiceError<Error>> {
              let result = self.run_service().await;

              self.shutdown_wallet().await;
              ...
  5. clients/stellar-relay-lib/src/connection/connector/message_reader.rs:
    • introduce READ_TIMEOUT_IN_SECS: how long to wait when reading from the stream.
    • add a timeout to reading the stream
          timeout(
              Duration::from_secs(READ_TIMEOUT_IN_SECS),
              connector.tcp_stream.read(&mut buff_for_reading),
          ).await
  6. clients/README.md:
    • documentation on how to use tokio-console
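
The stop-signal pattern from items 2 to 4 can be sketched with the standard library alone. This is a thread-based analogue under stated assumptions, not the wallet's async implementation: `PeriodicTask`, `start`, and `stop` are hypothetical names, and the real code uses tokio's mpsc channel inside a spawned task. Here `recv_timeout` waits for the interval and the shutdown message in one call; in async code the usual counterpart is a `tokio::select!` over the sleep and the receiver.

```rust
use std::sync::mpsc::{self, RecvTimeoutError, Sender};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for a background job that repeats some work
/// at a fixed interval until it is told to stop.
struct PeriodicTask {
    /// Sender used to 'stop' the background loop.
    end_signal: Option<Sender<()>>,
    /// Handle used to wait for the loop to finish.
    handle: Option<JoinHandle<()>>,
}

impl PeriodicTask {
    /// Spawns a thread that calls `work` once per `interval`.
    fn start<F>(interval: Duration, mut work: F) -> Self
    where
        F: FnMut() + Send + 'static,
    {
        let (sender, receiver) = mpsc::channel::<()>();
        let handle = thread::spawn(move || loop {
            match receiver.recv_timeout(interval) {
                // The interval elapsed without a signal: do the work.
                Err(RecvTimeoutError::Timeout) => work(),
                // A shutdown message arrived, or the sender was
                // dropped: stop the loop.
                Ok(()) | Err(RecvTimeoutError::Disconnected) => break,
            }
        });
        Self { end_signal: Some(sender), handle: Some(handle) }
    }

    /// Sends the shutdown signal and waits for the loop to exit.
    /// Calling it a second time does nothing.
    fn stop(&mut self) {
        if let Some(sender) = self.end_signal.take() {
            // The send fails only if the loop has already exited.
            let _ = sender.send(());
        }
        if let Some(handle) = self.handle.take() {
            let _ = handle.join();
        }
    }
}
```

Because the loop also exits when the sender is dropped, the background job cannot outlive its owner as a zombie even if `stop` is never called.
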

b-yap commented 3 months ago

@ebma Would be great if we have the vaults running with this feature. I bet it'll look very different (and interesting).

ebma commented 3 months ago

True, but I'm not sure if we can easily connect to the clients with tokio-console if they are running in a docker container.

@zoveress in this PR we added support for a profiling tool that lets us investigate resource consumption of tasks within our process. As you can see here, it's fairly simple to run. As far as I understand, the changes we made to the client will make it expose a new endpoint (by default on port 6669) that the tokio-console tool uses to access the debug information. This setup would be very easy to use on the EC2 instances, but maybe not in our Kubernetes cluster. Do you have an idea how we could access it on Kubernetes as well?

This is not super important but would be nice to have.

zoveress commented 3 months ago

Is the endpoint password secured?

ebma commented 3 months ago

No, it's not, and I think it's only supposed to be used locally. Can we expose it to the GitLab runner only and access it from there somehow?