pendulum-chain / spacewalk

Apache License 2.0
34 stars 7 forks source link

451 reduce the "already-connected peer" error logs from appearing #454

Closed b-yap closed 9 months ago

b-yap commented 9 months ago

closes #451

General overview of the changes:

  1. This PR does not entirely eliminate the already-connected peer; it's more on how to handle it. It is explained in users.rust-lang.org and stackexchange that closing the stream does not necessarily mean the connection is immediately cut. A certain waiting time has to be allowed before reconnecting with the same tuple (address, port, etc).

    I started with 10 or 15 seconds but both were too early. I've tested 20 seconds and reconnection was successful. This wait time increases by 10 seconds, if the reconnect action is called too often (less than 30 minutes). It becomes 30 seconds, then 40, 50, etc. However should reconnection happen after a long time (more than 30 minutes), waiting time reverts back to 20 seconds.

  2. tokio::select is disruptive:

    Waits on multiple concurrent branches, returning when the first branch completes, cancelling the remaining branches.

    https://github.com/pendulum-chain/spacewalk/blob/13d25ae642cf43a65ebcb6028f5f0bc5b4e92160/clients/vault/src/oracle/agent.rs#L102-L111 The current implementation is waiting on a listen() which constantly/rapidly receives messages from the Stellar Node. The recv() happens when user (or the inner workings inside the agent) wants to send a message to the Node, and it does not occur as plenty as listen(). On every loop, the listen() completes too fast for the recv() to be acknowledged. Here are where the messages (which recv() should handle) are sent: https://github.com/pendulum-chain/spacewalk/blob/13d25ae642cf43a65ebcb6028f5f0bc5b4e92160/clients/vault/src/oracle/collector/proof_builder.rs#L96-L99 https://github.com/pendulum-chain/spacewalk/blob/13d25ae642cf43a65ebcb6028f5f0bc5b4e92160/clients/vault/src/oracle/collector/handler.rs#L39

    These I found, are one of the major results of tokio::select in the integration tests: Proof should be available: ProofTimeout("Timeout elapsed for building proof of slot... could not find event: Redeem::ExecuteRedeem..

    Since the tokio::select block of code is already inside the loop, it makes sense to just await each of these calls, without a need to choose between them.

  3. Overhaul of the stellar-relay-lib is required, to simplify message sending/receiving between the user/agent and the Stellar Node.


How to begin the review:

  1. clients/service/src/lib.rs -> implementation as mentioned on the 1st point.
  2. clients/vault/src/oracle/agent.rs
    • handle_message() -> accepts StellarMessage instead of StellarRelayMessage. On the 3rd point: No more extra enums.
    • start_oracle_agent()
      • the StellarOverlayConnection has a sender that will send messages to Stellar Node. Instead of creating new sender/receiver channels, we utilize a direct sender.
      • On the 2nd point: since a direct sender is used, we don't need to do a tokio::select.
      • The disconnect_signal_sender and receiver are to signal the overlay connection inside the thread to DISCONNECT from the Stellar Node, if a shutdown is triggered.
  3. clients/stellar-relay-lib/src/overlay.rs
    • replaces the current _overlay_connection.rs_
    • StellarOverlayConnection
      • has 2 fields:
        • sender - to send messages to Stellar Node
        • receiver - receive messages from Stellar Node
      • the connect() replaces the fn connect() of clients/stellar-relay-lib/src/connection/overlay_connection.rs
        • instead of 2 spawned threads, there is only 1: poll_messages_from_stellar().
      • contains the listen() function called by _start_oracle_agent()_ to listen to messages from Stellar Node
  4. clients/stellar-relay-lib/src/connection/connector/message_reader.rs
    • poll_messages_from_stellar()
      • send_to_node_receiver.try_recv() -> listens for messages from the user, and then send that message to Stellar
      • match read_message_from_stellar(... -> listens for messages from Stellar; process it, and then send to user.
      • outside the loop, make sure to close all channels and the TcpStream

The following changes do not need to be reviewed in order:

b-yap commented 9 months ago

@ebma I had to revert from 3 to 2 for the test test_get_proof_for_current_slot because it kept failing on the CI.