This PR does not entirely eliminate the already-connected peer; it's more on how to handle it.
It is explained in users.rust-lang.org and stackexchange that closing the stream does not necessarily mean the connection is immediately cut.
A certain waiting time has to be allowed before reconnecting with the same tuple (address, port, etc).
I started with 10 or 15 seconds but both were too early. I've tested 20 seconds and reconnection was successful.
This wait time increases by 10 seconds, if the reconnect action is called too often (less than 30 minutes). It becomes 30 seconds, then 40, 50, etc.
However should reconnection happen after a long time (more than 30 minutes), waiting time reverts back to 20 seconds.
tokio::select is disruptive:
Waits on multiple concurrent branches, returning when the first branch completes, cancelling the remaining branches.
These I found, are one of the major results of tokio::select in the integration tests:
Proof should be available: ProofTimeout("Timeout elapsed for building proof of slot...could not find event: Redeem::ExecuteRedeem..
Since the tokio::select block of code is already inside the loop, it makes sense to just await each of these calls, without a need to choose between them.
Overhaul of the stellar-relay-lib is required, to simplify message sending/receiving between the user/agent and the Stellar Node.
the REMOVAL of message wrappers / extra message types
retries -> removed. Reconnection should be implemented outside this struct.
actions_sender -> Removed. This is replaced with a new field: write_stream_overlay which is the write half of the TcpStream
relay_message_sender -> Removed. Relaying message to user should be implemented outside this struct.
redistribute code from services.rs to the message_reader.rs (directly reading messages from the stream).
replace overlay_connection.rs to overlay.rs found outside the connector mod, as this will be for PUBLIC use.
How to begin the review:
clients/service/src/lib.rs -> implementation as mentioned on the 1st point.
clients/vault/src/oracle/agent.rs
handle_message() -> accepts StellarMessage instead of StellarRelayMessage. On the 3rd point: No more extra enums.
start_oracle_agent()
the StellarOverlayConnection has a sender that will send messages to Stellar Node. Instead of creating new sender/receiver channels, we utilize a direct sender.
On the 2nd point: since a direct sender is used, we don't need to do a tokio::select.
The disconnect_signal_sender and receiver are to signal the overlay connection inside the thread to DISCONNECT from the Stellar Node, if a shutdown is triggered.
clients/stellar-relay-lib/src/overlay.rs
replaces the current _overlay_connection.rs_
StellarOverlayConnection
has 2 fields:
sender - to send messages to Stellar Node
receiver - receive messages from Stellar Node
the connect() replaces the fn connect() of clients/stellar-relay-lib/src/connection/overlay_connection.rs
instead of 2 spawned threads, there is only 1: poll_messages_from_stellar().
contains the listen() function called by _start_oracle_agent()_ to listen to messages from Stellar Node
closes #451
General overview of the changes:
This PR does not entirely eliminate the
already-connected peer
; it's more on how to handle it. It is explained in users.rust-lang.org and stackexchange that closing the stream does not necessarily mean the connection is immediately cut. A certain waiting time has to be allowed before reconnecting with the same tuple (address, port, etc).I started with 10 or 15 seconds but both were too early. I've tested 20 seconds and reconnection was successful. This wait time increases by 10 seconds, if the reconnect action is called too often (less than 30 minutes). It becomes 30 seconds, then 40, 50, etc. However should reconnection happen after a long time (more than 30 minutes), waiting time reverts back to 20 seconds.
tokio::select
is disruptive:https://github.com/pendulum-chain/spacewalk/blob/13d25ae642cf43a65ebcb6028f5f0bc5b4e92160/clients/vault/src/oracle/agent.rs#L102-L111 The current implementation is waiting on a
listen()
which constantly/rapidly receives messages from the Stellar Node. Therecv()
happens when user (or the inner workings inside the agent) wants to send a message to the Node, and it does not occur as plenty aslisten()
. On every loop, thelisten()
completes too fast for therecv()
to be acknowledged. Here are where the messages (whichrecv()
should handle) are sent: https://github.com/pendulum-chain/spacewalk/blob/13d25ae642cf43a65ebcb6028f5f0bc5b4e92160/clients/vault/src/oracle/collector/proof_builder.rs#L96-L99 https://github.com/pendulum-chain/spacewalk/blob/13d25ae642cf43a65ebcb6028f5f0bc5b4e92160/clients/vault/src/oracle/collector/handler.rs#L39These I found, are one of the major results of
tokio::select
in the integration tests:Proof should be available: ProofTimeout("Timeout elapsed for building proof of slot...
could not find event: Redeem::ExecuteRedeem..
Since the
tokio::select
block of code is already inside the loop, it makes sense to justawait
each of these calls, without a need to choose between them.Overhaul of the
stellar-relay-lib
is required, to simplify message sending/receiving between the user/agent and the Stellar Node.Connector
struct:retries
-> removed. Reconnection should be implemented outside this struct.actions_sender
-> Removed. This is replaced with a new field:write_stream_overlay
which is the write half of the TcpStreamrelay_message_sender
-> Removed. Relaying message to user should be implemented outside this struct.connector
mod, as this will be for PUBLIC use.How to begin the review:
clients/service/src/lib.rs
-> implementation as mentioned on the 1st point.clients/vault/src/oracle/agent.rs
handle_message()
-> acceptsStellarMessage
instead ofStellarRelayMessage
. On the 3rd point: No more extra enums.start_oracle_agent()
StellarOverlayConnection
has asender
that will send messages to Stellar Node. Instead of creating new sender/receiver channels, we utilize a direct sender.sender
is used, we don't need to do atokio::select
.disconnect_signal_sender
and receiver are to signal the overlay connection inside the thread to DISCONNECT from the Stellar Node, if a shutdown is triggered.clients/stellar-relay-lib/src/overlay.rs
overlay_connection.rs
_StellarOverlayConnection
sender
- to send messages to Stellar Nodereceiver
- receive messages from Stellar Nodeconnect()
replaces thefn connect()
ofclients/stellar-relay-lib/src/connection/overlay_connection.rs
poll_messages_from_stellar()
.listen()
function called by _start_oracle_agent()
_ to listen to messages from Stellar Nodeclients/stellar-relay-lib/src/connection/connector/message_reader.rs
poll_messages_from_stellar()
send_to_node_receiver.try_recv()
-> listens for messages from the user, and then send that message to Stellarmatch read_message_from_stellar(...
-> listens for messages from Stellar; process it, and then send to user.The following changes do not need to be reviewed in order:
clients/stellar-relay-lib/examples/connect.rs
StellarMessage
, sinceStellarRelayMessage
is removed.clients/stellar-relay-lib/src/config.rs
stellar-relay-lib
will NOT handle retries anymore, hence it is removed.clients/stellar-relay-lib/src/connection/error.rs
AuthSignatureFailed
,AuthFailed
,Timeout
, andVersionStrTooLong
clients/stellar-relay-lib/src/connection/connector/message_handler.rs
Option<StellarMessage>
, instead of sending it through a sender (send_to_user()
is removed, mentioned on the 3rd point).clients/stellar-relay-lib/src/connection/connector/message_sender.rs
send_to_node(...)
accepts aStellarMessage
, converts to xdr and write directly to the write half of the stream (which is a field of theConnector
)clients/stellar-relay-lib/src/connection/connector/mod.rs
ConnectorActions
clients/stellar-relay-lib/src/connection/helper.rs
log_error
_ macro since it's not being used anymore.create_stream()
is a function from theclients/stellar-relay-lib/src/connection/services.rs
clients/stellar-relay-lib/src/connection/mod.rs
StellarRelayMessage
retries
field inConnectionInfo
. Retry is performed inclients/service/src/lib.rs