nasa / SBN


Issues when a disconnected peer tries to reconnect #4

Closed SweeWarman closed 4 years ago

SweeWarman commented 4 years ago

Peer subscriptions are not being reset when a peer is disconnected. This causes issues when the peer tries to reconnect.

Potential fix: Reset peer subscriptions somewhere in this function (https://github.com/nasa/SBN/blob/master/fsw/src/sbn_app.c#L1554):

memset(Peer->Subs, 0, sizeof(SBN_Subs_t) * (SBN_MAX_SUBS_PER_PEER + 1));
Peer->SubCnt = 0;

Separately, the RcvMsg function at https://github.com/nasa/SBN/blob/master/fsw/src/sbn_app.c#L551 should be using CFE_SB_POLL so that it doesn't block.

CDKnightNASA commented 4 years ago

Curious what issues you're seeing when the peer tries to reconnect?

It may be "over-subscribed" if the peer reconnects and does not have as many MIDs subscribed as before, but the peer will just receive MIDs that nobody on the local SB is listening to.

The RcvMsg function blocks IF you're using the multi-task SBN model (controlled via the #defines "SBN_RECV_TASK" and "SBN_SEND_TASK"). If these are defined, each peer gets its own pair of tasks: one blocks on the read from the SB, the other on the read from the network connection to that peer.

SweeWarman commented 4 years ago

The issue we've been having is that, after reconnecting, one node doesn't receive the messages it subscribes to from the other node. I've tried to summarize the issue below, along with my understanding of why this may be happening. Please correct me if I've misunderstood how SBN works.

Let's suppose you have two nodes A and B each running cFS and connected over SBN. A and B connect successfully and start sharing message X across SBN. Now let's say the cFS instance on B crashes and restarts. During this restart, Node A identifies that B has disconnected. When B comes back online, both A and B send each other their local subscriptions. However, when A receives B's subscription for message X, the IsPeerSubMsgId function (https://github.com/nasa/SBN/blob/master/fsw/src/sbn_subs.c#L316) still returns true even though B crashed and disconnected. Consequently, A doesn't send B the requested message X. Please note that Node A still receives message X from B. However, B doesn't receive message X from A. This situation can be resolved by ensuring the peer subscription information is reset when a peer is disconnected.
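To make the scenario above concrete, here is a minimal, self-contained sketch of the stale-table effect. It is not SBN code: `peer_t`, `peer_add_sub`, `peer_disconnect_buggy`, `MAX_SUBS`, and the `forwarding` counter are hypothetical stand-ins, and the sketch assumes that whatever actually routes messages to the peer is lost on disconnect while the subscription table survives.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SUBS 8 /* hypothetical; stands in for SBN_MAX_SUBS_PER_PEER */

/* Hypothetical per-peer state on node A (not the real SBN structures). */
typedef struct {
    uint16_t subs[MAX_SUBS]; /* message IDs the peer has requested */
    size_t   sub_cnt;        /* number of valid entries in subs */
    size_t   forwarding;     /* simulated count of MIDs actually routed to the peer */
} peer_t;

/* Plays the role of a duplicate check such as IsPeerSubMsgId. */
static bool peer_has_sub(const peer_t *p, uint16_t mid)
{
    for (size_t i = 0; i < p->sub_cnt; i++) {
        if (p->subs[i] == mid) {
            return true;
        }
    }
    return false;
}

/* Handle a subscription request from the peer.
 * Returns true only if a new route to the peer was set up. */
static bool peer_add_sub(peer_t *p, uint16_t mid)
{
    assert(p != NULL);
    if (peer_has_sub(p, mid) || p->sub_cnt >= MAX_SUBS) {
        return false; /* treated as a duplicate (or table full): nothing is set up */
    }
    p->subs[p->sub_cnt++] = mid;
    p->forwarding++;
    return true;
}

/* Assumed buggy disconnect: routing is torn down, the table is left as-is. */
static void peer_disconnect_buggy(peer_t *p)
{
    assert(p != NULL);
    p->forwarding = 0;
}
```

With this model, subscribing to message X, disconnecting, and subscribing to X again leaves `forwarding` at 0: the second request is rejected as a duplicate, which mirrors B never receiving X from A after its restart.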

CDKnightNASA commented 4 years ago

I've committed a fix. We simply need to reset SubCnt; the array will still contain the old entries, but they are not used and are overwritten when new subs come in. I am doing this in the SBN_Connected() function.
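The reasoning that resetting the count alone is sufficient can be illustrated with a small sketch. This is an illustration under assumptions, not the committed change: `demo_peer_t`, `demo_is_subscribed`, `demo_subscribe`, `demo_on_connected`, and `DEMO_MAX_SUBS` are hypothetical names, and the sketch assumes every lookup and insert is bounded by the count field, as the comment above describes for SubCnt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_MAX_SUBS 4 /* hypothetical table size */

/* Hypothetical peer subscription table (not the real SBN layout). */
typedef struct {
    uint16_t mids[DEMO_MAX_SUBS];
    size_t   sub_cnt; /* only entries [0, sub_cnt) are considered valid */
} demo_peer_t;

/* Lookup is bounded by sub_cnt, so entries past it are never read. */
static bool demo_is_subscribed(const demo_peer_t *p, uint16_t mid)
{
    for (size_t i = 0; i < p->sub_cnt; i++) {
        if (p->mids[i] == mid) {
            return true;
        }
    }
    return false;
}

/* Append a subscription; returns false for duplicates or a full table. */
static bool demo_subscribe(demo_peer_t *p, uint16_t mid)
{
    assert(p != NULL);
    if (demo_is_subscribed(p, mid) || p->sub_cnt >= DEMO_MAX_SUBS) {
        return false;
    }
    p->mids[p->sub_cnt++] = mid; /* overwrites whatever stale value was there */
    return true;
}

/* Called when the peer (re)connects: forget old subscriptions by
 * resetting the count. The array contents are deliberately left alone. */
static void demo_on_connected(demo_peer_t *p)
{
    assert(p != NULL);
    p->sub_cnt = 0;
}
```

After `demo_on_connected`, the old message IDs are still physically in `mids`, but no lookup can reach them, and the next `demo_subscribe` call writes over slot 0. A full `memset` would also work; it only adds the cost of clearing memory that is never read.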