Testing Thali-iOS: BrowserManager throws illegalPeerID

lesn1kk commented 7 years ago

Tried to run multiple times in a row Can shift data via parallel connections test and at some point it was stuck like in infinite loop, because browserManager throws illegalPeerID event that triggers connect retry logic, which tries to connect again and so on. From logs we can see that it looks like it is proper. Logs

enricogior commented 7 years ago

2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"4FC24DF5-68D0-441E-89E1-BA33F8B36327","peerAvailable":false,"portNumber":null,"recreated":true}'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"4FC24DF5-68D0-441E-89E1-BA33F8B36327","peerAvailable":true,"portNumber":null,"recreated":true}'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeTestUtils: 'Issuing multiConnect for 4FC24DF5-68D0-441E-89E1-BA33F8B36327 (syncValue: lSZ62wRvisNhbA5Qcw8QxIeuLyz5SQ0B)'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeTestUtils: 'Got 'multiConnect' callback'
[ThaliCore] BrowserManager.connectToPeer(_:syncValue:completion:) peer:4FC24DF5-68D0-441E-89E1-BA33F8B36327
[ThaliCore] Browser.inviteToConnect(_:sessionConnected:sessionNotConnected:) Peer(uuid: "4FC24DF5-68D0-441E-89E1-BA33F8B36327", generation: 0)
[ThaliCore] BrowserManager.connectToPeer(_:syncValue:completion:) error: illegalPeerID

I suspect the code that throws the exception is this: https://github.com/thaliproject/thali-ios/blob/e5c34a5aeaa442ce0ebdb20102aa2eb404c01437/ThaliCore/MultipeerConnectivity/Browser.swift#L154 The error code is confusing, since in this case it's a valid peer ID, just one not found in the list of available peers. I'll open an issue to change the error description.

The reason why the peer is not anymore in the list of available peers, seems to be that we received an event for peer 4FC24DF5-68D0-441E-89E1-BA33F8B36327 with an available status set to false followed by another event with the status set to true. It could be that the second event has not been correctly handled.

lesn1kk commented 7 years ago

Events for peer 4FC24DF5-68D0-441E-89E1-BA33F8B36327 with available: false and available: true one after another come from thaliMobileNativeWrapper. When the error Could not establish connections occur, the peer is recreated with peerAvailable set to false, and right after that with true, so it would trigger listeners on nonTCPPeerAvailabilityChangedEvent.

Since we are listening for peerAvailabilityChanged events, we are not reacting at all, so we wont try to reconnect. I need to check if this error is caused by latest tests changes or by ThaliCore changes.

enricogior commented 7 years ago

If nonTCPPeerAvailabilityChangedEvent turning off and on is producing consistently it may be caused by one of our components, but it might also be something random that doesn't depend on ThaliCore or the tests. So far I haven't changed anything in ThaliCore that explicitly can cause that off/on to occur. But of course there can be some other code that I changed that has side effects. Anyway, regardless of the cause of the off/on, I guess the test needs to handle the event if it occurs, do you agree?

lesn1kk commented 7 years ago

So this was happening because we manually tried to connect to the peer that was not available anymore. This was caused by inappropriate removing unavailable peers, so we tried to connect to the wrong peer in infinite loop. We do not listen to nonTCPPeerAvailabilityChangedEvent but to peerAvailabilityChanged directly, so I don't think we should handle it, since it contains same information. After fixing proper removing from availablePeers array, issue doesn't seem to occur anymore. In the end, the error thrown by ThaliCore was confusing, so it is good that this issue was created.

enricogior commented 7 years ago

@mlesnic there is a different scenario where nonTCPPeerAvailabilityChangedEvent is received by the peer that is trying to connect but the peerAvailabilityChanged event is not emitted. It happens when there is an error on the Advertiser side (a legit advertiser, not a zombie). When the error occurs, the Advertiser closes the current activeRelay and the other peer fails to connect to it. After that the nonTCPPeerAvailabilityChangedEvent is fired first with "peerAvailable":false and then with "peerAvailable":true. But given that the peerAvailabilityChanged event is not emitted, the peer will not retry to connect to the Advertiser since it doesn't know that it is available again. It's a bug that I reproduce quite often. It seems like it should be fixed in thaliMobileNativeWrapper.js that but I'm not sure about it, for example I don't know if it is correct that the states value is supposed to be "non-started" when that event occurs. Maybe @yaronyg can give us some directions.

yaronyg commented 7 years ago

The most important question, I think, is why are we expecting to get peerAvailabilityChanged events when states.started === false?

If states.started === false this either means:

We never called start and so none of the native radios should be on or
We called stop and turned all the radios off as executed here

In either case, we shouldn't be expecting any more nonTCPPeerAvailabilityChangedEvents and hence no peerAvailabilityChanged events (at least from native sources). So isn't the bug that someone turned off the radios and is still expecting to hear about peer availability?

enricogior commented 7 years ago

@yaronyg this is the updated test to be zombies proof, but the problem exists with the original test as well.

Let's look at the original test for simplicity. The test calls startAndListen() and it will not stop listening until the teardown is called.

So to answer your questions:

We called start. [EDIT: I was wrong, the tests in testThaliMobileNative.js don't call start]
We haven't called stop yet.

I would expect that states.started is true throughout the test until teardown is called. But I'm not super familiar with the node layer, so maybe I'm missing something. Bottom line: in the test we want to keep receiving peerAvailabilityChanged events since we haven't established a connection yet, but what I see in the log is that an error occurs while connecting to a valid peer (the error seems to be on the advertiser side and caused by MPCF), after the error the advertiser closes the advertiserRelay effected by that error and then MPCF emits two events that are received by the theliMobileNativeWrapper but will not be received by the test:

2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":false,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":true,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'

I'm going to investigate why the states.started is false at that point, given that, if I understood correctly, it should not be false, right?

EDIT: so I was wrong thinking that the states.started is expected to be true, the test should receive that notification through this registration. I'll keep investigating why in one particular case it doesn't seem to reach the test.

enricogior commented 7 years ago

The logs of the case when the advertiser on device 2 for some internal MPCF error is reset. Device 1 receives two nonTCPPeerAvailabilityChangedEvent notifications that don't reach the test layer.

https://gist.github.com/enricogior/efc9d64c704dcaf905bfdfc5ba8acf45 https://gist.github.com/enricogior/740e6d8da2b1dd491fae81a7ffcba83d

enricogior commented 7 years ago

I was misled by these events:

2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":false,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":true,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'

The problem with those events is that they are bogus events. We will have to look into that but they are not currently a concern in regard to the fixes for the iOS Native layer.

When the connection to a valid peer fails during the initial handshaking, we should not remove that peer from the available peers since it's still an active advertiser that we can connect to.

In the test we remove the peer and therefore, if there aren't other available peers, the test will fail with a timeout error.

The trick here is to keep the peer and move it to the end of the list of available peers. In that way, if the peer is a zombie advertiser it will be eventually removed by peerAvailabilityChangedHandler(), but if it's a valid peer we will retry connecting to it if it's the only available peer.

A simple fix is just to add availablePeers.push(peer); right after removeFromAvailablePeers(peer); and before calling tryToConnect();

thaliproject / Thali_CordovaPlugin

Testing Thali-iOS: BrowserManager throws illegalPeerID #1904