Open lesn1kk opened 7 years ago
2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"4FC24DF5-68D0-441E-89E1-BA33F8B36327","peerAvailable":false,"portNumber":null,"recreated":true}'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"4FC24DF5-68D0-441E-89E1-BA33F8B36327","peerAvailable":true,"portNumber":null,"recreated":true}'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeTestUtils: 'Issuing multiConnect for 4FC24DF5-68D0-441E-89E1-BA33F8B36327 (syncValue: lSZ62wRvisNhbA5Qcw8QxIeuLyz5SQ0B)'
2017-06-12 12:29:15 - DEBUG thaliMobileNativeTestUtils: 'Got 'multiConnect' callback'
[ThaliCore] BrowserManager.connectToPeer(_:syncValue:completion:) peer:4FC24DF5-68D0-441E-89E1-BA33F8B36327
[ThaliCore] Browser.inviteToConnect(_:sessionConnected:sessionNotConnected:) Peer(uuid: "4FC24DF5-68D0-441E-89E1-BA33F8B36327", generation: 0)
[ThaliCore] BrowserManager.connectToPeer(_:syncValue:completion:) error: illegalPeerID
I suspect the code that throws the exception is this: https://github.com/thaliproject/thali-ios/blob/e5c34a5aeaa442ce0ebdb20102aa2eb404c01437/ThaliCore/MultipeerConnectivity/Browser.swift#L154 The error code is confusing, since in this case it's a valid peer ID, just one not found in the list of available peers. I'll open an issue to change the error description.
The reason the peer is no longer in the list of available peers seems to be that we received an event for peer 4FC24DF5-68D0-441E-89E1-BA33F8B36327 with an available status set to false, followed by another event with the status set to true.
It could be that the second event has not been handled correctly.
Events for peer 4FC24DF5-68D0-441E-89E1-BA33F8B36327 with available: false and available: true, one after another, come from thaliMobileNativeWrapper. When the error Could not establish connections occurs, the peer is recreated with peerAvailable set to false and, right after that, with true, so it would trigger listeners on nonTCPPeerAvailabilityChangedEvent.
Since we are listening for peerAvailabilityChanged events, we are not reacting at all, so we won't try to reconnect. I need to check whether this error is caused by the latest test changes or by ThaliCore changes.
If nonTCPPeerAvailabilityChangedEvent is consistently turning off and on, it may be caused by one of our components, but it might also be something random that doesn't depend on ThaliCore or the tests. So far I haven't changed anything in ThaliCore that would explicitly cause that off/on to occur, but of course some other code that I changed could have side effects. Anyway, regardless of the cause of the off/on, I guess the test needs to handle the event if it occurs. Do you agree?
So this was happening because we manually tried to connect to a peer that was no longer available. This was caused by unavailable peers not being removed correctly, so we kept trying to connect to the wrong peer in an infinite loop. We do not listen to nonTCPPeerAvailabilityChangedEvent but to peerAvailabilityChanged directly, so I don't think we should handle it, since it contains the same information.
After fixing the removal of peers from the availablePeers array, the issue doesn't seem to occur anymore.
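The removal fix described above could look roughly like the sketch below. It is not the actual test code: the two-argument signatures, the handler body, and the short peer identifiers are assumptions made for illustration, and peers are assumed to be objects shaped like the event payloads in the logs.

```javascript
// Hypothetical sketch, not the real test helper.
// Removes a peer by identifier. Matching on peerIdentifier rather than on
// object identity matters because every event delivers a fresh object.
function removeFromAvailablePeers(availablePeers, peer) {
  const index = availablePeers.findIndex(
    (p) => p.peerIdentifier === peer.peerIdentifier
  );
  if (index !== -1) {
    availablePeers.splice(index, 1);
  }
  return index !== -1;
}

// Hypothetical handler: drop any stale entry first, then re-add the peer
// only if the event reports it as available.
function peerAvailabilityChangedHandler(availablePeers, peer) {
  removeFromAvailablePeers(availablePeers, peer);
  if (peer.peerAvailable) {
    availablePeers.push(peer);
  }
}

const availablePeers = [];
peerAvailabilityChangedHandler(availablePeers, { peerIdentifier: 'A', peerAvailable: true });
peerAvailabilityChangedHandler(availablePeers, { peerIdentifier: 'B', peerAvailable: true });
peerAvailabilityChangedHandler(availablePeers, { peerIdentifier: 'A', peerAvailable: false });
console.log(availablePeers.map((p) => p.peerIdentifier)); // [ 'B' ]
```

With removal keyed on the identifier, an unavailable peer cannot linger in the array and be picked again by the connect logic.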
In the end, the error thrown by ThaliCore was confusing, so it is good that this issue was created.
@mlesnic there is a different scenario where nonTCPPeerAvailabilityChangedEvent is received by the peer that is trying to connect but the peerAvailabilityChanged event is not emitted.
It happens when there is an error on the Advertiser side (a legit advertiser, not a zombie). When the error occurs, the Advertiser closes the current activeRelay and the other peer fails to connect to it.
After that, the nonTCPPeerAvailabilityChangedEvent is fired first with "peerAvailable":false and then with "peerAvailable":true.
But given that the peerAvailabilityChanged event is not emitted, the peer will not retry to connect to the Advertiser, since it doesn't know that it is available again.
It's a bug that I reproduce quite often.
It seems like it should be fixed in thaliMobileNativeWrapper.js, but I'm not sure about it. For example, I don't know whether it is correct that the states value is supposed to be "non-started" when that event occurs.
Maybe @yaronyg can give us some directions.
The most important question, I think, is why are we expecting to get peerAvailabilityChanged events when states.started === false?
If states.started === false this either means:
In either case, we shouldn't be expecting any more nonTCPPeerAvailabilityChangedEvents and hence no peerAvailabilityChanged events (at least from native sources). So isn't the bug that someone turned off the radios and is still expecting to hear about peer availability?
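To make the question concrete, here is a minimal sketch of the kind of gate that the 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state' log line suggests. It is not the code in thaliMobileNativeWrapper.js: the states object, the emitter, and the handlePeerAvailabilityChanged function are hypothetical stand-ins built on Node's EventEmitter.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-ins for the wrapper's internal state and emitter.
const states = { started: false };
const emitter = new EventEmitter();

// Forwards a native availability event only while in the started state.
// Returns true if the event was forwarded and false if it was filtered out.
function handlePeerAvailabilityChanged(peer) {
  if (!states.started) {
    return false; // dropped: listeners never see this event
  }
  emitter.emit('nonTCPPeerAvailabilityChangedEvent', peer);
  return true;
}

const received = [];
emitter.on('nonTCPPeerAvailabilityChangedEvent', (peer) => received.push(peer));

const peer = {
  peerIdentifier: 'example-peer', // placeholder identifier
  peerAvailable: true,
  portNumber: null,
  recreated: true
};

const forwardedWhileStopped = handlePeerAvailabilityChanged(peer);
states.started = true;
const forwardedWhileStarted = handlePeerAvailabilityChanged(peer);

console.log(forwardedWhileStopped, forwardedWhileStarted, received.length);
// false true 1
```

Under this assumption, any listener registered downstream of the gate hears nothing while states.started is false, which is why the expected value of that flag matters for the discussion.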
@yaronyg this is the updated test, made zombie-proof, but the problem exists with the original test as well.
Let's look at the original test for simplicity.
The test calls startAndListen() and it will not stop listening until the teardown is called.
So to answer your questions:
I would expect that states.started is true throughout the test until teardown is called.
But I'm not super familiar with the node layer, so maybe I'm missing something.
Bottom line: in the test we want to keep receiving peerAvailabilityChanged events, since we haven't established a connection yet. What I see in the log is that an error occurs while connecting to a valid peer (the error seems to be on the advertiser side and caused by MPCF). After the error, the advertiser closes the advertiserRelay affected by that error, and then MPCF emits two events that are received by thaliMobileNativeWrapper but will not be received by the test:
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":false,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":true,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
I'm going to investigate why states.started is false at that point, given that, if I understood correctly, it should not be false, right?
EDIT: so I was wrong in thinking that states.started is expected to be true; the test should receive that notification through this registration. I'll keep investigating why in one particular case it doesn't seem to reach the test.
These are the logs for the case in which the advertiser on device 2 is reset because of an internal MPCF error.
Device 1 receives two nonTCPPeerAvailabilityChangedEvent notifications that don't reach the test layer.
https://gist.github.com/enricogior/efc9d64c704dcaf905bfdfc5ba8acf45 https://gist.github.com/enricogior/740e6d8da2b1dd491fae81a7ffcba83d
I was misled by these events:
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":false,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Received peer availability changed event with {"peerIdentifier":"C4DA215A-9F31-4CB5-A8CC-7F0F7003166D","peerAvailable":true,"portNumber":null,"recreated":true}'
2017-06-14 13:42:29 - DEBUG thaliMobileNativeWrapper: 'Filtered out nonTCPPeerAvailabilityChangedEvent due to not being in started state'
The problem with those events is that they are bogus events. We will have to look into that but they are not currently a concern in regard to the fixes for the iOS Native layer.
When the connection to a valid peer fails during the initial handshaking, we should not remove that peer from the available peers since it's still an active advertiser that we can connect to.
In the test we remove the peer and therefore, if there aren't other available peers, the test will fail with a timeout error.
The trick here is to keep the peer and move it to the end of the list of available peers.
That way, if the peer is a zombie advertiser, it will eventually be removed by peerAvailabilityChangedHandler(), but if it's a valid peer, we will retry connecting to it if it's the only available peer.
A simple fix is just to add availablePeers.push(peer); right after removeFromAvailablePeers(peer); and before calling tryToConnect();
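The fix above can be sketched as follows. Only the three calls (removeFromAvailablePeers, availablePeers.push, tryToConnect) come from the description; the onConnectFailed wrapper, the helper body, the injected tryToConnect callback, and the short peer identifiers are assumptions made so the example runs on its own.

```javascript
// Hypothetical sketch of the requeue idea, not the actual test code.
const availablePeers = [];

// Assumed helper: removes the peer by identifier if it is present.
function removeFromAvailablePeers(peer) {
  const index = availablePeers.findIndex(
    (p) => p.peerIdentifier === peer.peerIdentifier
  );
  if (index !== -1) {
    availablePeers.splice(index, 1);
  }
}

// Assumed entry point, called when connecting to `peer` fails during the
// initial handshake. tryToConnect is passed in to keep the sketch standalone.
function onConnectFailed(peer, tryToConnect) {
  removeFromAvailablePeers(peer);
  availablePeers.push(peer); // keep the peer, but move it to the end
  tryToConnect();
}

availablePeers.push({ peerIdentifier: 'A' }, { peerIdentifier: 'B' });

const attempted = [];
// Stand-in for tryToConnect: records which peer is now first in line.
onConnectFailed(availablePeers[0], () => {
  attempted.push(availablePeers[0].peerIdentifier);
});

console.log(availablePeers.map((p) => p.peerIdentifier)); // [ 'B', 'A' ]
console.log(attempted); // [ 'B' ]
```

If the failed peer is the only entry, it ends up first in line again and is retried; otherwise the other peers get a turn before it.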
I ran the Can shift data via parallel connections test multiple times in a row, and at some point it got stuck in what looks like an infinite loop, because browserManager throws an illegalPeerID error that triggers the connect retry logic, which tries to connect again, and so on. From the logs, the peer ID appears to be valid.
Logs