Closed DuaneEllis-TI closed 7 years ago
Update: After quite a bit of testing (30+ join attempts) - the ONLY type of JOIN_FAIL I have found is when the JOINING_ROUTER does not receive the initial discovery request.
What I cannot do right now is test with the ThreadGroup Android app because that app does not function in my home network... why I don't know... The Hurricane Electric Android App, found here:
https://play.google.com/store/apps/details?id=net.he.networktools&hl=en
The HE app can see the Bonjour entry for the Border Router no problems. But the ThreadGroup Android App - cannot see the mDNS entry. I believe this is a bug in the ThreadGroup Android app mDNS system. This only occurs in my home office environment. This is logged in the ThreadGroup Android App JIRA here: https://threadgroup.atlassian.net/browse/COMMAPP-141
Further testing will occur when I am in the office, but now commissioning with the test app is extremely reliable now :+1:
@DuaneEllis-TI thanks for identifying this issue. After a DTLS session is established, the socket used for listening new DTLS requests is different from the socket bound with the existing session, so re-opening the listening socket should not affect existing sessions. However, the listening socket should not be re-open when it's not readable. Are you going to share your fix with a PR?
yes, I hope to push a PR very soon.
To start with, I'm not a UDP socket guy - but I think I found a problem dealing with the dtls server in the border router agent, and the commissioning test app
My initial tests (with a fix) seem to be indicate I am correct, but I have not completed my testing.
Step 1: In the main loop, various 'pseudo' tasks are called to set their bits in a select( ... fd_set ... )
This occurs here, in the commissioner test app: https://github.com/openthread/borderrouter/blob/master/tests/meshcop/commissioner.cpp#L655
In the AGENT, it occurs here: https://github.com/openthread/borderrouter/blob/master/src/agent/main.cpp#L68
Step 2: The main loop then calls select, either select times out, or something is readable/writable, etc.
Step 3: The main loop then calls all of the 'pseudo-tasks' - asking them to examine their SelectBits.
I think I see a problem in in the
MbedtlsServer::ProcessServer()
function, the code does this:SEE: https://github.com/openthread/borderrouter/blob/master/src/agent/dtls_mbedtls.cpp#L491 snippit below
I think this is wrong because:
SHORT VERSION
Step A - During the JOINING process, the JOINER sends a RELAY request that is eventually will end up back at the Commissioner
Step B - To do that, when the RELAY request is received, it is decrypted and forwarded -
example in the commissioner test application, here: https://github.com/openthread/borderrouter/blob/master/tests/meshcop/commissioner.cpp#L212
In the agent, it occurs here (tx & rx are next to each other): https://github.com/openthread/borderrouter/blob/master/src/agent/border_agent.cpp#L141
Step C - KEY POINT Depending upon timing, that message may or may-not be ready on the relay socket when select is called, due to this issue, the socket is closed & reopened thus the message may or may not be lost. (without a fix, I see the message lost quite often, but it is moody sometimes it works, sometimes it fails always ... Grr...)
LONG VERSION
exit:
label, if an ERROR occurs - the socket is CLOSEDFYI: @pkanek and @srickardti