openthread / ot-br-posix

OpenThread Border Router, a Thread border router for POSIX-based platforms.
https://openthread.io/
BSD 3-Clause "New" or "Revised" License
420 stars 235 forks source link

In src/agent/dtls_mbedtls.cpp - issue with FD_ISSET() #87

Closed DuaneEllis-TI closed 7 years ago

DuaneEllis-TI commented 7 years ago

To start with, I'm not a UDP socket guy - but I think I found a problem dealing with the dtls server in the border router agent, and the commissioning test app

My initial tests (with a fix) seem to be indicate I am correct, but I have not completed my testing.

Step 1: In the main loop, various 'pseudo' tasks are called to set their bits in a select( ... fd_set ... )

This occurs here, in the commissioner test app: https://github.com/openthread/borderrouter/blob/master/tests/meshcop/commissioner.cpp#L655

In the AGENT, it occurs here: https://github.com/openthread/borderrouter/blob/master/src/agent/main.cpp#L68

Step 2: The main loop then calls select, either select times out, or something is readable/writable, etc.

Step 3: The main loop then calls all of the 'pseudo-tasks' - asking them to examine their SelectBits.

I think I see a problem in in the MbedtlsServer::ProcessServer() function, the code does this:

SEE: https://github.com/openthread/borderrouter/blob/master/src/agent/dtls_mbedtls.cpp#L491 snippit below

void MbedtlsServer::ProcessServer( ... )
{
     /* .. snip ... */
     otbrError   error = OTBR_ERROR_ERRNO;
     /* .. snip ... */

    // I THINK THIS IS WRONG...
    VerifyOrExit(recvmsg(mSocket, &msghdr, MSG_PEEK) > 0);

    [ snip ... process the request ...]

exit:
    // abbreviated version...
    if( error ){
           close_socket()
           reopen_socket()
     }
}

I think this is wrong because:

SHORT VERSION

Step A - During the JOINING process, the JOINER sends a RELAY request that is eventually will end up back at the Commissioner

Step B - To do that, when the RELAY request is received, it is decrypted and forwarded -

example in the commissioner test application, here: https://github.com/openthread/borderrouter/blob/master/tests/meshcop/commissioner.cpp#L212

In the agent, it occurs here (tx & rx are next to each other): https://github.com/openthread/borderrouter/blob/master/src/agent/border_agent.cpp#L141

Step C - KEY POINT Depending upon timing, that message may or may-not be ready on the relay socket when select is called, due to this issue, the socket is closed & reopened thus the message may or may not be lost. (without a fix, I see the message lost quite often, but it is moody sometimes it works, sometimes it fails always ... Grr...)

LONG VERSION

FYI: @pkanek and @srickardti

DuaneEllis-TI commented 7 years ago

Update: After quite a bit of testing (30+ join attempts) - the ONLY type of JOIN_FAIL I have found is when the JOINING_ROUTER does not receive the initial discovery request.

What I cannot do right now is test with the ThreadGroup Android app because that app does not function in my home network... why I don't know... The Hurricane Electric Android App, found here:

https://play.google.com/store/apps/details?id=net.he.networktools&hl=en

The HE app can see the Bonjour entry for the Border Router no problems. But the ThreadGroup Android App - cannot see the mDNS entry. I believe this is a bug in the ThreadGroup Android app mDNS system. This only occurs in my home office environment. This is logged in the ThreadGroup Android App JIRA here: https://threadgroup.atlassian.net/browse/COMMAPP-141

Further testing will occur when I am in the office, but now commissioning with the test app is extremely reliable now :+1:

bukepo commented 7 years ago

@DuaneEllis-TI thanks for identifying this issue. After a DTLS session is established, the socket used for listening new DTLS requests is different from the socket bound with the existing session, so re-opening the listening socket should not affect existing sessions. However, the listening socket should not be re-open when it's not readable. Are you going to share your fix with a PR?

DuaneEllis-TI commented 7 years ago

yes, I hope to push a PR very soon.

DuaneEllis-TI commented 7 years ago

See: https://github.com/openthread/borderrouter/pull/90