project-chip / connectedhomeip

Matter (formerly Project CHIP) creates more connections between more objects, simplifying development for manufacturers and increasing compatibility for consumers, guided by the Connectivity Standards Alliance.
https://buildwithmatter.com
Apache License 2.0

[Platform] [Android] FindOperational: mDNS resolving of Thread devices incorrect timeouts #31133

Open caipiblack opened 8 months ago

caipiblack commented 8 months ago

Reproduction steps

Stack commit: 221e466510600543fee1d9fb70f270022e37866e

We found an issue in the Android platform's mDNS resolver implementation, at the FindOperational step.

Description:

During the commissioning of a Thread device, when the device receives the ConnectNetwork command, it initiates the connection to the Thread network. For example:

void GenericThreadDriver::ConnectNetwork(ByteSpan networkId, ConnectCallback * callback)
{
    NetworkCommissioning::Status status = MatchesNetworkId(mStagingNetwork, networkId);

    if (status == Status::kSuccess && BackupConfiguration() != CHIP_NO_ERROR)
    {
        status = Status::kUnknownError;
    }

    if (status == Status::kSuccess &&
        DeviceLayer::ThreadStackMgrImpl().AttachToThreadNetwork(mStagingNetwork, callback) != CHIP_NO_ERROR)
    {
        status = Status::kUnknownError;
    }

    if (status != Status::kSuccess)
    {
        callback->OnResult(status, CharSpan(), 0);
    }
}

The function AttachToThreadNetwork() looks like this:

template <class ImplClass>
CHIP_ERROR GenericThreadStackManagerImpl_OpenThread<ImplClass>::_AttachToThreadNetwork(
    const Thread::OperationalDataset & dataset, NetworkCommissioning::Internal::WirelessDriver::ConnectCallback * callback)
{
    // Reset the previously set callback since it will never be called in case incorrect dataset was supplied.
    mpConnectCallback = nullptr;
    ReturnErrorOnFailure(Impl()->SetThreadEnabled(false));
    ReturnErrorOnFailure(Impl()->SetThreadProvision(dataset.AsByteSpan()));

    if (dataset.IsCommissioned())
    {
        ReturnErrorOnFailure(Impl()->SetThreadEnabled(true));
        mpConnectCallback = callback;
    }

    return CHIP_NO_ERROR;
}

The function only configures the Thread network, enables OpenThread, and returns.

So first, there is no guarantee that the device is reachable immediately after this step. In our case it never is. It is also possible that the device attaches to a new Thread partition for some reason (for example, it does not find any existing Thread partition because of the RF environment).

The problem:

In our system we never find the device on the first discovery, because the device is still attaching to the border router's Thread network and has not yet published its services via SRP/mDNS.

Notes:

We are testing with an "EVE Socket" and also with an esp32-c6 "light example" (over Thread) from esp-matter.

Logs

12-20 13:07:23.960 28968 29058 D NsdManagerServiceResolver: resolve: Starting service resolution for '8BC2DB69281E8303-0000000000000055' type '_matter._tcp'
12-20 13:07:24.213 28968 29921 D NsdServiceFinderAndResolver: Service discovery started. regType: _matter._tcp
12-20 13:07:28.963 28968 30359 D NsdServiceFinderAndResolver: Service discovery timed out after 5000 ms
12-20 13:08:08.961 28968 29925 D DIS     : Checking node lookup status after 45002 ms
12-20 13:08:08.962 28968 29925 E DIS     : OperationalSessionSetup[1:0000000000000055]: operational discovery failed: src/lib/address_resolve/AddressResolve_DefaultImpl.cpp:114: CHIP Error 0x00000032: Timeout
12-20 13:08:08.962 28968 29925 D NsdManagerServiceResolver: resolve: Starting service resolution for '8BC2DB69281E8303-0000000000000055' type '_matter._tcp'
12-20 13:08:09.469 28968 29921 I NsdServiceFinderAndResolver: Discovery stopped: _matter._tcp
12-20 13:08:09.491 28968 29921 I NsdServiceFinderAndResolver: Resolved service '8BC2DB69281E8303-0000000000000055' to /fd77:b6dd:ef9c:1:b7c9:b908:ba63:4a84, type : ._matter._tcp
12-20 13:08:09.510 28968 29921 D DIS     :  ----- entry [0] : T 0
12-20 13:08:09.510 28968 29921 D DIS     :  ----- entry [1] : SAI 300
12-20 13:08:09.510 28968 29921 D DIS     :  ----- entry [2] : SII 5000
12-20 13:08:09.511 28968 29921 D DIS     : Node ID resolved for 8BC2DB69281E8303:0000000000000055
12-20 13:08:09.511 28968 29921 D DIS     :  Hostname: fd77:b6dd:ef9c:1
12-20 13:08:09.511 28968 29921 D DIS     :  IP Address #1: fd77:b6dd:ef9c:1:b7c9:b908:ba63:4a84
12-20 13:08:09.511 28968 29921 D DIS     :  Port: 5540
12-20 13:08:09.511 28968 29921 D DIS     :  Mrp Interval idle: 5000 ms
12-20 13:08:09.511 28968 29921 D DIS     :  Mrp Interval active: 300 ms
12-20 13:08:09.511 28968 29921 D DIS     :  TCP Supported: 0
12-20 13:08:09.513 28968 29921 D DIS     : Lookup clearing interface for non LL address
12-20 13:08:09.513 28968 29921 D DIS     : UDP:[fd77:b6dd:ef9c:1:b7c9:b908:ba63:4a84]:5540: new best score: 4
12-20 13:08:09.513 28968 29921 D DIS     : Checking node lookup status after 551 ms
12-20 13:08:09.513 28968 29921 D DIS     : OperationalSessionSetup[1:0000000000000055]: Updating device address to UDP:[fd77:b6dd:ef9c:1:b7c9:b908:ba63:4a84]:5540 while in state 2
12-20 13:08:09.513 28968 29921 D DIS     : OperationalSessionSetup[1:0000000000000055]: State change 2 --> 3
12-20 13:08:09.514 28968 29921 D IN      : SecureSession[0xb400006fca49ab78]: Allocated Type:2 LSID:40843
12-20 13:08:09.514 28968 29921 D SC      : Initiating session on local FabricIndex 1 from 0x000000000001B669 -> 0x0000000000000055

Platform

android

Platform Version(s)

Android 13 (the version does not matter)

Type

Platform validated

(Optional) If manually tested please explain why this is only manually tested

No response

Anything else?

No response

caipiblack commented 8 months ago

Is it intended that the stack searches for devices with mDNS for 5 s, then waits 40 s before retrying?

Why not search for the full 45 s (actively for a few seconds, then passively) and retry on failure?

Is this specific to the Android implementation? Is it a limitation of the Android API?

Here is an example where the device is not discoverable within 5 s:

https://github.com/orgs/openthread/discussions/9727#discussioncomment-7958203