[rest] /diagnostics service returns partial results, purges any diagnostics data older than 3 seconds.

ekobres commented 1 year ago

Describe the bug

The Topology view in the Web GUI frequently loads only partial router data, showing an incomplete graph, even in a well-functioning network.

Inspection of the underlying /diagnostics service shows that there are 2 timeouts associated with returning and cleaning diagnostics information:

https://github.com/openthread/ot-br-posix/blob/172dcc145fd662533f7c8b8576cfc5a0c3fe18ca/src/rest/resource.cpp#L78-L82

As I read it, this means a /diagnostics API request will terminate after 2 seconds (fine), and at the callback, any diagnostics data older than 3 seconds will be purged from the list (maybe not fine.)
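To make the interaction of the two timeouts concrete, here is a toy model of the behavior as read above. It is a sketch, not the ot-br-posix code: the function name, the constants' Python spellings, and the idea of tracking one arrival time per router are all assumptions made for illustration.

```python
# Toy model of the two timeouts as described above; NOT the ot-br-posix code.
DIAG_COLLECT_TIMEOUT_S = 2.0  # how long a request waits for responses
DIAG_RESET_TIMEOUT_S = 3.0    # entries at least this old are purged


def surviving_entries(arrivals, request_time,
                      collect_timeout=DIAG_COLLECT_TIMEOUT_S,
                      reset_timeout=DIAG_RESET_TIMEOUT_S):
    """Return the router ids still listed when the collect callback fires.

    `arrivals` maps a (hypothetical) router id to the absolute time in
    seconds at which its diagnostic response was received, possibly as a
    result of an earlier request.
    """
    callback_time = request_time + collect_timeout
    return sorted(
        router for router, received_at in arrivals.items()
        if received_at <= callback_time                   # arrived in time
        and callback_time - received_at < reset_timeout   # not yet purged
    )


arrivals = {"r1": 10.5, "r2": 11.9, "r3": 14.0, "r4": 8.5}
print(surviving_entries(arrivals, request_time=10.0))
print(surviving_entries(arrivals, request_time=10.0,
                        collect_timeout=20.0, reset_timeout=30.0))
```

With the 2 and 3 second values, the slow responder (r3) and the entry left over from an earlier request (r4) are both missing from the result. With 20 and 30 seconds, all four entries survive in this model.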

Perhaps the design intent was for 20 seconds and 30 seconds? No notes or docs to clarify - but 2 & 3 seconds seems very optimistic.

Even a networkdiagnostic get ff02::1 0 1 6 call takes 6 seconds to complete on my system.

The actual behavior on my network with 21 nodes results in the Topology map showing a subset of actual network nodes, with as few as zero and as many as 13.

Looking at the code, and at the behavior, it seems there is no way for any information older than 3 seconds to make it back from a call to /diagnostics.

To Reproduce

Run the /diagnostics API on any Thread network where the following command routinely takes longer than 3 seconds to complete:

networkdiagnostic get ff02::1 0 1 6

run: curl http://<your-otbr-instance>:8081/diagnostics

Inspect the ExtAddress entities and note that there are missing items.

or:

Open the topology viewer in the Web GUI and note whether all network nodes are displayed.
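The curl check in the steps above can be scripted so that missing routers are counted rather than spotted by eye. This is a minimal sketch under stated assumptions: it assumes /diagnostics returns a JSON array with one object per router, each carrying an "ExtAddress" field, and it compares that against "NumOfRouter" from /node. The helper names are hypothetical.

```python
import json
from urllib.request import urlopen


def fetch_json(url, timeout=30):
    """GET a URL and decode the JSON body."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def reported_ext_addresses(diagnostics):
    """Collect the distinct ExtAddress values in a /diagnostics response.

    Assumes the response is a JSON array of per-router objects.
    """
    return {entry["ExtAddress"] for entry in diagnostics if "ExtAddress" in entry}


def missing_routers(node, diagnostics):
    """Routers counted by /node but absent from /diagnostics (never negative)."""
    return max(0, node["NumOfRouter"] - len(reported_ext_addresses(diagnostics)))


def check(base_url):
    """Query a live OTBR instance and return the number of missing routers."""
    node = fetch_json(f"{base_url}/node")
    diagnostics = fetch_json(f"{base_url}/diagnostics")
    return missing_routers(node, diagnostics)
```

Calling `check("http://<your-otbr-instance>:8081")` several times in a row should show whether the count of missing routers varies from request to request.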

Git commit id: Any - source is the same since initial check-in.
IEEE 802.15.4 hardware platform: Silicon Labs EFR32 (HA SkyConnect dongle)
Build steps: ARM64
Network topology: Single OTBR on Raspberry Pi 4b + 20 Nanoleaf A9 smart bulbs.

Expected behavior

/diagnostics should enumerate the available information so that it matches the true network topology. Instead, it times out very quickly and returns partial results on anything larger than a trivially small network.

Console/log output

Node API information - note leader and number of routers.

curl http://x.x.x.x:8081/node

{
    "State":    2,
    "NumOfRouter":    21,
    "RlocAddress":    "fd7d:xxxx:xxxx:xxxx:0:ff:fe00:3c18",
    "ExtAddress":    "52954E48XXXXXXXX",
    "NetworkName":    "home-assistant",
    "Rloc16":    15384,
    "LeaderData":    {
        "PartitionId":    1937964234,
        "Weighting":    64,
        "DataVersion":    251,
        "StableDataVersion":    182,
        "LeaderRouterId":    18
    },
    "ExtPanId":    "A683XXXXXXXXXXXX"
}

Topology from same network:

[Screenshot of the Topology view, 2023-03-01]

Additional context

This is presumably only a real problem for the OTBR Web GUI Topology UI, as there is no published documentation for the REST API.

Recommended solution - test larger default values (e.g. a 30 second kDiagResetTimeout and a 20 second kDiagCollectTimeout) and consider adding timeout parameters to the API so apps can determine the response time and data freshness.

Alternative solution - provide configuration parameters for the OTBR Web GUI to adjust these timeouts.
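As a rough illustration of the recommended solution, a per-request timeout could be read from the query string and clamped to a safe range. This is only a sketch: the `timeout` parameter name, the 30 second ceiling, and the helper itself are hypothetical and not part of the existing REST API.

```python
from urllib.parse import urlparse, parse_qs

DEFAULT_COLLECT_S = 2.0   # current collect timeout per the issue text
MAX_COLLECT_S = 30.0      # hypothetical upper bound to keep requests bounded


def collect_timeout_from_url(url, default=DEFAULT_COLLECT_S, maximum=MAX_COLLECT_S):
    """Return the collect timeout in seconds requested via ?timeout=...

    Falls back to `default` when the parameter is absent, unparsable, or
    not a positive number, and caps the result at `maximum`.
    """
    query = parse_qs(urlparse(url).query)
    raw = query.get("timeout", [None])[0]
    if raw is None:
        return default
    try:
        value = float(raw)
    except ValueError:
        return default
    if not value > 0:  # also rejects NaN
        return default
    return min(value, maximum)


print(collect_timeout_from_url("/diagnostics"))
print(collect_timeout_from_url("/diagnostics?timeout=10"))
print(collect_timeout_from_url("/diagnostics?timeout=600"))
```

Clamping keeps a client from holding a request open indefinitely while still letting a UI trade response time for completeness.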

wgtdkp commented 1 year ago

@ekobres Thanks for reporting this issue! I agree that 3 seconds is too short to collect diagnostic info. However, enlarging kDiagCollectTimeout to 20 seconds will result in /diagnostics taking 20 seconds to respond, which is probably not acceptable in general use cases.

Adding an argument to the API to specify the timeouts sounds like a reasonable solution.
The best solution is probably to use a persistent HTTP connection to stream the diagnostic info back to the client, but that requires significantly more effort to support in OTBR Web.

Would you be able to contribute option 1 or 2?

ekobres commented 1 year ago

Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting is completed in less time, then the call will complete sooner. Obviously 20 seconds seems too long for a healthy mesh, but it would normally return much faster. Also, hitting the timeout could return a timeout error with partial data so the UI can hint that diagnostics are slow.

I would be happy to contribute but I have never built OTBR. But let me play around with first getting an OTBR environment set up with nRF52840 dongle - I have one of those and an RPi3b+, so maybe I can figure it out.

wgtdkp commented 1 year ago

> Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting is completed in less time, then the call will complete sooner

The diagnostic request is multicast to all router devices in the mesh, so there will be multiple unicast responses, and we have no mechanism to determine whether we have received responses from all routers since we don't know how many devices there are. So OTBR has to wait for the timeout to try to receive all those responses.

ekobres commented 1 year ago

Maybe I am misunderstanding something then, because it seems the /diagnostics API is getting the number of routers from somewhere. The leader knows the roles of every node in the mesh, especially the routers. Nevertheless - the number of routers is right there in the diagnostics JSON:

"Connectivity": {
            "ParentPriority":   0,
            "LinkQuality3": 1,
            "LinkQuality2": 2,
            "LinkQuality1": 1,
            "LeaderCost":   1,
            "IdSequence":   115,
            "ActiveRouters":    20,
            "SedBufferSize":    1280,
            "SedDatagramCount": 1
        }
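If an expected router count such as ActiveRouters is available before the responses arrive, the collection loop could stop as soon as that many distinct routers have answered and only fall back to the timeout otherwise. The following is a minimal sketch of that idea using an in-memory queue to stand in for incoming responses. It is not how ot-br-posix is structured, and every name in it is hypothetical.

```python
import queue
import time


def collect_responses(responses, expected, timeout_s):
    """Drain `responses` until `expected` distinct senders reply or time runs out.

    `responses` is a queue.Queue of (sender, payload) tuples.
    Returns (results, complete), where `complete` is False when the
    deadline was hit and only partial data was gathered.
    """
    deadline = time.monotonic() + timeout_s
    results = {}
    while len(results) < expected:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            sender, payload = responses.get(timeout=remaining)
        except queue.Empty:
            break
        results[sender] = payload  # a repeated sender overwrites its old entry
    return results, len(results) >= expected


q = queue.Queue()
for rloc16 in (0x3C00, 0x4800, 0x5000):
    q.put((rloc16, {"ActiveRouters": 3}))
results, complete = collect_responses(q, expected=3, timeout_s=20.0)
print(len(results), complete)
```

In this model a healthy mesh returns as soon as the last router answers, and the `complete` flag lets the caller report a timeout together with whatever partial data was collected.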

abtink commented 1 year ago

The recently added mesh-diag APIs and CLI commands can help here:

https://github.com/openthread/openthread/pull/8682.

This adds new APIs which use the underlying net-diag TMF commands to make it easier to discover topology. https://github.com/openthread/openthread/issues/8460 tracks new features (related PRs).

ekobres commented 1 year ago

I have forked and tested some new values which provide a more reliable experience with the Topology page. There is a fair amount of work that could be done to improve this, but with the new timeout values we at least have a GUI that can capture all of the routers in a non-trivial Thread mesh fairly consistently.

I settled on 120 seconds for the kDiagResetTimeout and 10 seconds for the kDiagCollectTimeout. With these values I am able to get all of my mesh with 20 routers to populate with one or two reloads. Previously I was never able to get more than 11 routers to populate.

wgtdkp commented 1 year ago

@ekobres I missed the ActiveRouters! But we still need to wait for the kDiagCollectTimeout period before receiving any responses.

wgtdkp commented 1 year ago

@abtink Yes the new APIs should be more useful, but it probably doesn't help this issue if a RESTful API is required (@ekobres you may want to try Abtin's API if RESTful API isn't mandatory).

ekobres commented 1 year ago

> The recently added mesh-diag APIs and CLI commands can help here:

@wgtdkp Wow. It's fast, too. Thanks for pointing this out!

philipflesher commented 2 months ago

@ekobres coming around on this issue because I have realized the existing web UI topology tool is giving me the same problems as originally described here, with sometimes terribly disconnected graphs.

Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?

jwhui commented 2 months ago

> Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?

It should be possible to leverage the newer Thread Diagnostics capabilities. Contributions are welcome! :D

philipflesher commented 2 months ago

I wish I could contribute on this, but sadly I do not have an OTBR device to deploy code to and test against. :(

Is there any doc on getting a full dev environment set up with a minimal (hopefully inexpensive) device?

jwhui commented 2 months ago

@philipflesher, you can try the OTBR Codelab, which builds on Raspberry Pi:

https://openthread.io/codelabs/openthread-border-router

philipflesher commented 2 months ago

Looks straightforward. I might get on this.

I realize, however, that if I'm going to make changes, I would need to simulate at least a medium-sized network, probably with some delays and failures. Is there any dev path for creating simulated networks that include routers and end devices?

openthread / ot-br-posix

[rest] /diagnostics service returns partial results, purges any diagnostics data older than 3 seconds. #1773