Open ekobres opened 1 year ago
@ekobres Thanks for reporting this issue! I agree that 3 seconds is too short to collect diagnostic info. Enlarging kDiagCollectTimeout to 20 seconds would make every /diagnostics request take 20 seconds to respond, which is probably not acceptable for general use cases.
Would you be able to contribute option 1 or 2?
Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting is completed in less time, then the call will complete sooner. Obviously 20 seconds seems too long for a healthy mesh, but it would return much faster normally. Also, hitting the timeout could return a timeout error with partial data so the UI can hint that diagnostics are slow.
I would be happy to contribute but I have never built OTBR. But let me play around with first getting an OTBR environment set up with nRF52840 dongle - I have one of those and an RPi3b+, so maybe I can figure it out.
Correct me if I am wrong, but the timeout is only the maximum time. If diagnostics reporting is completed in less time, then the call will complete sooner
The diagnostic request is multicast to all router devices in the mesh, so there will be multiple unicast responses, and we have no mechanism to determine whether we have received responses from all routers, since we don't know how many devices there are. So OTBR has to wait for the timeout to try to receive all those responses.
Maybe I am misunderstanding something then, because it seems the /diagnostics API is getting the number of routers from somewhere. The leader knows the roles of every node in the mesh, especially the routers. Nevertheless - the number of routers is right there in the diagnostics JSON:
"Connectivity": {
"ParentPriority": 0,
"LinkQuality3": 1,
"LinkQuality2": 2,
"LinkQuality1": 1,
"LeaderCost": 1,
"IdSequence": 115,
"ActiveRouters": 20,
"SedBufferSize": 1280,
"SedDatagramCount": 1
}
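The idea discussed above (stop waiting as soon as every router counted in ActiveRouters has answered, and otherwise return partial data at the timeout) could look roughly like the sketch below. It is not OTBR code: `collect_responses` and its `poll` callback are hypothetical names, and `poll` is assumed to block briefly and return one response dict, or None when nothing is pending.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def collect_responses(
    poll: Callable[[], Optional[dict]],
    expected_routers: int,
    timeout_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Dict[str, dict], bool]:
    """Collect responses until every expected router answered or the timeout expires.

    Returns (responses keyed by ExtAddress, True if the set is complete).
    """
    deadline = clock() + timeout_s
    responses: Dict[str, dict] = {}
    while len(responses) < expected_routers and clock() < deadline:
        item = poll()  # hypothetical source of unicast diagnostic responses
        if item is None:
            continue  # nothing pending yet; keep waiting until the deadline
        responses[item["ExtAddress"]] = item  # duplicates overwrite older data
    return responses, len(responses) >= expected_routers
```

With this shape the timeout is only an upper bound: a healthy mesh returns as soon as the last router answers, and a slow mesh returns whatever arrived along with a flag the UI could use to hint that the data is partial.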
The recently added mesh-diag APIs and CLI commands can help here:
This adds new APIs which use the underlying net-diag TMF commands to make it easier to discover topology. https://github.com/openthread/openthread/issues/8460 tracks new features (related PRs).
I have forked and tested some new values which provide a more reliable experience with the Topology page. There is a fair amount of work that could be done to improve this, but with the new timeout values we at least have a GUI that can capture all of the routers in a non-trivial Thread mesh fairly consistently.
I settled on 120 seconds for the kDiagResetTimeout and 10 seconds for the kDiagCollectTimeout. With these values I am able to get all of my mesh with 20 routers to populate with one or two reloads. Previously I was never able to get more than 11 routers to populate.
@ekobres I missed the ActiveRouters! But we still need to wait for the kDiagCollectTimeout period before receiving any responses.
@abtink Yes the new APIs should be more useful, but it probably doesn't help this issue if a RESTful API is required (@ekobres you may want to try Abtin's API if RESTful API isn't mandatory).
The recently added mesh-diag APIs and CLI commands can help here:
@wgtdkp Wow. It's fast, too. Thanks for pointing this out!
@ekobres coming around on this issue because I have realized the existing web UI topology tool is giving me the same problems as originally described here, with sometimes terribly disconnected graphs.
Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?
Is it possible we can use the "new" mesh-diag path in the web UI, such that the topology view would hit a new endpoint similar to /diagnostics, which would call the new path and then construct an appropriate graph?
It should be possible to leverage the newer Thread Diagnostics capabilities. Contributions are welcome! :D
I wish I could contribute on this, but sadly I do not have an OTBR device to deploy code to and test against. :(
Is there any doc on getting a full dev environment set up with a minimal (hopefully inexpensive) device?
@philipflesher, you can try the OTBR Codelab, which builds on Raspberry Pi.
Looks straightforward. I might get on this.
I realize, however, that if I'm going to make changes, I would need to simulate at least a medium-sized network, probably with some delays and failures. Is there any dev path for creating simulated networks that include routers and end devices?
Describe the bug
The Topology view in the Web GUI frequently only loads partial router data, showing an incomplete graph, even in a well-functioning network.
Inspection of the underlying /diagnostics service shows that there are 2 timeouts associated with returning and cleaning diagnostics information: https://github.com/openthread/ot-br-posix/blob/172dcc145fd662533f7c8b8576cfc5a0c3fe18ca/src/rest/resource.cpp#L78-L82
As I read it, this means a /diagnostics API request will terminate after 2 seconds (fine), and at the callback, any diagnostics data older than 3 seconds will be purged from the list (maybe not fine). Perhaps the design intent was for 20 seconds and 30 seconds? There are no notes or docs to clarify, but 2 and 3 seconds seems very optimistic.
Even a networkdiagnostic get ff02::1 0 1 6 call takes 6 seconds to complete on my system.
The actual behavior on my network with 21 nodes results in the Topology map showing a subset of actual network nodes, with as few as zero and as many as 13.
Looking at the code, and at the behavior, it seems that there's no way for any information older than 3 seconds to make it back from a call to /diagnostics.
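To illustrate the purge behavior as this report reads it, here is a minimal sketch rather than the actual resource.cpp logic. `purge_stale` is a hypothetical helper, the 3 second default stands in for kDiagResetTimeout, and the arrival times are made up.

```python
RESET_TIMEOUT_S = 3.0  # stands in for kDiagResetTimeout as described in this report


def purge_stale(entries: dict, now: float, reset_timeout_s: float = RESET_TIMEOUT_S) -> dict:
    """Keep only entries received within reset_timeout_s seconds of `now`.

    `entries` maps a router's ExtAddress to the time its response arrived.
    """
    return {addr: t for addr, t in entries.items() if now - t <= reset_timeout_s}


# Made-up arrival times for a query that takes about 6 seconds to finish.
arrivals = {"router-1": 0.5, "router-2": 2.5, "router-3": 5.8}
survivors = purge_stale(arrivals, now=6.0)
```

In this model only the response that arrived within the last 3 seconds survives the purge, which would match a Topology view that shows a changing subset of routers on a mesh whose responses are spread over more than 3 seconds.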
To Reproduce
Run the /diagnostics API on any Thread network on which this command routinely takes longer than 3 seconds to complete:
networkdiagnostic get ff02::1 0 1 6
Run:
curl http://<your-otbr-instance>:8081/diagnostics
Inspect the ExtAddress entities and note that there are missing items.
or:
Open the topology viewer in the Web GUI and note whether all network nodes are displayed.
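The manual check in the steps above can be scripted. The sketch below assumes the /diagnostics response is a JSON array of per-router objects that each carry an ExtAddress and a Connectivity.ActiveRouters field. That shape is inferred from the fragment quoted in this thread, not from published documentation, and both function names are hypothetical.

```python
import json
import urllib.request


def summarize_diagnostics(payload: str) -> dict:
    """Compare the routers present in a /diagnostics payload with ActiveRouters."""
    nodes = json.loads(payload)  # assumed: a JSON array of per-router objects
    ext_addresses = sorted({n["ExtAddress"] for n in nodes if "ExtAddress" in n})
    active = max(
        (n.get("Connectivity", {}).get("ActiveRouters", 0) for n in nodes),
        default=0,
    )
    return {
        "reported": len(ext_addresses),
        "active_routers": active,
        "missing": max(active - len(ext_addresses), 0),
        "ext_addresses": ext_addresses,
    }


def fetch_diagnostics(base_url: str, timeout_s: float = 30.0) -> str:
    """Fetch the raw /diagnostics body from an OTBR REST endpoint."""
    with urllib.request.urlopen(f"{base_url}/diagnostics", timeout=timeout_s) as resp:
        return resp.read().decode("utf-8")


# Usage against a live border router (replace the placeholder host first):
#   print(summarize_diagnostics(fetch_diagnostics("http://<your-otbr-instance>:8081")))
```

A non-zero "missing" value on a healthy mesh would indicate the partial results described in this report.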
Expected behavior
/diagnostics should enumerate available information to match the true network topology. Instead, it times out very quickly and returns partial results on anything larger than a trivially small network.
Console/log output
Node API information - note leader and number of routers.
Topology from same network:
Additional context
This is presumably only a real problem for the OTBR Web GUI Topology UI, as there is no published documentation for the REST API.
Recommended solution - test larger default values (e.g. a 30 second kDiagResetTimeout and a 20 second kDiagCollectTimeout) and consider adding timeout parameters to the API so apps can determine the response time and data freshness.
Alternative solution - provide configuration parameters for the OTBR Web GUI to adjust these timeouts.
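One way the suggested API timeout parameter could be handled is sketched below. Everything here is hypothetical: the existing REST API has no `timeout` query parameter, and the 2 second default and 30 second cap are illustrative values taken from the numbers discussed in this report.

```python
from urllib.parse import parse_qs, urlparse

DEFAULT_COLLECT_TIMEOUT_S = 2.0  # current collect timeout as described in this report
MAX_COLLECT_TIMEOUT_S = 30.0     # illustrative upper bound, not an OTBR constant


def collect_timeout_from_url(url: str) -> float:
    """Read a hypothetical ?timeout=<seconds> parameter, falling back to the default."""
    query = parse_qs(urlparse(url).query)
    raw = query.get("timeout", [None])[0]
    if raw is None:
        return DEFAULT_COLLECT_TIMEOUT_S
    try:
        value = float(raw)
    except ValueError:
        return DEFAULT_COLLECT_TIMEOUT_S
    if not value > 0:  # rejects zero, negative values, and NaN
        return DEFAULT_COLLECT_TIMEOUT_S
    return min(value, MAX_COLLECT_TIMEOUT_S)
```

Clamping keeps a client from holding a request open indefinitely, while still letting an app trade response time for completeness on a large mesh.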