openthread / ot-br-posix

OpenThread Border Router, a Thread border router for POSIX-based platforms.
https://openthread.io/
BSD 3-Clause "New" or "Revised" License
418 stars 232 forks source link

[netif] Failed to process request#2: No such process #2321

Closed OlivierGre closed 3 months ago

OlivierGre commented 4 months ago

I feel a bit sorry to ask you some advices but I'm stuck.

I've been working on Thread/Matter for a couple of years and I seldom have to "touch" at the OTBR. Yesterday I have decided to rebuild it to add the option to monitor the Thread channels usage.

Since that time, I fail to get the OTBR up and running again and it is making me crazy!

Yesterday I have changed a few things simultaneously (I was maybe too confident!):

I'm attaching the syslog when the Raspberry Pi is booting and starting the otbr-agent.

Would you know what could cause this error?

Jun  6 14:24:30 raspberrypi otbr-agent[785]: 00:00:00.332 [I] Platform------: [netif] Host netif is down
Jun  6 14:24:30 raspberrypi otbr-agent[785]: 00:00:00.333 [I] Platform------: [netif] Succeeded to process request#1
Jun  6 14:24:30 raspberrypi otbr-agent[785]: 00:00:00.333 [W] Platform------: [netif] Failed to process request#2: No such process

Is it normal that netif is down at that time? Thank you, regards

syslog_2024_06_06.log

superwhd commented 4 months ago

I used to see the similar 'netif' logs on a well-functioning BR so it may just be a false alarm. I was planning to fix the logging but haven't got some time.

To confirm this, you may just try listing the IP addresses on wpan0 (ifconfig wpan0) and compare it with ot-ctl ipaddr. If they have the same set of addresses, then it's fine.

I didn't succeed to get the OTBR running

The log file looks normal so far. Could you elaborate how you noticed the failure? Did the otbr-agent service fail to start? Or you observed disconnectivity of devices?

OlivierGre commented 4 months ago

The OTBR agent is not working properly, that why I have looked at the syslog. For example this command is failing:

pi@raspberrypi:~ $ sudo ot-ctl dataset active
Error 23: NotFound

In my previous message, I had filtered the syslog to keep only the things related to the otbr-agent. Here is the full log. syslog.log

I can see the following errors:

Jun  6 14:54:41 raspberrypi lightdm[714]: Error getting user list from org.freedesktop.Accounts: GDBus.Error:org.freedesktop.DBus.Error.ServiceUnknown: The name org.freedesktop.Accounts was not provided by any .service files

Jun  6 14:54:41 raspberrypi tayga[768]: error: unexpectedly disconnected from boot status daemon

Here is the otbr-agent status

pi@raspberrypi:~ $ sudo service otbr-agent status
● otbr-agent.service - OpenThread Border Router Agent
     Loaded: loaded (/lib/systemd/system/otbr-agent.service; enabled; vendor preset: enabled)
     Active: active (running) since Thu 2024-06-06 14:54:41 CEST; 12min ago
    Process: 429 ExecStartPre=/usr/sbin/service mdns start (code=exited, status=0/SUCCESS)
   Main PID: 772 (otbr-agent)
      Tasks: 1 (limit: 3933)
        CPU: 432ms
     CGroup: /system.slice/otbr-agent.service
             └─772 /usr/sbin/otbr-agent -I wpan0 -B wlan0 spinel+hdlc+uart:///dev/ttyUSB0 trel://wlan0

Jun 06 14:54:42 raspberrypi otbr-agent[772]: 00:00:00.296 [I] RouterTable---: Route table
Jun 06 14:54:42 raspberrypi otbr-agent[772]: 00:00:00.300 [I] Platform------: [netif] Host netif is down
Jun 06 14:54:42 raspberrypi otbr-agent[772]: 00:00:00.300 [I] Platform------: [netif] Succeeded to process request#1
Jun 06 14:54:42 raspberrypi otbr-agent[772]: 00:00:00.300 [W] Platform------: [netif] Failed to process request#2: No such process
Jun 06 14:54:42 raspberrypi otbr-agent[772]: [INFO]-MDNS----: Received reply for service OpenThread BorderRouter #25A0._meshcop._>
Jun 06 14:54:42 raspberrypi otbr-agent[772]: [INFO]-MDNS----: Successfully registered service OpenThread BorderRouter #25A0._mesh>
Jun 06 14:54:42 raspberrypi otbr-agent[772]: [INFO]-BA------: Result of publish meshcop service OpenThread BorderRouter #25A0._me>
Jun 06 14:54:42 raspberrypi otbr-agent[772]: [INFO]-BA------: Result of publish meshcop service OpenThread BorderRouter #25A0._me>
Jun 06 14:54:46 raspberrypi otbr-agent[772]: 00:00:04.952 [I] InfraIf-------: State changed: NOT RUNNING -> RUNNING
Jun 06 15:02:21 raspberrypi otbr-agent[772]: 00:07:21.141 [I] Platform------: Session socket is ready
superwhd commented 4 months ago

What's the output of ot-ctl ifconfig and ot-ctl state?

pi@raspberrypi:~ $ sudo ot-ctl dataset active Error 23: NotFound

From this log it indicates that you haven't set active dataset. The logs it also indicates your OTBR didn't attempt to join any network.

You can either:

  1. Set an existing dataset
    ot-ctl dataset set active <dataset hex>
  2. Create a new dataset
    ot-ctl dataset init new
    ot-ctl dataset commit active

Please also ensure you have started thread by ot-ctl ifconfig up and ot-ctl thread start.

OlivierGre commented 4 months ago
pi@raspberrypi:~ $ sudo ot-ctl ifconfig
up
Done
pi@raspberrypi:~ $ sudo ot-ctl state
detached
Done

I have a script which configures the dataset:

#!/bin/bash 
sudo ot-ctl thread stop 
sudo ot-ctl ifconfig down 

sudo ot-ctl channel 15 
sudo ot-ctl extpanid ################# 
sudo ot-ctl networkkey #################
sudo ot-ctl networkname ################# 
sudo ot-ctl panid ################# 
sudo ot-ctl pskc ################# 

sudo ot-ctl ifconfig up 
sudo ot-ctl thread start 

echo "Please wait..." 
echo "" 
sleep 5 
sudo ot-ctl dataset active 
echo "" 
sudo ot-ctl netdata show 

Do I miss some dataset command to set this as the active dataset? Thanks

OlivierGre commented 4 months ago
pi@raspberrypi:~ $ sudo ot-ctl dataset init new
Done
pi@raspberrypi:~ $ 
pi@raspberrypi:~ $ sudo ot-ctl dataset commit active
Done

Then "sudo ot-ctl dataset active" is working but is not using my settings.

If I call my script after that, the commands are OK but I still get "Error 23: NotFound" when I ask to display the active dataset.

I probably have missunderstood something about the use of this active dataset. As indicated on my first message, I seldom have to look at the OTBR.

jwhui commented 4 months ago

Thread requires configuring the full Operational Dataset to work properly.

See README_DATASET.md for more information on how to properly configure the Operational Dataset.

OlivierGre commented 4 months ago

I don't know why what we were doing till now to set the dataset is no more working. Nevermind...

I can see that the Active Operational Dataset contains:

Is it possible to init those fields with default values and then only update the ones that we want to force? (channel, extpanid, networkkey, networkname, panid, pskc)

I'm also wondering if a Matter Commissioner needs to know the other fields:

jwhui commented 4 months ago

You can initialize the Dataset first and then change specific fields. To initialize:

ot-ctl dataset init new

A Matter Commissioner needs to know the full Operational Dataset. It doesn't really need to know the individual fields, it just needs to have the TLV-encoded Operational Dataset as a binary object, which is what Matter does.

OlivierGre commented 4 months ago

Thank you Jonathan.

A Matter Commissioner needs to know the full Operational Dataset. It doesn't really need to know the individual fields, it just needs to have the TLV-encoded Operational Dataset as a binary object, which is what Matter does.

On Commissioner side, we are setting the individual fields mainly to be able to change of channel (if a channel is too busy). We of course update the OTBR channel accordingly.

By the way, I have activated the "channel monitor" feature on my OTBR but I'm getting strange results. I'm at home today and my OTBR is configured with Channel 15. When I call the "channel monitor" command, it says ch 11 is busy (with a % varying around 16 to 33) and all the other channels (including channel 15) are at 0% busy. I don't understand why and I'm wondering if this monitoring is reliable.

abtink commented 4 months ago

By the way, I have activated the "channel monitor" feature on my OTBR but I'm getting strange results. I'm at home today and my OTBR is configured with Channel 15. When I call the "channel monitor" command, it says ch 11 is busy (with a % varying around 16 to 33) and all the other channels (including channel 15) are at 0% busy. I don't understand why and I'm wondering if this monitoring is reliable.

Channel monitor runs in background taking RSSI samples infrequently and over time builds a picture of noise/inference level on channels. Suggest at least 8 hours of it running. It is intended as a utility tool for debugging.

I can refer you to its related configs and its API documentations which can help you understand what it does better.

OlivierGre commented 4 months ago

A Matter Commissioner needs to know the full Operational Dataset. It doesn't really need to know the individual fields, it just needs to have the TLV-encoded Operational Dataset as a binary object, which is what Matter does.

I'm coming back to this point because it is not clear to me: Matter's CHIPTool application is setting the dataset from the following individual fields:

Screenshot: Screenshot_20230925_091504_CHIPTool

During a Commissioning, Chiptool is concatenating those 4 individual fields into a byte array and it is passing it to Matter SDK, which will pass it to the commissioned device. Is it enough to pass those 4 fields to the commissioned device? It seems to be in contradiction with your sentence "A Matter Commissioner needs to know the full Operational Dataset", isn't it? I'm aware that I'm now out of OTBR scope :)

jwhui commented 4 months ago

@OlivierGre

Matter's CHIPTool application is setting the dataset from the following individual fields:

This is an implementation detail. Matter Specification defines the Network Commissioning Cluster. The AddOrUpdateThreadCommand has an OperationalDataset field and states:

The encoding for the OperationalDataset field is defined in the Thread specification.

The client SHALL pass the OperationalDataset as an opaque octet string.
OlivierGre commented 4 months ago

Thanks for the explanation. I have opened Thread 1.3 specification to see the content of the OperationalDataset field.

With Chiptool source code open in parallel, I can see what is put in the OperationalDataset TLV and it corresponds to "8.10.1 Network Managements TLVs":

At that stage, it was not obvious to me to understand what "operational dataset" we are talking about. I have looked at the active and pending operational datasets and it didn't match. I think I have finally found the right one. I see it in 8.4.6.2 Secure Out of band Commissioning. 8.4.6.3.3 defines the "Network join request" and it requires the following parameters:

jwhui commented 4 months ago

I see it in 8.4.6.2 Secure Out of band Commissioning.

The interfaces here are purely conceptual. Matter does not use this.

As I mentioned, Matter uses the Network Commissioning Cluster which contains a OperationalDataset field. That refers to the Active Operational Dataset in the Thread Specification.

OlivierGre commented 4 months ago

As I mentioned, Matter uses the Network Commissioning Cluster which contains a OperationalDataset field. That refers to the Active Operational Dataset in the Thread Specification.

It is getting clearer but some information is still missing: As you said, Matter uses the Network Commissioning Cluster which contains a OperationalDataset field. The encoding for this OperationalDataset field is defined in the Thread specification.

I'm looking at Thread 1.3 specification to see the content of the Active Operational Dataset. "4.4.23 Active Operational Dataset TLV" indicates to look at "§8.10.1 Network management TLVs" which lists the possible fields. I'm trying to understand which ones are mandatory and which ones are optional. Right now I haven't found this information in Thread specification.

In a previous message, you said that a "Full Operational Dataset" should be transmitted. So I would expect all the TLV fields defined in 8.10.1 are sent but this is not what I observe when I look at the content of Matter Network Commissioning Cluster sent by ChipTool. I see those 4 fields only:

And I can see that the Matter device (which is receiving this information) successfully joins the Thread network so that's seems sufficient.

Can you please point me the paragraph in Thread specification indicating the fields that should be present in the Active Operational Dataset?

Thank you

OlivierGre commented 4 months ago

While doing some tests, I’m observing something Weird! I have noticed that, if I set some incorrect values for Channel, PAN ID and Extended PAN ID in ChipTool, the Commissioning still works. If I set a wrong Network Key, the commissioning fails (I can see that the device doesn’t succeed to join the network).

Could it be due to the Security Policy used on my Border Router? On my Border Router, I can see : Security Policy : 672 onrc 0

jwhui commented 4 months ago

@OlivierGre , OpenThread can support commissioning with only the Network Key. In this case, OpenThread will scan across all channels and send request to the broadcast PAN ID. Extended PAN ID and Network Name provide no value during the attach process.

Providing channel is more efficient because the device does not have to scan across channels.

Providing PAN ID is more efficient because the device does not have to send requests to the broadcast PAN ID.

A Thread device must have the full Operational Dataset to fully participate in a Thread network. As a result, when only providing a partial Operational Dataset, the Thread device must first attach to a neighboring Thread device so that it can retrieve the full Operational Dataset. For example, when only providing a partial dataset to a router-capable device, that device cannot serve as a router/leader until it can fully attach to a neighboring Thread device.

However, when providing the full Operational Dataset, then the device can operate as a full router/leader without having to first attach to a neighboring Thread device. This is especially useful in scenarios like replacing a broken device.

OlivierGre commented 4 months ago

Thank you so much Jonathan for this very interesting and useful information!