openthread / openthread

OpenThread released by Google is an open-source implementation of the Thread networking protocol
https://openthread.io
BSD 3-Clause "New" or "Revised" License
3.54k stars 1.08k forks source link

SRP Client on the accessory used incorrect srp server IP address #6447

Closed Abhayakara closed 3 years ago

Abhayakara commented 3 years ago

The SRP client uses an IP address that isn't configured on the SRP server.

To Reproduce

Git commit id: d8a2cd9a70cfda7c2657645b4a57411eadaa4e61 Border router: Apple's implementation of Duckhorn router running on Raspi pi 4 Model B Accessory: running on nordic nrf52840-DK

Expected behavior We expect the SRP client to use the address advertised in the network data

Console/log output

I: 48707 [DIS]Found admin paring for admin 0, node BC5C01
D: 48712 [DIS]Using Thread MAC for hostname.
I: 48717 [SVR]Device completed Rendezvous process
E: 48911 [DL]Long dispatch time: 264 ms
D: 48915 [DL]In DriveBLEState
D: 48918 [DL]Indication for CHIPoBLE TX characteristic done (ConnId 0x00, result 0x00)
D: 48926 [DL]OpenThread State Changed (Flags: 0x1104d11d)
D: 48931 [DL]   Device Role: DETACHED
D: 48935 [DL]   Network Name: OpenThread
D: 48938 [DL]   PAN Id: 0xABCD
D: 48941 [DL]   Extended PAN Id: 0xDEAD00BEEF00CAFE
D: 48946 [DL]   Channel: 11
D: 48948 [DL]   Mesh Prefix: fdde:ad00:beef::/64
D: 48953 [DL]   Thread Unicast Addresses:
D: 48956 [DL]        fdde:ad00:beef:0:c41c:d4a5:3f62:b55e/64 valid preferred
D: 48963 [DL]        fe80::14fd:46fd:1be9:34db/64 valid preferred
D: 49486 [DL]OpenThread State Changed (Flags: 0x00000100)
I: 50507 [DL]SRP Client was started, as detected server addressed: fdf2:7a68:5f3e:108c:0:ff:fe00:fc11
D: 50517 [DL]OpenThread State Changed (Flags: 0x301332b7)
D: 50522 [DL]   Device Role: CHILD
D: 50525 [DL]   Network Name: MyHomeBR
D: 50528 [DL]   PAN Id: 0xABCD
D: 50531 [DL]   Extended PAN Id: 0x949CD1F473FF7008
D: 50536 [DL]   Channel: 11
D: 50538 [DL]   Mesh Prefix: fdf2:7a68:5f3e:108c::/64
D: 50543 [DL]   Partition Id: 0x33AA34DE
D: 50547 [DL]   Thread Unicast Addresses:
D: 50551 [DL]        fd31:296e:280b:0:fd04:43b4:5bc1:85ca/64 valid preferred
D: 50558 [DL]        fdf2:7a68:5f3e:108c:0:ff:fe00:c04/64 valid preferred rloc
D: 50565 [DL]        fdf2:7a68:5f3e:108c:c41c:d4a5:3f62:b55e/64 valid preferred
D: 50572 [DL]        fe80::14fd:46fd:1be9:34db/64 valid preferred
I: 50578 [NP]Sending IP Address. Delegate 0x20005b8c
D: 50583 [IN]Secure message was encrypted: Msg ID 0
I: 50588 [IN]Sending msg from 0x0000000000bc5c01 to 0x0000000000000000 at utc time: 50588 msec
I: 50596 [IN]Sending secure msg on connection specific transport
D: 50602 [DL]Sending indication for CHIPoBLE TX characteristic (ConnId 0, len 20)
I: 50609 [IN]Secure msg send status No Error
D: 50613 [DL]OpenThread State Changed (Flags: 0x00000001)
D: 50619 [DL]   Thread Unicast Addresses:
D: 50622 [DL]        fd31:296e:280b:0:fd04:43b4:5bc1:85ca/64 valid preferred
D: 50629 [DL]        fdf2:7a68:5f3e:108c:0:ff:fe00:c04/64 valid preferred rloc
D: 50636 [DL]        fdf2:7a68:5f3e:108c:c41c:d4a5:3f62:b55e/64 valid preferred
D: 50643 [DL]        fe80::14fd:46fd:1be9:34db/64 valid preferred
I: 50649 [NP]Sending IP Address. Delegate 0x20005b8c
D: 50655 [IN]Secure message was encrypted: Msg ID 1
I: 50659 [IN]Sending msg from 0x0000000000bc5c01 to 0x0000000000000000 at utc time: 50659 msec
I: 50668 [IN]Sending secure msg on connection specific transport
I: 50673 [IN]Secure msg send status No Error
D: 50796 [DL]Indication for CHIPoBLE TX characteristic done (ConnId 0x00, result 0x00)
D: 50804 [DL]Write request received for CHIPoBLE RX characteristic (ConnId 0x00)
D: 50811 [DL]Sending indication for CHIPoBLE TX characteristic (ConnId 0, len 20)
D: 50891 [DL]Indication for CHIPoBLE TX characteristic done (ConnId 0x00, result 0x00)
D: 50899 [DL]Sending indication for CHIPoBLE TX characteristic (ConnId 0, len 20)
D: 50988 [DL]Indication for CHIPoBLE TX characteristic done (ConnId 0x00, result 0x00)
D: 50996 [DL]Write request received for CHIPoBLE RX characteristic (ConnId 0x00)
D: 51003 [DL]Sending indication for CHIPoBLE TX characteristic (ConnId 0, len 20)
D: 51085 [DL]Indication for CHIPoBLE TX characteristic done (ConnId 0x00, result 0x00)
D: 51093 [DL]Sending indication for CHIPoBLE TX characteristic (ConnId 0, len 9)
D: 51183 [DL]Indication for CHIPoBLE TX characteristic done (ConnId 0x00, result 0x00)
D: 51191 [DL]Write request received for CHIPoBLE RX characteristic (ConnId 0x00)
D: 51198 [DL]Sending indication for CHIPoBLE TX characteristic (ConnId 0, len 20)
D: 51281 [DL]ConnId: 0x00, New CCCD value: 0x0000
D: 51285 [DL]SetAdvertisingEnabled(false)
I: 51289 [SVR]Persisting admin ID 0, next available 1
I: 51481 [SVR]Persisting admin ID successfully
I: 51486 [SVR]OnRendezvousError: Ble Error 6004 (0x00001774): BLE central unsubscribed
I: 51494 [BLE]Releasing end point's BLE connection back to application.
E: 51500 [DL]Long dispatch time: 219 ms
D: 51504 [DL]Indication for CHIPoBLE TX characteristic done (ConnId 0x00, result 0x00)
E: 51511 [BLE]no endpoint for BLE sent data ack
D: 51516 [DL]In DriveBLEState
E: 52593 [DL]OnSrpClientNotification: Timed out waiting on server response
I: 53669 [DL]BLE GAP connection terminated (reason 0x13)
I: 53674 [DL]Current number of connections: 0/1
D: 53678 [DL]In DriveBLEState
E: 55686 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 60921 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 69794 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 84856 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 110423 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 153858 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 227671 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 353131 [DL]OnSrpClientNotification: Timed out waiting on server response
E: 566380 [DL]OnSrpClientNotification: Timed out waiting on server response 

Here are the addresses configured on the Thread interface on the BR:

wpan0: flags=4305<UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>  mtu 1500
        inet6 fdf2:7a68:5f3e:108c:eba6:56f3:c6c2:9d91  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::e015:3479:fe5:62fb  prefixlen 64  scopeid 0x20<link>
        inet6 fdf2:7a68:5f3e:108c:0:ff:fe00:fc00  prefixlen 64  scopeid 0x0<global>
        inet6 fd31:296e:280b:0:4a1f:c44f:47e1:a17b  prefixlen 64  scopeid 0x0<global>
        inet6 fdf2:7a68:5f3e:108c:0:ff:fe00:fc10  prefixlen 64  scopeid 0x0<global>
        inet6 fdf2:7a68:5f3e:108c:0:ff:fe00:c00  prefixlen 64  scopeid 0x0<global>
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00  txqueuelen 500  (UNSPEC)
        RX packets 83  bytes 4781 (4.6 KiB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 24  bytes 3533 (3.4 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Additional context Add any other context about the problem here.

The code here seems to be constructing an address in a way that's not valid and doesn't actually produce an address that's configured on the Border Router (SRP server):

https://github.com/openthread/openthread/blob/main/src/core/thread/network_data_service.cpp#L163

As you can see in the console output, the address is constructs is this:

fdf2:7a68:5f3e:108c:0:ff:fe00:fc11

This is a mesh-local address, which isn't in the advertisement, and is not actually an address configured on the BR.

jwhui commented 3 years ago

@Abhayakara , thanks for raising this issue!

My guess is that the Service/Server TLV format used to advertise the SRP Server is not compatible between the two implementations. OpenThread currently only includes the RLOC16 and Port information, rather than the full IPv6 address.

Can you provide one or both of:

  1. Service/Server TLV format used by Apple's implementation.
  2. Output of netdata show using the OpenThread CLI.
abtink commented 3 years ago

@Abhayakara I think as Jonathan said the format of the TLVs are different between the implementations. As you know we are still discussing this in Thread TC calls (the mechasnim on how to discover SRP servers). Moving forward I think we want to converge and use the model we discussed:


The address here df2:7a68:5f3e:108c:0:ff:fe00:fc11 seems to be actually the anycast address associated with network data service entry. I guess the client was running as a sleepy device and is not configured to get full network-data (so it only gets the ALOC16 in the network-data from which the anycast address is derived).

The current OT implementation uses the following format for ServerTlv data (basically just a port number). https://github.com/openthread/openthread/blob/a9d43fac5da39cc108cb3b57eb7df835db611c80/src/core/thread/network_data_service.hpp#L174-L207 The service number itself can be configured using OPENTHREAD_CONFIG_SRP_SERVER_SERVICE_NUMBER (which seems to be set by default to 0x5d).

The question is if we want to add support for the current format of Service TLV used by Apple implementation (for legacy compatibility) or if we can move to the new model. If we do need it for legacy compatibility I see two ways:

Thoughts?

abtink commented 3 years ago

@Abhayakara, is wpantund being used on the border router side (on RPi)?

If so, please note that wpantund default behavior is to filter (not add) the RLOC and ALOC IPv6 addresses from NCP on the host "wpan0" netif that it manages (there are some historical reasons why this behavior was desired and become default mode).

I guess this may be the reason why the anycast address is not seen in the list of addresses on "wpan0" on the BR.

This behavior can be changed through two config properties in wpantund:

Abhayakara commented 3 years ago

It's going to be necessary to be able to advertise an SRP service that's not on the Thread network at all, so it makes sense to support the ability to choose a server that's got an IPv6 address that's not mesh-local and not anycast. Separately, we should support anycast. Supporting the external SRP server is backwards-compatible with what HomePod Minis currently advertise, so this gets us backward compatibility for free.

Using different formats for different implementations won't be interoperable, so that's definitely not a good plan. That is, you couldn't mix and match Apple routers and other routers, for example. Given that anycast gives us what we want as the default behavior, I think just specifying the IPv6 address and port is a good secondary behavior, and we don't need the flexibility of having yet another format.

We are not using wpantund on the Raspberry Pi Duckhorn Border Router implementation.

jwhui commented 3 years ago

It's going to be necessary to be able to advertise an SRP service that's not on the Thread network at all, so it makes sense to support the ability to choose a server that's got an IPv6 address that's not mesh-local and not anycast.

In the spirit of minimizing optionality, what about having the border router serve as a proxy when the SRP service is not on the Thread network? Thread devices then only need to worry about the anycast address and the network data is more compact. In either case, the border router will need to know the IP and port of the SRP service.

Abhayakara commented 3 years ago

That would be a bit more work on the border router. Does this really buy us anything? In the default case the network is going to be more compact anyway, right?

jwhui commented 3 years ago

That would be a bit more work on the border router. Does this really buy us anything? In the default case the network is going to be more compact anyway, right?

I agree it shifts the complexity from SRP clients to the border router.

I can also see the value of maintaining the flexibility of specifying the IPv6 address and port.

So we can support two separate SRP service types to encode anycast and unicast. If both appear in the network data for whatever reason, can we have the SRP client always prefer one? It'd be simpler for the SRP client to only have to deal with maintaining registrations with a single SRP server.

Abhayakara commented 3 years ago

If we can signal the availability of anycast without advertising it as a service that might be better. The way my client works is to try successive servers if it doesn't get through to the first one. Anycast makes this impossible, because the client has no way to choose a specific server. I think that if anycast is present, it makes sense to just register with the anycast server; BRs shouldn't advertise anycast if there's an infrastructure SRP server. We'd need to specify how that's configured, of course.

jwhui commented 3 years ago

If we can signal the availability of anycast without advertising it as a service that might be better.

I'm not sure I understand the distinction between "anycast" and "service". Is the goal generalize the anycast address to support more than just SRP?

I believe just signaling the availability of anycast in Thread requires the Service TLV mechanism. The Thread network data needs to include identifiers (RLOCs) for each of the border routers that are announcing the availability of anycast - it's how Thread routers know where they can route anycast packets.

I think the best path forward is to just have two different Service TLVs. The "Service TLV" is just a name for a mechanism. The SRP client can then prefer anycast if present.

Anycast TLV:

  1. Type (1 byte)
  2. Length (1 byte)
  3. Flags (1 byte)
  4. Service Data Length (1 byte)
  5. Service Number (1 byte)
  6. SRP number for re-registration (1 byte)
  7. ... Server TLVs ...

SRP Infrastructure Service TLV:

  1. Type (1 byte)
  2. Length (1 byte)
  3. Flags (1 byte)
  4. Service Data Length (1 byte)
  5. Service Number (1 byte)
  6. IPv6 Address (16 bytes)
  7. Port (2 bytes)
  8. ... Server TLVs ...

Both would include the same Server TLVs that just encode the RLOC16s.

abtink commented 3 years ago

The two service TLV definitions look great to me.

I think in both ServiceTLV definitions in between 4 and 5 we need one more byte for a "service number" (as part of the Service TLV data blob). So:

The Anycast model Service TLV Data is 2 bytes:

The SRP infra Service TLV Data is 19 bytes:

jwhui commented 3 years ago

I think in both ServiceTLV definitions in between 4 and 5 we need one more byte for a "service number" (as part of the Service TLV data blob).

Yes, thanks for catching that. I updated my comment above.

Abhayakara commented 3 years ago

Right. So with the IP address service, we have the enterprise number (44970), one byte of service data (0x5d), and 18 bytes of server data, with the IP address and port number in network byte order.

abtink commented 3 years ago

Sounds good. 44970 is Thread Enterprise number (which is encoded as compressed in Service TLV). 0x5d would be the service number (for The SRP infra Service TLV model).

In Jonathan's suggestion from https://github.com/openthread/openthread/issues/6447#issuecomment-819952599, the IP and port are part of the Service TLV data and not Server data.

I think both cases (including it in Service or Sever data) can be useful. If there is a common infra SRP server and multiple BRs want to add it in network data it would be good to add the info in Service Data (otherwise we can end up with the same address being encoded in network data multiple times).

One idea is to make it flexible, i.e. when parsing network-data for such entries, we can allow/accept address and port info to be encoded in either Service Data or Server Data. I think this will be relatively straight-forward to implement and with this we can be compatible with the current model used by Apple implementation/devices while supporting future models.

Thoughts?

Abhayakara commented 3 years ago

I'm not sure how big a problem this is, but with this approach, existing thread accessories that interoperate with HomePod will not successfully discover the service if it's in the service data. That said, we definitely want this efficiency, and this is probably the best way to get it.

abtink commented 3 years ago

Submitted PR https://github.com/openthread/openthread/pull/6501 which implements what we discussed above.

Adding some quick notes on this:

jwhui commented 3 years ago

Resolved by #6501