project-chip / certification-tool

A test harness and tooling designed to simplify development, testing, and certification for devices, guided by the Connectivity Standards Alliance.
https://csa-iot.org/
Apache License 2.0

[Bug] mDNSResponder: Default: mDNSCoreReceiveResponse: Unexpected conflict discarding #423

Open OlivierGre opened 2 months ago

OlivierGre commented 2 months ago

Describe the bug

I have installed the latest TH environment ("v2.11-beta3.1+fall2024") on my RPi4. To do so, I had to upgrade it to Ubuntu 24.04 (side note: I did an in-place upgrade, not a reinstall from scratch).

The OTBR is started with: ./certification-tool/backend/test_collections/matter/scripts/OTBR/otbr_start.sh

I then run into a problem when I try to commission a Thread device. The Thread device receives the correct dataset but fails to join the Thread network.

On the RPi, I have used the command "docker logs otbr-chip" to see the log. I'm attaching it here. In this log, I observe the following:

* Starting Avahi mDNS/DNS-SD Daemon avahi-daemon
Failed to open file '/etc/avahi/avahi-daemon.conf': No such file or directory
   ...fail!

According to this message on Slack (https://csamembers.slack.com/archives/C03MA7WR7Q8/p1668433913997599?thread_ts=1668432778.252049&cid=C03MA7WR7Q8), Avahi should not run at all in the OTBR docker container; mDNSResponder is used instead. So this avahi-daemon error may be normal.

In the log, I can also see this message: "Failed to write CLI output: Broken pipe". I don't know how critical it is.

And last but not least: while looking at mDNSResponder (which is used instead of Avahi), I can see these errors:

Sep 24 09:46:46 ubuntu-rpi mDNSResponder: Default: mDNSCoreReceiveResponse: Received from 172.17.0.1:5353   20 1.0.17.172.in-addr.arpa. PTR ubuntu-rpi-4.local.
Sep 24 09:46:46 ubuntu-rpi mDNSResponder: Default: mDNSCoreReceiveResponse: Unexpected conflict discarding   18 1.0.17.172.in-addr.arpa. PTR ubuntu-rpi.local.

Do you think that my commissioning problem is due to this error? How can I fix these "Unexpected conflict" messages?
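To see how often the conflict occurs and which records are involved, the relevant lines can be filtered out of the log. The following is a minimal sketch: `mdns_conflicts` is a hypothetical helper name, and it assumes the log lines have the same layout as the two lines quoted above (record name, type, and target as the last three fields).

```shell
# Print the record name, type, and target of every mDNSResponder
# "Unexpected conflict discarding" line read from stdin.
# Assumes the line layout shown in the report (last three fields).
mdns_conflicts() {
  awk '/Unexpected conflict discarding/ { print $(NF-2), $(NF-1), $NF }'
}

# Hypothetical usage on the RPi (container name taken from the report):
#   docker logs otbr-chip 2>&1 | mdns_conflicts | sort | uniq -c
```

In the two lines quoted above, the same reverse record (1.0.17.172.in-addr.arpa.) is announced with two different host names, ubuntu-rpi-4.local. and ubuntu-rpi.local., which is what the filter would surface.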

Steps to reproduce the behavior

Start the OTBR: ./certification-tool/backend/test_collections/matter/scripts/OTBR/otbr_start.sh

Look at the logs: docker logs otbr-chip

Expected behavior

No error

Log files

otbr_logs.txt

PICS file

No response

Screenshots

No response

Environment

TH v2.11-beta3.1+fall2024 on Ubuntu 24.04

Additional Information

No response

antonio-amjr commented 2 months ago

Hi @OlivierGre,

Are you trying the commissioning through the TH UI? Can you share the commissioning error logs you're seeing and more details on the test you're running? Please also share the project config and any other pertinent data.

OlivierGre commented 2 months ago

Hi Antonio, My final goal is to commission with TH but, because I had problems with the previous TH version (v2.10.1+spring2024), I'm progressing step by step and first checking whether commissioning works when the OTBR is launched manually. I realize that I forgot to mention that I call chip-tool to launch the commissioning. This is the reason for the confusion. Sorry for that.

So, to replicate the problem, I do:

./certification-tool/backend/test_collections/matter/scripts/OTBR/otbr_start.sh

I copy the dataset into an environment variable called DataSet. At that step, I check that the OTBR is in the running state.

Then I launch the commissioning with: ./apps/chip-tool pairing ble-thread 1 hex:$DataSet 20202021 3840

The commissioning fails because the device is not able to join the operational network. I then check the logs with "docker logs otbr-chip". I have attached the log. I have no problem if I commission this device to another Thread network (not based on this docker OTBR).
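The steps above can be sketched as a small helper that checks the dataset before handing it to chip-tool. This is only an illustration: `build_pairing_cmd` is a hypothetical name, and the node ID (1), setup PIN (20202021) and discriminator (3840) are simply the values from the command quoted above.

```shell
# Build the chip-tool pairing command from a hex-encoded Thread dataset.
# Node ID 1, setup PIN 20202021 and discriminator 3840 are the values
# used in this thread; build_pairing_cmd is a hypothetical helper.
build_pairing_cmd() {
  dataset=$1
  # Reject empty input or anything containing a non-hex character.
  case $dataset in
    ''|*[!0-9a-fA-F]*) echo "dataset must be a hex string" >&2; return 1 ;;
  esac
  # A byte string always has an even number of hex digits.
  if [ $(( ${#dataset} % 2 )) -ne 0 ]; then
    echo "dataset has an odd number of hex digits" >&2
    return 1
  fi
  echo "./apps/chip-tool pairing ble-thread 1 hex:$dataset 20202021 3840"
}
```

Calling `build_pairing_cmd "$DataSet"` prints the command to run, or an error if the variable is empty or was copied with stray characters. How the dataset is obtained is an assumption here; on an OpenThread Border Router it is typically printed by `ot-ctl dataset active -x`.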

Shall I reassign the ticket to you?

Thank you

OlivierGre commented 2 months ago

I have some interesting findings:

Before installing "v2.11-beta3.1+fall2024", I was on "v2.10.1+spring2024". Commissioning was working when I started the OTBR manually with "otbr_start.sh" and chip-tool, but not when the OTBR was started by TH.

I then moved to "v2.11-beta3.1+fall2024" and commissioning no longer works at all.

Today, I cloned the version "v2.10.1+spring2024" of certification-tool into a separate directory (certification-tool-old). NB: I did not call the auto-update script, to avoid interference with my environment for "v2.11-beta3.1+fall2024".

From this certification-tool-old directory, I launched the OTBR: ./scripts/OTBR/otbr_start.sh (this is the former location of the script)

Then I launched the commissioning. It works without problem (I have tested it twice).

Then I used the certification-tool directory for "v2.11-beta3.1+fall2024" and commissioning still fails the same way. The device does not manage to attach.

So it seems that there is something wrong with the new scripts.
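One way to narrow this down is to compare the two versions of the script directly. A minimal sketch, assuming both checkouts sit side by side as described above; `script_diff` is a hypothetical helper name.

```shell
# Show the lines that differ between two versions of otbr_start.sh,
# ignoring changes in the amount of whitespace.
script_diff() {
  status=0
  diff -u -b "$1" "$2" || status=$?
  # diff exit codes: 0 = identical, 1 = differences found, 2 = trouble
  # (for example a missing file). Only the last case is treated as failure.
  [ "$status" -le 1 ]
}

# Hypothetical usage, with the two paths mentioned in this thread:
#   script_diff certification-tool-old/scripts/OTBR/otbr_start.sh \
#     certification-tool/backend/test_collections/matter/scripts/OTBR/otbr_start.sh
```

Lines prefixed with `-` come from the old script and lines prefixed with `+` from the new one, so any changed Thread parameter shows up as a pair.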

I'm attaching 4 logs:

log_2024_09_25_commissioning_nok_chiptool_traces.txt
log_2024_09_25_commissioning_nok_docker_logs.txt
log_2024_09_25_commissioning_ok_chiptool_traces.txt
log_2024_09_25_commissioning_ok_docker_logs.txt

When the commissioning is working, the docker log contains the same errors about avahi-daemon, "Failed to write CLI output: Broken pipe" and "Unexpected conflict discarding XXXX".
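Since the same messages appear in both the working and the failing docker log, it can help to list only the lines that are unique to the failing one. The sketch below is an assumption-laden illustration: both helper names are hypothetical, and it assumes lines start with a syslog-style timestamp as in the mDNSResponder lines quoted earlier. Lines that differ only in counters or addresses will still show up as differences.

```shell
# Strip a leading syslog-style timestamp ("Sep 24 09:46:46 ") from each line
# so that two captures taken at different times can be compared.
strip_syslog_ts() {
  sed -E 's/^[A-Z][a-z]{2} +[0-9]{1,2} [0-9]{2}:[0-9]{2}:[0-9]{2} //'
}

# Print the lines that occur only in the failing log ($2),
# not in the working one ($1).
only_in_failing() {
  ok_sorted=$(mktemp)
  nok_sorted=$(mktemp)
  strip_syslog_ts < "$1" | sort -u > "$ok_sorted"
  strip_syslog_ts < "$2" | sort -u > "$nok_sorted"
  # comm -13 hides lines unique to the first file and lines common to both.
  comm -13 "$ok_sorted" "$nok_sorted"
  rm -f "$ok_sorted" "$nok_sorted"
}
```

For example: `only_in_failing log_2024_09_25_commissioning_ok_docker_logs.txt log_2024_09_25_commissioning_nok_docker_logs.txt`.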

OlivierGre commented 2 months ago

I'm currently updating my RCP firmware. I was on rcp_thread_1_3_nrf52840dongle_nrf52840_v2.0.2.zip and I'm installing the version indicated in the user guide (https://groups.csa-iot.org/wg/matter-csg/document/34870): otbr_9185bda_nrf52840dongle_ncs_2_43.zip

antonio-amjr commented 2 months ago

Hi @OlivierGre,

Thanks for the meticulous feedback. That's helpful. Let me know if you make any progress with the new firmware and share any news.

I'll analyze the files you shared and try to run some tests; maybe I'll spot something.

OlivierGre commented 2 months ago

Hi Antonio, I have the same problem with the new RCP. I have been blocked on this for one week and I look forward to continuing my development.

antonio-amjr commented 2 months ago

Hi Olivier,

I couldn't reproduce the error on my side, unfortunately. I managed to start the OTBR and pair using the chip-tool command you mentioned before on v2.11-beta3.1+fall2024. I'm still analyzing the logs.

Can you test something? Can you try replacing the current otbr_start.sh in the v2.11-beta3.1+fall2024 release with the old one?

I'm attaching it below as a .txt file. You could just copy/paste it over the current one: otbr_start.txt

It's just a theory but maybe the new version with the fixed thread parameters could be causing a conflict. Let me know if that unblocks you.

OlivierGre commented 2 months ago

I'm in the middle of installing Ubuntu (Desktop) 24.04 from scratch on a new SD Card.

Yesterday I followed your discussion in https://github.com/project-chip/certification-tool/issues/420 and tried the script modified with BR_VARIANT added. It didn't fix my issue.

I am continuing the installation on the new SD card. Are you using a Nordic RCP?

antonio-amjr commented 2 months ago

I see, so you already tested the old script out...

Yeah, my test environment here is a Nordic RCP dongle with the Raspberry Pi, plus a Nordic nRF52840 DK for the sample apps.

Let me know the results for the SD card from scratch.

OlivierGre commented 2 months ago

Hi Antonio, I have tried the version installed from scratch and I have the same problem. I have an RPi4 with a fresh Ubuntu 24.04 + a fresh "v2.11-beta3.1+fall2024" + an RCP with up-to-date firmware.

I suspect that it is related to the OTBR settings; I will give it a try in a while.

OlivierGre commented 2 months ago

I have used ot-ctl commands to set a DataSet that was working but it did not solve the problem.

Active Timestamp: 1
Channel: 25
Channel Mask: 0x07fff800
Ext PAN ID: 5b35dead5b35beef
Mesh Local Prefix: fde5:2503:736c:305f::/64
Network Key: 00112233445566778899aabbccddeeff
Network Name: 5b35
PAN ID: 0x5b35
PSKc: 3def89d2b8fc409eb0143d4450cfc0f7
Security Policy: 672 onrc
Done

OlivierGre commented 2 months ago

I should now try a different end device. I'm not available to try that in the coming hours, I will try later
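As a side check on the dataset printed earlier in this thread, the Channel Mask field can be decoded to see which channels it allows. A minimal sketch, assuming the usual convention that bit n of the mask stands for channel n; `channels_in_mask` is a hypothetical helper name.

```shell
# List the channel numbers whose bit is set in a Thread channel mask.
# Assumes bit n of the mask corresponds to channel n.
channels_in_mask() {
  mask=$(( $1 ))
  out=""
  ch=0
  while [ "$ch" -le 31 ]; do
    if [ $(( (mask >> ch) & 1 )) -eq 1 ]; then
      out="$out $ch"
    fi
    ch=$(( ch + 1 ))
  done
  # Drop the leading space before printing.
  echo "${out# }"
}
```

`channels_in_mask 0x07fff800` prints channels 11 through 26, so the mask in that dataset does not restrict the channel choice within the 2.4 GHz band.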

antonio-amjr commented 2 months ago

Hi Olivier,

Ok, I'll be on standby for your results. I haven't spotted any problem in the logs so far, but I'll try to take another look.

OlivierGre commented 1 month ago

Hi Antonio,

Since I have one "otbr_start.sh" script that works and another that doesn't, I ran some experiments to find what causes the problem. I found that it is due to the channel number. If I use 25 (as in the old script), it works. If I use 15, 11, 17, or 20, it doesn't work. That's very strange. Do you have an idea what the problem could be?

Unfortunately, I don't have another end device. In fact, I do have a Nanoleaf light bulb. Is it possible to commission it with chip-tool? I have read the QR code to get the discriminator and the setup PIN code.

I don't think that the problem is due to the end device because, if I use a different Border Router (not the one in docker), I have no problem commissioning this device to a BR set to channel 15. So I'm wondering whether the problem could come from the nRF52840 RCP dongle. On Monday I will be able to try another one.
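For reference when comparing the working and failing channels, each IEEE 802.15.4 channel in the 2.4 GHz band maps to a center frequency of 2405 + 5 * (channel - 11) MHz. The sketch below only does that arithmetic; `channel_freq_mhz` is a hypothetical helper name, and the mapping by itself does not explain the failure.

```shell
# Center frequency in MHz of an IEEE 802.15.4 2.4 GHz channel (11..26):
# 2405 + 5 * (channel - 11).
channel_freq_mhz() {
  if [ "$1" -lt 11 ] || [ "$1" -gt 26 ]; then
    echo "channel must be between 11 and 26" >&2
    return 1
  fi
  echo $(( 2405 + 5 * ($1 - 11) ))
}
```

With this mapping, the working channel 25 is at 2475 MHz, while the failing channels 11, 15, 17 and 20 are at 2405, 2425, 2435 and 2450 MHz.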

antonio-amjr commented 1 month ago

Hey Olivier,

That's interesting. I'll try to experiment with other channels and verify whether any problem occurs. But I have no idea why so many channels would be conflicted or unresponsive. Yeah, it could be the dongle, so if you have access to another one it would be good to try.

From the docker logs I noted just some UDP messages failing to be sent at the end, but the configuration part doesn't seem to show an error, I think. Nothing that caught my eye at least.

About the Nanoleaf, I never tried it with TH, and mine unfortunately no longer hard resets (even with the 3s on, 1s off procedure), so I can't try from my side. But it could work with the info from the QR code. Let me know your results if you try it out.

antonio-amjr commented 1 month ago

@OlivierGre,

So I ran some tests with channels and I managed to use channels 11, 20 and 25 in sequence in my environment.

I attached the files from testing channel 20 for reference:

log_chiptool_pairing_ble_thread_output.txt
log_chiptool_unpair_ble_thread_output.txt
log_docker_otbr_chip_output.txt
log_otbr_start_output.txt

OlivierGre commented 1 month ago

Hi Antonio, Thank you for doing those tests. On my side, I have flashed a new nrf52840 RCP dongle (still with the version otbr_9185bda_nrf52840dongle_ncs_2_43.zip). I have the same problem. So the problem is not due to the RCP :(