Broadlink Component - Handling Of Communication Errors

Silvenga commented 4 years ago

The problem

It appears that the Broadlink component will not process commands after the remote device is marked as unavailable.

Environment

Home Assistant Core release with the issue: 0.112.0
Last working Home Assistant Core release (if known): Unknown.
Operating environment (OS/Container/Supervised/Core): Supervised
Integration causing this issue: homeassistant.components.broadlink
Link to integration documentation on our website: https://www.home-assistant.io/integrations/broadlink/

Problem-relevant `configuration.yaml`

remote:
- platform: broadlink
  host: 192.168.2.144
  mac: blah
  type: rm4_mini
  name: Servers Broadlink
sensor:
- platform: broadlink
  host: 192.168.2.144
  mac: blah
  type: rm4_mini
  name: Servers Broadlink
  scan_interval: 60
  monitored_conditions:
    - temperature
    - humidity

Traceback/Error logs

cat home-assistant.log | grep broad
2020-07-01 21:19:34 INFO (SyncWorker_36) [homeassistant.loader] Loaded broadlink from homeassistant.components.broadlink
2020-07-01 21:19:34 INFO (MainThread) [homeassistant.components.remote] Setting up remote.broadlink
2020-07-01 21:19:37 INFO (MainThread) [homeassistant.components.sensor] Setting up sensor.broadlink
2020-07-01 22:01:21 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-01 22:02:17 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-01 23:31:47 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-01 23:32:39 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 01:02:10 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 01:02:10 ERROR (MainThread) [homeassistant.components.broadlink.remote] Failed to send 'fan only/server ac': The device is offline
2020-07-02 02:32:25 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 02:33:20 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 05:23:55 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 05:24:51 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 09:10:22 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 09:11:15 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144

Additional information

Note that the sensor module still operates every 60 seconds, which suggests the device was accessible after 2020-07-02 01:02:10 (as can be seen by the WARNING logs, which is a known issue with minimal impact).
Restarting the device has no effect.
Restarting HA seems the only mitigation.
HA reports the device was unavailable since 2020-07-02 01:02:10, which suggests to me that being unavailable was the cause. Manually instructing HA to send commands has no impact when the device is unavailable.
According to the logs of the Wi-Fi access point, the Broadlink device disconnects and reconnects 20 times an hour. This appears to be normal (roaming from one access point to another).
This is reproducible on demand.
- Send command to verify connectivity.
- Disconnect Broadlink device from power.
- Send command (which will fail).
- Reconnect and allow Broadlink device to power up.
- Send command and note that the command is never attempted (note that no communication error is logged and the device is marked as unavailable).
- Restart HA and note that sending commands is now possible.

marcan commented 3 years ago

@litinoveweedle if echo is troublesome for you, you can try perl:

perl -e 'print "\0" x 38 . "\1" . "\0" x 9' | nc <...>
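For anyone who would rather avoid shell quoting altogether, the same 48-byte payload that the perl one-liner prints (38 zero bytes, one 0x01 byte, 9 zero bytes) can be built in Python. This is only a sketch: the function names are made up for illustration, and the destination host and port are left as parameters because the `nc` arguments are elided above. The `broadcast` switch is an assumption based on the `-b`/`-u` netcat flags mentioned elsewhere in this thread.

```python
import socket


def build_keepalive_payload() -> bytes:
    """Same bytes as the perl one-liner: 38 x 0x00, 1 x 0x01, 9 x 0x00."""
    return bytes(38) + b"\x01" + bytes(9)


def send_keepalive(host: str, port: int, broadcast: bool = False) -> int:
    """Send the payload once over UDP and return the number of bytes sent.

    host and port are supplied by the caller; nothing is assumed here.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        if broadcast:
            # Needed only when `host` is a broadcast address.
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        return sock.sendto(build_keepalive_payload(), (host, port))
```

Building the bytes in Python sidesteps the differences between `sh` and `bash` in how `echo` treats escape sequences.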

litinoveweedle commented 3 years ago

> Maybe you're using a different shell?

Sorry for the late reply, I had to do some work away from home. Yes, you are completely right, it was the default sh shell; with bash everything WORKS! Thank you so much for solving this issue. :-)

felipediel commented 3 years ago

There is a misunderstanding here. I don't have access to the logs. I don't know if my devices are reconnecting every 3 minutes. I am using a simple ISP router with poor interface. What I said is: I blocked outbound connections from these devices with a drop silent rule (the only option I have) and I don't miss any updates in Home Assistant. They are completely isolated and everything works fine. I can also ping them rock-solid for a long time:

--- 192.168.0.17 ping statistics ---
46033 packets transmitted, 45932 received, +24 errors, 0% packet loss, time 46104326ms
rtt min/avg/max/mdev = 1.352/22.035/826.071/83.565 ms, pipe 4

I am not saying the VLAN is the main problem. The root of the problem is definitely the successive reconnections when the device cannot reach the cloud. What I was trying to understand here is why my devices recover quickly and yours don't.

@litinoveweedle I recently fixed an issue involving Broadlink devices and VLANs. It was related to socket.bind(). This is why I am insisting on this. Perhaps we are binding the socket to the wrong interface when we try to reach an offline device. But if you tested and the VLAN is not the problem, great, thank you. I believe in you. Now we are cooperating again. Did you check the logs in Home Assistant? This is what I wanted you to test.

I know that testing can be boring at times. I usually do my own tests, but when I don't have the same tech as the person I'm helping, I have to ask. I don't like when I ask for a test and people write a thousand line text saying that I'm stupid and the test won't work. It is not meant to work, we are just gathering information. I usually do the logical thinking after the tests. So I think that's why we started off bad. I had a stressful week, so I'm sorry if I was rude at some point. I am just tired.

@marcan I was starting to find you annoying, but this is really great news. I think I underestimated you. Thank you, you did a great job and earned my respect. I am adding this to the list of things I will bring to Home Assistant :rocket:

litinoveweedle commented 3 years ago

> @litinoveweedle I recently fixed an issue involving Broadlink devices and VLANs. It was related to socket.bind(). This is why I am insisting on this. Perhaps we are binding the socket to the wrong interface when we try to reach an offline device. But if you tested and the VLAN is not the problem, great, thank you. I believe in you. Now we are cooperating again. Did you check the logs in Home Assistant? This is what I wanted you to test.

@felipediel I actually know about that bug, as it was affecting me as well. But it was not about VLANs (i.e. multiple dedicated LAN segments = L2 level of ISO/OSI model), but it was about wrongly selected IP of local HA interface for outgoing packets (i.e. multiple IP interfaces on the HA and/or multiple IP subnets on the same interface on HA = L3 level of ISO/OSI model)

I know this previous issue was solved in version 0.15.0 of the python_broadlink library, because since that version discovery of Broadlink devices works OK. But this problem was completely different, although I know it could be confusing.
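To see which local address the OS would pick as the source for packets to a given device (the L3 selection described above), one common trick is to `connect()` a UDP socket and read back its local address; no packet is transmitted. A minimal sketch with a hypothetical helper name, not code taken from python-broadlink:

```python
import socket


def local_ip_for(device_ip: str, port: int = 80) -> str:
    """Return the local IPv4 address the routing table selects for device_ip.

    connect() on a UDP socket only fixes the route; it sends nothing.
    The port value is arbitrary for this purpose.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.connect((device_ip, port))
        return sock.getsockname()[0]
```

On a host with several interfaces or several subnets, comparing this result with the address a socket was actually bound to is a quick way to spot a mismatch.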

Since I applied the fix with keep-alive packets discovered by @marcan, my Broadlink devices no longer disconnect/reconnect, and therefore there are no more error messages in the HA log.

So to summarize this issue:

- You added a log message to report network problems / loss of connection to the Broadlink device.
- I reported that with no internet access Broadlink devices restart periodically, triggering these error messages in the HA log.
- @marcan discovered keep-alive packets that fool a Broadlink device into thinking it is connected to the cloud even without any connection to the internet, and he provided a hack to generate these fake keep-alive packets from HA.
- I applied that fix in my HA, and now my Broadlink devices no longer restart, so there are no more error messages in the HA log.

Therefore I would like to ask you to accept the proposal in mjg59/python-broadlink#458: add code to the library that periodically generates keep-alive packets with the suggested payload. This would prevent Broadlink devices with no internet access from disconnecting/reconnecting/rebooting or whatever they do. :-) Thank you.
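The "periodically generate" part of that proposal could look roughly like the sketch below. It is not the library's implementation: `start_keepalive` is a hypothetical name, the actual sending is left to a caller-supplied function, and the interval is an assumption that would need to be comfortably shorter than the roughly 3-minute restart period reported in this thread.

```python
import threading
from typing import Callable


def start_keepalive(send: Callable[[], None], interval: float) -> threading.Event:
    """Call send() every `interval` seconds on a daemon thread.

    Returns an Event; setting it stops the loop.
    """
    stop = threading.Event()

    def loop() -> None:
        # Event.wait() returns False on timeout and True once stop is set.
        while not stop.wait(interval):
            send()

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

Usage would be along the lines of `stop = start_keepalive(my_send_function, 60.0)` and later `stop.set()`, where `my_send_function` and the 60-second value are placeholders.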

felipediel commented 3 years ago

I will do it. But could you please do that last test (blocked/single network/check for updates in Home Assistant)? Because if we also have a binding problem, I want to fix it.

marcan commented 3 years ago

Ok, things make more sense now. I actually had patched python-broadlink months back to fix the discovery issue with the incorrect binding, which as @litinoveweedle said is not a VLAN issue but just a general multiple interfaces issue (I was actually only using discovery on my laptop during configuration with the BroadlinkProv AP method, so I didn't have to use the Broadlink app, and my laptop doesn't use VLANs). The devices themselves only have one interface so this kind of problem does not apply.

@felipediel my devices recover quickly (2-3s) during reboots; I'm actually on a somewhat old HA/integration version so I'm not saying the current version can't work around it with retries and timeouts. But of course it's not ideal that the devices do this, which is why I wanted to find a way to solve the root cause. To me it seemed that you were saying that your devices weren't rebooting at all and that this somehow had to do with VLANs.

Even with retries and timeouts and such, I expect there will be impossible-to-fix race conditions (for example, if you send it a command and it acknowledges it, but then reboots and cuts off the IR transmission; not sure if this specific one can happen but this kind of edge case is quite likely), so it is much preferable to stop them from rebooting.

2 seconds every 3 minutes is about 1% packet loss; you seem to be getting about 0.25% packet loss, so maybe your devices recover faster than mine and this is why the problem affects you less than other people. Of course for some people the devices will recover more slowly due to their router/AP/DHCP server being slower, or the WiFi being more congested, or even maybe something like the Broadlinks scanning WiFi channels in order so that people on channel 1 get faster reconnects than people on channel 11 :-) (in fact, it pretty much needs to spend at least 100-200ms per channel during a channel scan to catch a beacon, so for 11 channels that's... 1-2 seconds!) Edit: and yeah, I'm on channel 11.
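The back-of-the-envelope numbers above can be checked quickly. The snippet below only redoes the arithmetic using figures already quoted in this thread (2 s out of every 3 minutes, the ping statistics of 46033 sent / 45932 received, and 11 channels at 100-200 ms each); the variable names are just for illustration.

```python
def percent(part: float, whole: float) -> float:
    """Express part/whole as a percentage."""
    return 100.0 * part / whole


# 2 s offline out of every 3 minutes (180 s)
reboot_loss = percent(2, 180)

# Ping statistics quoted earlier: 46033 transmitted, 45932 received
ping_loss = percent(46033 - 45932, 46033)

# Channel scan: 11 channels at 100-200 ms dwell time each
scan_min_s = 11 * 0.100
scan_max_s = 11 * 0.200

print(f"{reboot_loss:.2f}% {ping_loss:.2f}% {scan_min_s:.1f}-{scan_max_s:.1f} s")
# prints: 1.11% 0.22% 1.1-2.2 s
```

So the measured loss is roughly a fifth of what a 2-second outage every 3 minutes would produce, in line with the "about 0.25%" estimate.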

litinoveweedle commented 3 years ago

> I will do it. But could you please do that last test (blocked/single network/check for updates in Home Assistant)? Because if we also have a binding problem, I want to fix it.

I can test this only to some extent; please let me explain. I did test it on the simplest setup, a Wi-Fi router with one LAN segment (and it was OK), but I do not have a testbed HA installation, only my home one with several network interfaces (and therefore several IP subnets). I am therefore not able to test the simplest combination (the one most users will have): one LAN segment with one IP subnet containing both the Broadlink devices and HA itself. But I remember seeing in the thread that someone tested it and reported it was working.

If my complex network combination works (choosing the right bind interface and therefore the correct source IP address for communicating with the Broadlink device, plus keep-alive packets to prevent the Broadlink devices from rebooting), such a simple network/HA setup should IMHO work without any problem. But I have no easy way to prove it, except to find myself a spare Raspberry Pi and install another instance of HA. :-o

litinoveweedle commented 3 years ago

> 2 seconds every 3 minutes is about 1% packet loss; you seem to be getting about 0.25% packet loss, so maybe your devices recover faster than mine and this is why the problem affects you less than other people. Of course for some people the devices will recover more slowly due to their router/AP/DHCP server being slower, or the WiFi being more congested, or even maybe something like the Broadlinks scanning WiFi channels in order so that people on channel 1 get faster reconnects than people on channel 11 :-) (in fact, it pretty much needs to spend at least 100-200ms per channel during a channel scan to catch a beacon, so for 11 channels that's... 1-2 seconds!) Edit: and yeah, I'm on channel 11.

Wi-Fi re-association speed varies with many factors, including the minimum wireless rate (association is always done at this rate), beacon rate, network congestion, device performance, etc., not to mention control mechanisms if you have roaming capabilities enabled, DHCP server response, etc. So that is something that could explain the differences in measured Broadlink device availability. Also, with the default ping timeout of 4 sec and repetition of 1 sec on Windows, you get a measurement error higher than the period you are trying to measure (2-3 sec). :-)

felipediel commented 3 years ago

I KNEW IT! :bug:

There is a binding issue! I managed to reproduce the error by creating a subnetwork.

marcan commented 3 years ago

Curious, what exactly did you do?

It's kind of weird that a binding issue would somehow affect the rebooting problem, unless it's something like causing retries specifically to fail.

I'm quite confident that there's no VLAN related issue on the Broadlink side, but certainly having extra interfaces (regardless of whether they're VLANs or not) on the HA side could break some behavior there.

IDmedia commented 3 years ago

@marcan I noticed the same behaviour using an RM Pro+ on firmware 52. My provider uses some VLAN magic in order to separate TV and Internet, not sure if that screws it up... I haven't blocked the device from accessing the Internet.

I would love to try the netcat command you came up with, but seems like nc on Home Assistant (running the official image on a Pi4) does not support the -b flag. Skipping it (still using -u) hangs forever.. is there a way to run this as a shell_command automation as a temporary fix?

felipediel commented 3 years ago

@IDmedia You just found another bug I was looking for! Your remote entity is being turned off when the remote becomes available. This is strange, because I cannot reproduce this issue. Could you give me some tip to trigger this behavior?

IDmedia commented 3 years ago

@felipediel Would love to help, but there's nothing I'm doing to trigger this. It just disconnected like clockwork... 2 min past every hour. The only time it was on for a bit longer was when I tried rebooting it/moving it to another location in the house.

I haven't tested this device on another network yet, but I'm pretty certain that my ISP's setup/router is the issue somehow. I've had another Broadlink RM Pro+ (not sure of the fw right now) connected in another household and that has never disconnected. My colleague who has the same device has also never noticed this behaviour before. So the only variable I can see is my ISP/router. If you have anything you would like me to test/verify I would be happy to help.

marcan commented 3 years ago

@IDmedia I don't think VLANs on your provider side would affect anything either.

To be more precise, here's what I think is plausible and what isn't:

Not the problem:

- VLANs purely on the network device side (switches, routers, etc). If neither the HA host nor the devices are VLAN-aware then there's no way for this to have an effect.
- Further, VLANs affecting anything on WiFi, because VLAN tags aren't used on WiFi at all
- VLANs on the WAN side of a router

Possible issues

- Multiple interfaces (VLANs or not) on the HA server side can cause confusion with broadcast packet source IPs and destinations (this is the old binding problem we all know well)
- I could see multiple interfaces having an effect on unicast communications if there is a bug somewhere in the way the sockets are used, which causes initial packets to go out right but subsequent packets to be treated differently (this would be strictly on the HA side, as WiFi devices have a single interface)
- Multiple overlapped subnets on the same interface (VLAN or not). This isn't something you'd normally want to do, and introduces additional possible confusion with e.g. what subnet devices get from the DHCP server.
- I could see some kind of weird bug happen on the Broadlink side if multiple subnets are overlapped too, though it's not super likely.
- Multiple IPs on the HA host on the same interface and subnet. I'm actually doing this myself as of a week or so ago, not for particularly strong reasons. A priori there's a whole mess of deciding which source IP is used for packets, but in a typical setup it shouldn't matter.

Pretty much all of these possible issues are very easy to debug with proper packet dumps taken on the HA host though.

Edit: Oh yeah, and a problem I've had with a similar presentation: I use the Bind DNS server on my HA host forwarding a LAN zone to my router's DNS server (except the IoT subzone which it manages) and it absolutely hates using a private TLD without DNSSEC signing and requires some settings and a crontab periodically renewing an NTA; a Bind update recently caused some domains to stop resolving every 5 minutes. Unlikely to be related to the other issues in this thread, but just to throw another "stuff that can go wrong" example out there :-)

IDmedia commented 3 years ago

In my case the setup is very simple. I'm running on a Raspberry Pi 4 and it's connected directly to the router by ethernet and WIFI has not been configured in any way... If my understanding is correct that should rule out most of the "multiple interface" issues as long as it's not trying to use WIFI even though it's not connected(?).

I'll try looking into getting the netcat command to work as a temp fix for now, but maybe I'll have to make an addon for HA in order to get the -bu flag to work properly. When I've access to the device again I'll also try upgrading the FW to 57 (if possible) and see if that fixes the issue.

marcan commented 3 years ago

@IDmedia your offline periods being oddly aligned on the hour is somewhat suspicious, it makes me think you have a different problem from the reboots issue... It could be your WiFi router causing it, or something else.

Packet captures would do a lot for debugging these issues :)

felipediel commented 3 years ago

> @IDmedia You just found another bug I was looking for! Your remote entity is being turned off when the remote becomes available. This is strange, because I cannot reproduce this issue. Could you give me some tip to trigger this behavior?

Ah, wait a second, I know how this happened. Your remote entity was "off" all the time, that's why it goes from "unavailable" to "off".

Why it gets unavailable is something that deserves to be investigated. I'm pretty sure this is a different (but similar) problem. Note that it is every hour. There is a pattern in this madness.

IDmedia commented 3 years ago

I've tried upgrading to fw 57 using the Broadlink app and now it disconnects every hour at 38 minutes instead, but it's only unavailable for 1 minute instead of 5-6. Still annoying, but I didn't get any time to debug this further.

IDmedia commented 3 years ago

Small update. To my surprise I found out that I can change the IP/DNS now from the Home Assistant UI. I've tried changing my DNS from my ISP to 1.1.1.1 and for some strange reason it seems to be way more stable.

felipediel commented 3 years ago

Problem 1: Firmware design vs users expectations

The first problem is a design issue. When they made the firmware they didn't foresee that users might want to block the internet access intentionally. So the devices have no logic to handle icmp-admin-prohibited packets. When the device cannot reach the cloud, it "thinks" that something went wrong and tries to reconfigure the network in order to reach the cloud, even when the user's intention to block access is explicit.

This problem alone is not sufficient to make devices unavailable in Home Assistant, but the solution should definitely come from this point, after all, it is much easier to fight the initial cause than all its consequences. So you guys don't worry, I will implement the keep-alive mechanism on our side. This is out of discussion.

Problem 2: Bug in the recovery mechanism involving sub-networks

The recovery mechanism is not only a "reconnect to the WiFi" thing. The flow changes depending on the environment. I don't have access to the firmware, so I don't know exactly what is going on behind the scenes. All I know is: when there is a single network, the recovery is fast. When there are sub-networks, the devices take a long time to recover. The experiment is 100% reproducible.

Single network / Blocked -> Fast recovery / No errors

- Simple network topology. Single network at 192.168.0.1. ISP router's (Pace C6500) DHCP server in my case.
- Blocked outbound connections from the device on the firewall.
- The device recovers fast.
- No errors reported by the update coordinator in Home Assistant.

Multiple networks / Blocked -> Slow recovery / Errors

- Complex network topology. I kept the previous network and created another network at 192.168.31.1 with an AX3600 router.
- Blocked outbound connections from the device on the firewall again.
- The device takes a long time to recover.
- The update coordinator in Home Assistant reports errors.

So this is the "ghost" I was hunting all the time. I knew there was a binding issue. Today I checked socket by socket on our side and they are ok, which leads me to the conclusion that this is a firmware issue too. At first glance it looks like nonsense; after all, these devices have a single network interface to bind(), not that easy to miss the target, right?

Well, maybe hanging around with you guys is making me paranoid, but I'm starting to think that these devices have another network interface. Perhaps something related to the FastCon technology? Are these devices communicating with each other on another level? I don't have the tool to sniff this (not something a simple WiFi analyzer can do), so I leave this question open. If this interface indeed exists, maybe there is a bind() issue in the firmware, similar to what we had on our side in the past.

Edit: Or maybe they’re doing some sort of network scanning that delays recovery. The truth is we can't determine exactly what is going on without looking at the code, so I'll stop trying to guess. I am just sharing what I know so you can draw your own conclusions.

Problem 3: Aggressive recovery mechanism when there are multiple WiFi networks configured

This is another problem I found while I was testing the recovery mechanism. I provisioned my RM mini 3 with a new WiFi network I've created with the AX3600 in order to do some tests. After the tests I provisioned the device back with the old network. But somehow when I disabled the access to the cloud, the device reconnected to the AX3600 network again. And guess what? My RM pro did it too! I didn't tell my RM pro anything about that network. Did they share this information via Fastcon?

In this scenario, the network has changed, the IP addresses have changed, everything has changed when the devices failed to reach the cloud. So this is another example of how things can go wrong here.

Problem 4: Poor network conditions can trigger the recovery mechanism too

These are @IDmedia's conditions. He didn't block the devices on the firewall, but something was destabilizing his connection periodically, so the devices didn't receive "keep-alives" from the cloud and triggered the recovery mechanism. Now he picked another DNS and the problem has been remedied.

Conclusion

Our side is clean. The problem is the recovery mechanism (in the firmware). The best we can do is the non-invasive keep-alive mechanism proposed by @marcan. So I will just do it.

marcan commented 3 years ago

> The first problem is a design issue. When they made the firmware they didn't foresee that users might want to block the internet access intentionally. So the devices have no logic to handle icmp-admin-prohibited packets. When the device cannot reach the cloud, it "thinks" that something went wrong and tries to reconfigure the network in order to reach the cloud, even when the user's intention to block access is explicit.

Just to restate this somewhat, it's a watchdog timer. The purpose is to reboot the whole device if something goes wrong and the cloud connection is down for too long. This makes perfect sense for people who want to use that feature, and I'm sure fixed a bunch of problems those users had. Treating icmp-admin-prohibited as "go away" wouldn't be the most reliable way of fixing this for local users like us, because that's still a protocol feature that can be accidentally triggered due to misconfiguration/etc, even transiently. IMO what they should've done is have an outright persistent configuration setting (e.g. as part of the Wi-Fi association packet) that just turns off the cloud stuff - not just the watchdog, but the connection attempts too, to save power and Wi-Fi traffic. But then again we all know Broadlink isn't interested in people not using their app/cloud stuff... so we're on our own here.

(Not arguing with you, just giving my opinion on how I'd handle this; it won't happen anyway so it's kind of moot :-) ).

> The recovery mechanism is not only a "reconnect to the WiFi" thing. The flow changes depending on the environment. I don't have access to the firmware, so I don't know exactly what is going on behind the scenes. All I know is: when there is a single network, the recovery is fast. When there are sub-networks, the devices take a long time to recover. The experiment is 100% reproducible.

I hate to hit on this point again, but to the Broadlink, there is a single network here. My Broadlink devices are attached to an SSID that has a single IP subnet on it. As far as the devices are concerned it is not different from any other dumb router with a single SSID. The fact that there are VLANs on the wire behind the AP, or that other SSIDs are being broadcast too, is not something the device can differentiate. It is not different than two neighbors with disconnected, unrelated SSIDs.

What the Broadlink devices do is simple. They reboot. The same function gets called that also gets called when there is a fault or some other unrecoverable condition. It's the same thing as unplugging them and plugging them back in. I'm staring at the decompiled firmware here :-)

> Multiple networks / Blocked -> Slow recovery / Errors
>
> Complex network topology. I kept the previous network and created another network at 192.168.31.1 with an AX3600 router.

What do you mean by "created another network"? Created an isolated, standalone Wi-Fi network with a different SSID? Or bridged together two subnets into the same L2 network via Ethernet? In the latter case, did you have one or two DHCP servers?

If you have two IP subnets on the same network with separate DHCP servers, then of course devices are going to fight over which one they get an IP from, which is going to cause failures and timeouts. This is natural, it's not a device bug, that would just be a broken network setup.

> Edit: Or maybe they’re doing some sort of network scanning that delays recovery. The truth is we can't determine exactly what is going on without looking at the code, so I'll stop trying to guess. I am just sharing what I know so you can draw your own conclusions.

If you mean creating a separate isolated SSID that the device has no knowledge of and ought to ignore, then the only reason that would slow things down is radio congestion. But then we're back to this having nothing to do with binding or VLANs, it's just a general "too much stuff on the air makes Wi-Fi slow" issue, which applies regardless of whether it's the same person running two SSIDs, or just your neighbors. I mean, I run 2 SSIDs on 2.4GHz myself, but obviously my neighbors have some networks too :-)

> Problem 3: Aggressive recovery mechanism when there are multiple WiFi networks configured
>
> This is another problem I found while I was testing the recovery mechanism. I provisioned my RM mini 3 with a new WiFi network I've created with the AX3600 in order to do some tests. After the tests I provisioned the device back with the old network. But somehow when I disabled the access to the cloud, the device reconnected to the AX3600 network again. And guess what? My RM pro did it too! I didn't tell my RM pro anything about that network. Did they share this information via Fastcon?
>
> In this scenario, the network has changed, the IP addresses have changed, everything has changed when the devices failed to reach the cloud. So this is another example of how things can go wrong here.

This is surprising. I don't think they would share info via any back channel, and I would think they only support a single set of Wi-Fi configs, but maybe they can have multiple? In that case it would make sense that after a reboot, they would end up associating to a network at random, of all the configured ones.

I've never connected my units to a network that isn't the single IoT one they are supposed to be on, so this doesn't apply to me...

Edit: did you use SmartConfig to set up the new WiFi network, or AP mode? SmartConfig is a broadcast based solution, so it is possible that devices that you didn't intend to configure picked up the network details too...

> Problem 4: Poor network conditions can trigger the recovery mechanism too
>
> These are @IDmedia's conditions. He didn't block the devices on the firewall, but something was destabilizing his connection periodically, so the devices didn't receive "keep-alives" from the cloud and triggered the recovery mechanism. Now he picked another DNS and the problem has been remedied.

Certainly, this is possible. It could also be an unrelated issue though (not the devices rebooting, but the bad DNS somehow causing something else to make HA not be able to talk to the devices - bad DNS tends to have many weird effects). We'd need packet logs to be sure of what happened...

Really, this is why I keep hammering on packet logs - because it's very easy to end up drawing strange conclusions from just doing trial and error experiments and observing whether the devices are stable or not. But once you have a packet log, you know exactly what is going on.

> Conclusion
>
> Our side is clean. The problem is the recovery mechanism (in the firmware). The best we can do is the non-invasive keep-alive mechanism proposed by @marcan. So I will just do it.

There's actually an advantage to doing this. This way the watchdog timer benefits us too. In other words, if the devices crash or the Wi-Fi AP does something weird or connectivity breaks for some reason, the keep-alives will cease to arrive and the device will reboot... which is exactly what you want. We will be taking advantage of the watchdog mechanism to increase reliability and automatic failure recovery in HA setups too.
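To make the idea concrete, here is a toy model of such a watchdog. It is not the firmware's actual logic, which is not shown in this thread: the class name is hypothetical, and the 180-second timeout used in the example is only an assumption taken from the "every 3 minutes" reports above.

```python
import time
from typing import Callable


class CloudWatchdog:
    """Toy watchdog: signal a reboot if no keep-alive arrives in time."""

    def __init__(self, timeout: float, clock: Callable[[], float] = time.monotonic):
        self.timeout = timeout
        self.clock = clock  # injectable so the model can be tested without waiting
        self.last_seen = clock()
        self.reboots = 0

    def feed(self) -> None:
        """Record that a keep-alive was received."""
        self.last_seen = self.clock()

    def poll(self) -> bool:
        """Return True (and count a reboot) once the timeout has elapsed."""
        if self.clock() - self.last_seen >= self.timeout:
            self.reboots += 1
            self.last_seen = self.clock()  # the timer restarts after a reboot
            return True
        return False


# Example with a fake clock: a keep-alive at t=100 postpones the reboot.
now = [0.0]
wd = CloudWatchdog(180.0, clock=lambda: now[0])
now[0] = 100.0
wd.feed()
now[0] = 250.0
print(wd.poll())  # False: only 150 s since the last keep-alive
now[0] = 280.0
print(wd.poll())  # True: 180 s without a keep-alive
```

As long as something on the local network keeps calling the equivalent of `feed()`, the device stays up; if that source goes silent, the reboot still happens, which is the failure-recovery benefit described above.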

felipediel commented 3 years ago

> Just to restate this somewhat, it's a watchdog timer. The purpose is to reboot the whole device if something goes wrong and the cloud connection is down for too long. This makes perfect sense for people who want to use that feature, and I'm sure fixed a bunch of problems those users had.

Yes, I understand its necessity; without the watchdog timer the devices would lose connection to the cloud without realizing it. What I was suggesting is that they should have left a gap for us. But you know what? I think they did. This packet is ok. And now we have a local watchdog. Thinking about this, this is really cool.

Treating icmp-admin-prohibited as "go away" wouldn't be the most reliable way of fixing this for local users like us, because that's still a protocol feature that can be accidentally triggered due to misconfiguration/etc, even transiently. IMO what they should've done is have an outright persistent configuration setting (e.g. as part of the Wi-Fi association packet) that just turns off the cloud stuff - not just the watchdog, but the connection attempts too, to save power and Wi-Fi traffic. But then again we all know Broadlink isn't interested in people not using their app/cloud stuff... so we're on our own here.

(Not arguing with you, just giving my opinion on how I'd handle this; it won't happen anyway so it's kind of moot :-) ).

Can't disagree with this "turn off the cloud" button. And I agree that it won't happen (at least in the official app).

I hate to hit on this point again, but to the Broadlink, there is a single network here. My Broadlink devices are attached to an SSID that has a single IP subnet on it. As far as the devices are concerned, it is no different from any other dumb router with a single SSID. The fact that there are VLANs on the wire behind the AP, or that other SSIDs are being broadcast too, is not something the device can differentiate. It is no different from two neighbors with disconnected, unrelated SSIDs.

What the Broadlink devices do is simple. They reboot. The same function gets called that also gets called when there is a fault or some other unrecoverable condition. It's the same thing as unplugging them and plugging them back in. I'm staring at the decompiled firmware here :-)

This makes sense. But you said that your devices are rebooting every 3 minutes, and I can reproduce conditions in which mine do not. The blue LED doesn't blink. There is an "if statement" in this watchdog that we didn't know about. And I just found it.

So let me do a mea culpa first. The problem is not specific to VLANs. The word VLAN just got in the way here. I also think I rushed to say it was something related to bind(). I was coming from a similar issue, so I was a little suspicious, I had to double check socket by socket to see with my own eyes.

This week I got some tech and I did some experiments. And the problem is UPnP. This is why I was able to reproduce the issue by creating subnets. It was never about the subnets. It was about connecting a router to a UPnP router. The topology doesn't matter. UPnP + blocked devices is what triggers recovery. So this is the "bug in the matrix" I was looking for. If you disable UPnP, you won't need the keep-alive pills. But of course I'm not going to ask you to do so.

This is surprising. I don't think they would share info via any back channel, and I would think they only support a single set of Wi-Fi configs, but maybe they can have multiple? In that case it would make sense that after a reboot, they would end up associating to a network at random, of all the configured ones.

Edit: did you use SmartConfig to set up the new WiFi network, or AP mode? SmartConfig is a broadcast based solution, so it is possible that devices that you didn't intend to configure picked up the network details too...

Yes, this is what happened. I was starting to get worried. Still aggressive because there is no "forget" option in the app, so I had to factory reset the devices to get rid of the extra WiFi network.

I did my research and didn't find much about FastCon, only what we see in that video. But what I could understand is that it is a WiFi based technology with broadcast support, so I think that's what we've been using all the time hahahah

marcan commented 3 years ago

I don't have UPnP enabled anywhere :-)

(non-bold config tree entries are empty/nonexistent, and by default that means off for these services).

I did some grepping and didn't see any mention of FastCon in the RM3 firmware, so I get the feeling it isn't supported in these devices.

I've looked through the firmware codepath for the watchdog, and I didn't notice any branch that could stop it from triggering other than the aforementioned keepalive packet (which is how I found it).

The main network handling thread (including packet rx, reconnections, and the watchdog) is a big infinite while loop that ends like this:

The only thing that sets the app network status to 8 is receiving the keepalive packet, and that gets cleared by a different branch of the code earlier after another timeout.
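As a rough illustration of the behavior described here, the following is a toy model only, not the decompiled firmware. The class name, method names, and the 180-second timeout (taken from the "3-minute" figure mentioned in this thread) are assumptions made for the sketch.

```python
class WatchdogModel:
    """Toy model: fire when no keep-alive has been seen for `timeout` seconds."""

    def __init__(self, timeout=180.0, now=0.0):
        self.timeout = timeout          # assumed value, from the "3-minute" reboots
        self.last_keepalive = now
        self.reboots = 0

    def on_keepalive(self, now):
        """A keep-alive packet arrived: restart the countdown."""
        self.last_keepalive = now

    def tick(self, now):
        """Check the timer; return True if the watchdog fires at `now`."""
        if now - self.last_keepalive >= self.timeout:
            self.reboots += 1
            self.last_keepalive = now   # a reboot restarts the countdown
            return True
        return False


def simulate(keepalive_every=None, duration=600, step=10):
    """Run the model for `duration` seconds and return the reboot count."""
    wd = WatchdogModel()
    for now in range(0, duration + 1, step):
        if keepalive_every and now % keepalive_every == 0:
            wd.on_keepalive(now)
        wd.tick(now)
    return wd.reboots


print(simulate(keepalive_every=120))   # keep-alives arrive in time
print(simulate(keepalive_every=None))  # no keep-alives at all
```

In this model, a keep-alive every 120 seconds always lands inside the 180-second window, so the timer never fires. With no keep-alives it fires once per timeout period.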

felipediel commented 3 years ago

Are you looking at v57? What comes before this part? Is there a chance to have something like this?

if (!upnp) {
    continue;
}

Edit: or maybe something inside FUN_00102880(500)? Edit2: I think if (iVar6 != 8) may be if (upnp).

marcan commented 3 years ago

This is v57. There are no continue statements in the loop. This is the thread in charge of receiving packets from the network, so without this thread nothing would work, not even local comms. iVar6 comes from app_network_status_get and as I said that only gets set to 8 when a keepalive packet arrives.

alunarkid commented 3 years ago

Has this been solved?

IDmedia commented 3 years ago

No, there are still issues. It was working for me for a while, but now it's back to disconnecting every hour like clockwork.

marcan commented 3 years ago

We're clearly combining multiple issues here. The 3-minute restarts due to cloud connectivity should be fixed (at least I haven't seen any weirdness since I have the script running; haven't tried updating HA yet). The hourly glitches obviously have some other cause.
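One way to tell the two patterns apart might be to measure the gaps between "Disconnected from device" lines in home-assistant.log. The sketch below assumes the log format shown in the original report (a `YYYY-MM-DD HH:MM:SS` timestamp at the start of each line). The function name is made up for illustration.

```python
from datetime import datetime

MARKER = "Disconnected from device"


def disconnect_gaps(lines):
    """Return the seconds between consecutive disconnect log lines."""
    stamps = [
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in lines
        if MARKER in line
    ]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(stamps, stamps[1:])]


# Sample lines copied from the log in the original report.
sample = [
    "2020-07-01 22:01:21 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired",
    "2020-07-01 22:02:17 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144",
    "2020-07-01 23:31:47 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired",
    "2020-07-02 01:02:10 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired",
]

for gap in disconnect_gaps(sample):
    print(f"{gap / 60:.1f} minutes")
```

Gaps of around three minutes would point at the cloud watchdog. The sample lines from the original report are roughly 90 minutes apart, which is a different pattern.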

felipediel commented 3 years ago

The PR has not yet been merged into the library, but that's okay, we’re in no hurry. We’re still implementing device discovery here, and then we’ll be able to reuse a significant amount of code (the logic is very similar).

felipediel commented 3 years ago

@IDmedia Just out of curiosity, if you disable UPnP in your router settings, does the problem persist?

IDmedia commented 3 years ago

@IDmedia Just out of curiosity, if you disable UPnP in your router settings, does the problem persist?

Yes, I can confirm that this issue persists with AND without UPnP. I've even tried multiple RM Pro+ units I have lying around, and they act the same. No other device has any noticeable issue on the network. It's in my parents' home, so I'm limited by the ISP's router with barely any settings, and for the time being it's hard to devote time to debugging the issue further, unfortunately.

felipediel commented 3 years ago

Ok. Your problem looks a little different than what I can reproduce here, but I hope the update will fix it too.

Just in case you want to test or temporarily fix it:

Install this branch on your PC.

pip3 install git+https://github.com/felipediel/python-broadlink.git@keep-alive --upgrade

Run this script.
```
import broadlink as blk
import time

while True:
    blk.keep_alive("192.168.0.17", 80)  # Example device
    time.sleep(120)
```



Edit: Ah, about that on/off button, there was indeed a bug. We already fixed it.

IDmedia commented 3 years ago

For the time being I'm limited to remote access to the Raspberry Pi running Home Assistant. I intend to try the code once I'm able to, but in the meantime I have tried running the nc-command and running your code in a custom HA addon, but none of the solutions seem to work. The broadlink is still disconnecting every hour.

I noticed that busybox nc does not include the -b flag, but I've tried this without any success: echo -ne '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0' | nc -uv 192.168.1.80 80 -w 1

I also made these three files, put them into the addons/ directory and started the custom addon, but that also didn't seem to work. As far as I can tell this is the same as running your code above:

config.json:

{
  "name": "Broadlink Keep Alive",
  "version": "1",
  "slug": "broadlink_keep_alive",
  "description": "Keep Broadlink from resetting",
  "arch": ["armhf", "armv7", "aarch64", "amd64", "i386"],
  "startup": "application",
  "boot": "auto",
  "options": {},
  "schema": {}
}

Dockerfile:

ARG BUILD_FROM
FROM $BUILD_FROM

ENV LANG C.UTF-8

# Install requirements for add-on
RUN apk add --no-cache python3

# Copy data for add-on
WORKDIR /
COPY main.py /

CMD [ "python3", "-u", "main.py" ]

main.py:

import socket
from time import sleep

while True:
    print('Sending keep alive')

    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as conn:
        conn.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        packet = bytearray(0x30)
        packet[0x26] = 1
        conn.sendto(packet, ('192.168.1.80', 80)) #Static IP of my Broadlink
    sleep(120)
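For debugging, it may help to split packet construction from sending, so the payload can be checked on its own and send failures show up in the add-on log. This is a sketch built on the same 0x30-byte packet with byte 0x26 set to 1 as in main.py. The function names and the error handling are assumptions, not part of any library.

```python
import socket


def build_keepalive() -> bytes:
    """Build the 0x30-byte packet with byte 0x26 set to 1, as in main.py."""
    packet = bytearray(0x30)
    packet[0x26] = 1
    return bytes(packet)


def send_keepalive(host: str, port: int = 80, timeout: float = 1.0) -> bool:
    """Send one keep-alive datagram; return False if the send itself fails."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as conn:
            conn.settimeout(timeout)
            conn.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
            conn.sendto(build_keepalive(), (host, port))
        return True
    except OSError as err:
        # Typical causes: no route to the host, or the interface is down.
        print(f"keep-alive to {host}:{port} failed: {err}")
        return False
```

Calling `send_keepalive('192.168.1.80')` in a loop with `sleep(120)` would reproduce main.py. Note that UDP gives no confirmation of delivery, so a `True` result only means the datagram left the host. A packet capture is still needed to confirm the device received it.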

wipeout666 commented 3 years ago

EDIT: After a bit of search, it seems there's a branch for a "heartbeat" function: https://github.com/felipediel/home-assistant/tree/heartbeat. Is this something that fixes the problem described in this thread, namely the reboot issue when no internet is present for RM mini 3 ?

Any updates on this? I don't feel good keeping internet open on my two broadlink devices. If I enable parental control, they drop Wi-Fi (I assume they reboot) every x minutes. Can't remember the exact frequency, but frequent enough to drop commands that it should really process like TV, heatpump, stuff like that.

Thanks for your contribution!

felipediel commented 3 years ago

@wipeout666 The update is in the queue, but it will take a while. For now, you can use the Dockerfile written by IDmedia, it looks great.

@Silvenga The original issue of the topic is fixed and the unrelated discussion that took place here turned into a library update. Mind closing the issue?

wipeout666 commented 3 years ago

IDmedia's script unfortunately has no effect on the frequent reboots every 2 mins when internet is blocked on my Broadlink RM 2 devices.

Any ideas what to try next ?

felipediel commented 3 years ago

I have an idea. Let's try something crazy.

Step 1

>>> import broadlink as blk

>>> d = blk.hello('10.0.0.185')
>>> d.auth()
>>> d.decrypt(d.send_packet(0x68, b'')[0x38:])

What is the output?

wipeout666 commented 3 years ago

Same output

felipediel commented 3 years ago

I need the output from d.decrypt(d.send_packet(0x68, b'')[0x38:]). Forget about the keep alive pills for now, I want to try something different. Do it on the Python shell or add a print before if you are running it as a script:

print(d.decrypt(d.send_packet(0x68, b'')[0x38:]))

Please send me as text, I need to decode the result.

wipeout666 commented 3 years ago

You'll have to guide me there, I'm not too comfortable with Python.

I opened a python shell with "Python3", then I proceed to type, at the ">>>" prompt, the following: "import broadlink as blk". This proceeds to run the infinite loop of keepalive inside broadlink.py. I don't have the ability of entering more shell commands.

Is that the intent?

Breaking the infinite keepalive loop with ctrl+c breaks the script. When I continue with

d = blk.hello('10.0.0.185')

I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: module 'broadlink' has no attribute 'hello'

That indicates that I'm a newbie and I'm missing the obvious, but I can't figure it out ...

felipediel commented 3 years ago

You have an old version installed. I rewrote the code to make it compatible:

>>> import broadlink as blk

>>> devs = blk.discover(timeout=10)
>>> print(devs)

You should see a list with your devices now. Select the device you want by index:

>>> d = devs[0]  # Example device
>>> d.auth()
>>> print(d.decrypt(d.send_packet(0x68, b'')[0x38:]))

What is the output?

Edit: Forget about the keep alives. This is something different, remove the previous code. I just need to see the result so I can provide a possible fix.

wipeout666 commented 3 years ago

I was running this on another raspberry that was running Raspbian and not hassio. Based on your comment "You have an old version installed", I realized this was meant to be run within HA.

I run this in the Terminal, but I guess that's not where it should be run ?

felipediel commented 3 years ago

If you want to do it like this you need to execute the interactive bash on the container first.

docker exec -it homeassistant /bin/bash
python3

You could also install Python 3 on your computer and install the library with pip3 install broadlink. This is what I would do.

wipeout666 commented 3 years ago

Getting there .... !

I have two RMs, 10.0.0.185 is the one I want to test on, so I assume the index is [1], correct ?
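The order of the discovered list is not something to rely on, so rather than guessing the index it may be safer to select the device by IP address. This is a minimal sketch, assuming each discovered device exposes `host` as an `(ip, port)` tuple. The helper name and the stand-in devices below are made up for illustration.

```python
from types import SimpleNamespace


def find_by_ip(devices, ip):
    """Return (index, device) for the first device whose host IP matches."""
    for index, dev in enumerate(devices):
        if dev.host[0] == ip:
            return index, dev
    raise LookupError(f"no discovered device at {ip}")


# Stand-ins for discovered devices; real ones would come from blk.discover().
fake_devs = [
    SimpleNamespace(host=("10.0.0.184", 80)),  # hypothetical second RM
    SimpleNamespace(host=("10.0.0.185", 80)),
]

index, dev = find_by_ip(fake_devs, "10.0.0.185")
print(index, dev.host)
```

With the real list this would be `index, d = find_by_ip(devs, '10.0.0.185')`, followed by `d.auth()` as before.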

wipeout666 commented 3 years ago

Also tried with the first code you posted, since I installed the latest broadlink module.

Output is the same, empty:

felipediel commented 3 years ago

There is an error in the response probably. Let's see what is going on:


>>> import broadlink as blk
>>> d = blk.hello('10.0.0.185')
>>> d.auth()
>>> r = d.send_packet(0x68, b'')
>>> blk.exceptions.check_error(r[0x22:0x24])
>>> print(d.decrypt(r[0x38:]))
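If check_error raises, the raw error code can also be read directly so it can be pasted here as text. This is a sketch only: it reads the two bytes at 0x22:0x24 as a little-endian signed 16-bit integer. The byte order and signedness are assumptions, and the helper name is hypothetical.

```python
def response_error_code(response: bytes) -> int:
    """Read the two-byte error field at offset 0x22 of a response packet."""
    if len(response) < 0x24:
        raise ValueError("response too short to contain an error field")
    # Assumption: little-endian, signed 16-bit.
    return int.from_bytes(response[0x22:0x24], "little", signed=True)


# Stand-in responses for illustration; a real one would come from d.send_packet().
ok_response = bytes(0x38)
bad_response = bytearray(0x38)
bad_response[0x22:0x24] = b"\xff\xff"

print(response_error_code(ok_response))
print(response_error_code(bytes(bad_response)))
```

In the session above this would be `print(response_error_code(r))`, run before the check_error line.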

wipeout666 commented 3 years ago

felipediel commented 3 years ago

It looks like they removed this command in the new firmware :( If you remove the device from the official app, do the reboots persist?

wipeout666 commented 3 years ago

The official app was never installed with these devices. Straight to Home Assistant when I got them.

felipediel commented 3 years ago

How did you provision?
