Broadlink Component - Handling Of Communication Errors

Silvenga commented 4 years ago

The problem

It appears that the Broadlink component will not process commands after the remote device is marked as unavailable.

Environment

Home Assistant Core release with the issue: 0.112.0
Last working Home Assistant Core release (if known): Unknown.
Operating environment (OS/Container/Supervised/Core): Supervised
Integration causing this issue: homeassistant.components.broadlink
Link to integration documentation on our website: https://www.home-assistant.io/integrations/broadlink/

Problem-relevant `configuration.yaml`

remote:
- platform: broadlink
  host: 192.168.2.144
  mac: blah
  type: rm4_mini
  name: Servers Broadlink
sensor:
- platform: broadlink
  host: 192.168.2.144
  mac: blah
  type: rm4_mini
  name: Servers Broadlink
  scan_interval: 60
  monitored_conditions:
    - temperature
    - humidity

Traceback/Error logs

cat home-assistant.log | grep broad
2020-07-01 21:19:34 INFO (SyncWorker_36) [homeassistant.loader] Loaded broadlink from homeassistant.components.broadlink
2020-07-01 21:19:34 INFO (MainThread) [homeassistant.components.remote] Setting up remote.broadlink
2020-07-01 21:19:37 INFO (MainThread) [homeassistant.components.sensor] Setting up sensor.broadlink
2020-07-01 22:01:21 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-01 22:02:17 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-01 23:31:47 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-01 23:32:39 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 01:02:10 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 01:02:10 ERROR (MainThread) [homeassistant.components.broadlink.remote] Failed to send 'fan only/server ac': The device is offline
2020-07-02 02:32:25 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 02:33:20 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 05:23:55 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 05:24:51 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144
2020-07-02 09:10:22 WARNING (MainThread) [homeassistant.components.broadlink.device] Disconnected from device at 192.168.2.144: Control key is expired
2020-07-02 09:11:15 WARNING (MainThread) [homeassistant.components.broadlink.device] Connected to device at 192.168.2.144

Additional information

Note that the sensor module still operates every 60 seconds, which suggests the device was accessible after 2020-07-02 01:02:10 (as can be seen by the WARNING logs, which is a known issue with minimal impact).
Restarting the device has no effect.
Restarting HA seems to be the only mitigation.
HA reports the device was unavailable since 2020-07-02 01:02:10, which suggests to me that being unavailable was the cause. Manually instructing HA to send commands has no impact when the device is unavailable.
Checking the logs of the Wifi Access Point, the broadlink disconnects and reconnects 20 times an hour. This appears to be normal (roaming from one access point to another).
This is reproducible on demand.
- Send command to verify connectivity.
- Disconnect Broadlink device from power.
- Send command (which will fail).
- Reconnect and allow Broadlink device to power up.
- Send command and note that the command is never attempted (note that no communication error is logged and the device is marked as unavailable).
- Restart HA and note that sending commands is now possible.
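
The symptom in the last steps can be pictured with a small Python model. This is a hypothetical sketch, not the integration's code: `RemoteModel`, `transport`, and `gate_on_availability` are made-up names. With gating enabled, a command sent after one failure is skipped without any attempt, which matches what the steps above show. Without gating, the next send is attempted and availability recovers.

```python
class RemoteModel:
    """Hypothetical model of a remote entity; not Home Assistant code."""

    def __init__(self, transport, gate_on_availability):
        self.transport = transport        # callable(command); raises OSError when offline
        self.gate = gate_on_availability  # True mimics the reported behavior
        self.available = True
        self.attempts = 0

    def send(self, command):
        if self.gate and not self.available:
            return False                  # skipped: no attempt, no communication error
        self.attempts += 1
        try:
            self.transport(command)
        except OSError:
            self.available = False        # a failed send marks the device unavailable
            return False
        self.available = True             # a successful send restores availability
        return True
```

In the gated variant, only an external event (here, a restart that recreates the object) resets `available`, which would be consistent with the observation that restarting HA is the only mitigation.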

felipediel commented 3 years ago

That's ok 😄

litinoveweedle commented 3 years ago

Both RM mini 3 units are now upgraded to v57; for the RM pro+ there is no firmware newer than the original v43. Unfortunately, after 5 minutes with the firewall rule blocking internet access, they are disconnecting from Wi-Fi again.

Are you really sure that your firewall is correctly set to block these connections? If it is a stateful firewall, then to be really sure the conntrack table has to be purged (the easiest way is to reboot the router) after the rule is changed...

felipediel commented 3 years ago

Yes. They are 100% isolated. No errors.

litinoveweedle commented 3 years ago

I presume you used the "ihc" Android app to connect the RM to the network? Could you please check the status of the devices under "Device Info"?

I have in mine:

....
Access to cloud: Load Failed. Tap to retry
...
Connection status: Local online
Data cloud: Load failed. Tap to retry
SDK: Load failed. Tap to retry

felipediel commented 3 years ago

Mine looks just like yours. Completely isolated. No errors.

litinoveweedle commented 3 years ago

Ok, I have no other ideas.... :-o From yesterday:

Logger: homeassistant.components.broadlink.updater
Source: helpers/update_coordinator.py:165
Integration: Broadlink (documentation, issues)
First occurred: October 21, 2020, 4:22:57 AM (57 occurrences)
Last logged: 12:45:38 AM

    Error fetching device data: The device is offline
    Error fetching device data: [Errno -4000] Network timeout

felipediel commented 3 years ago

Please show me your firewall rule for rejecting packets. Does it work if you add reject-with=icmp-admin-prohibited?

litinoveweedle commented 3 years ago

The results are the same with both Drop (silent) and Reject, but for Reject I use ICMP type 'network unreachable' (which is normally the default). I will test ICMP type 'admin prohibited' as well.

litinoveweedle commented 3 years ago

Firewall rule applied, RMs restarted, no change - still disconnecting, except now the interval has shortened to exactly 3 minutes!

The firewall rule (Mikrotik) itself is pretty simple:

chain=forward action=reject reject-with=icmp-admin-prohibited src-address=192.168.3.0/24 out-interface-list=inet log=no log-prefix=""

Can your RMs use your local DNS server (probably on your router) for name resolution, or is that connection blocked as well?

felipediel commented 3 years ago

What about reject-with=icmp-net-prohibited? I'm starting to run out of ideas too. A simple drop should work. That's what I'm using.

litinoveweedle commented 3 years ago

Ok, the 3 vs 5 minute period depends on the firmware. With v57 the RM mini 3 units are now at 3 minutes, while the RM pro+ keeps resetting every 5 minutes. So with the newer firmware I actually made it worse.

Do you see dropped connections from the RM to the internet in your firewall? Could you log them to see if it tries to connect?

felipediel commented 3 years ago

Nope. My firewall is really simple, I don't have all these options. But I just realized something. I am filtering outbound connections. You are filtering inbound connections. Try replacing src-address=192.168.3.0/24 with dst-address=192.168.3.0/24.

Edit: I know this is not the same if you are concerned about privacy. But there is nothing we can do in our code base. We can stop the error messages by creating a "no polling/assumed state" option. I will do it. But the real problem will persist.

This is a firmware issue. You are using the device in a way that they did not foresee. If you want to solve this problem for real, the only way out is their support team.

litinoveweedle commented 3 years ago

But I just realized something. I am filtering outbound connections. You are filtering inbound connections. Try replacing src-address=192.168.3.0/24 with dst-address=192.168.3.0/24

Nope, I am filtering outgoing (outbound) connections from the RMs to the internet as well. Please check my rule:

chain=forward action=reject reject-with=icmp-admin-prohibited src-address=192.168.3.0/24 out-interface-list=inet log=no log-prefix=""

This means DROP packets FROM (src-address) 192.168.3.0/24 (where the RMs are) TO (out-interface) the Internet. Also, with a stateful firewall, blocking the inbound or the outbound side of a TCP connection will not matter, as there is no difference between silently dropping the in or the out packets. On TCP the device will simply not even establish a session, as in both cases the ACK will never arrive - so the device has no chance at all to know whether you blocked the outbound connection (dropped SYN), so that no inbound ACK was received, or permitted the outgoing SYN and dropped the inbound ACK. ;-)

This is a firmware issue. You are using the device in a way that they did not foresee. If you want to solve this problem for real, the only way out is their support team.

Well, using it offline is very often what people with HA or OpenHab prefer or intend - for the same reasons as me. But I understand your point (that this seems to be a firmware issue), and I was afraid the whole time that I would have to bring this to Broadlink support - I expect the worst debate possible, as it is very often impossible to get the same level of understanding as, for example, here. :-)

But I am still stunned why with same FW you are simply not having this issue. There has to be some external factor we missed. Identifying it would significantly increase our chances for this issue being solved by Broadlink support.

felipediel commented 3 years ago

The only thing left I can imagine at this point is your subnet. Perhaps they are binding the socket to the wrong interface.

About the rule (change inbound to outbound), I understand your logic, but we are trying to find something that is not logical. So please, don't rule out the possibility without trying.

litinoveweedle commented 3 years ago

Please check your previous statement:

I am filtering outbound connections.

I have been doing that the whole time! So that we have the same understanding, which direction are you referring to? Please check my original rule and compare it to yours:

chain=forward action=reject reject-with=icmp-admin-prohibited src-address=192.168.3.0/24 out-interface-list=inet log=no log-prefix=""

This means DROP packets FROM (src-address) 192.168.3.0/24 (where the RMs are) TO (out-interface) the Internet. Do you have the SAME rule or the opposite one? Mine is an outbound rule (to the internet), but you are referring to it as an inbound rule. There seems to be some confusion here.

felipediel commented 3 years ago

I mean the opposite. Change the rule to filter outbound connections. So instead of src-address=192.168.3.0/24 you use dst-address=192.168.3.0/24.

litinoveweedle commented 3 years ago

I mean the opposite. Change the rule to filter outbound connections. So instead of src-address=192.168.3.0/24 you use dst-address=192.168.3.0/24.

What you say doesn't make any sense, I am sorry, please check your statement!

Outbound is FROM RMs -> SRC is 192.168.3.0/24
Inbound is TO RMs -> DST is 192.168.3.0/24

I CAN'T change src-address=192.168.3.0/24 to dst-address=192.168.3.0/24 to make it OUTBOUND! It will be INBOUND if I make such change!

I think we are confused about the direction we are referring to when speaking about INBOUND vs OUTBOUND. (Is it TO the internet or TO the devices?)

felipediel commented 3 years ago

Just change the rule 🤣

felipediel commented 3 years ago

Now even I got confused HAHAHAH

I am not a native speaker, so maybe I expressed myself wrong. I mean drop packets coming from the outside world.

litinoveweedle commented 3 years ago

I am sorry, but what you are proposing MAKES no sense, for a simple reason:

- your router has a simple stateful firewall.
- if you try to block incoming NEW connections from the Internet, it will STILL allow the RM devices to create and establish NEW connections TO the Internet.

Just to be sure, I tested your rule. It helps for a simple reason - the devices WILL CONNECT TO THE CLOUD! Your rule is wrong, as it will simply NEVER block ANY connection which is started by the RM. (And yes, that means that my logic here was wrong.) The reason is simple:

Any stateful firewall (including any simple home router) will allow RELATED packets to pass through. So if you allow the RM to start/open a TCP connection TO the Internet, then ANY packet related to this connection will PASS regardless of your rule for the direction from the Internet. The rule itself will only block connections originating from the Internet, which will never reach your RM anyway, at least not without port mapping (as your devices are behind NAT).

litinoveweedle commented 3 years ago

Now even I got confused HAHAHAH

I am not a native, so maybe I expressed myself wrong. I mean drop packets coming from the outside world.

And that is where you have an issue! You are not dropping packets in a stateful firewall, you are allowing/disallowing NEW connections.

What you think you are doing

RM device sent TCP SYN -> Broadlink cloud - PASS
Broadlink cloud sent TCP ACK -> RM device - DROPPED by your rules

What a stateful (any simple Linux-based) home router does in reality

RM device sent TCP SYN -> Broadlink cloud - PASS
Broadlink cloud sent TCP ACK -> RM device - PASS as it is related to an allowed connection

The reason is that the packet will never reach your rule to be DROPPED, as any stateful firewall has a rule to allow RELATED packets as the first one in the line. This is the order of the rules in my router, but you have the same in yours, although you do not see them:

................

12    chain=forward action=fasttrack-connection connection-state=established,related log=no log-prefix="" 

13    chain=forward action=accept connection-state=established,related log=no log-prefix="" 

.................

23    chain=forward action=drop src-address=192.168.3.0/24 out-interface-list=inet log=no log-prefix=""

If you change the rule 23 to

23 chain=forward action=drop dst-address=192.168.3.0/24 in-interface-list=inet log=no log-prefix=""

then any connection which will be originated by any device with IP in subnet 192.168.3.0/24 will be allowed!
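
The rule ordering above can be sketched as a tiny Python model. This is an illustration under simplifying assumptions, not how RouterOS is implemented: `Packet`, `forward`, `block_outbound`, `block_inbound`, and the default accept policy are hypothetical, and connection tracking is reduced to a `state` field. It shows why a drop rule matching `dst-address` never sees the reply to a connection the RM opened itself.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

LAN = ip_network("192.168.3.0/24")  # the RM subnet from the thread


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    state: str  # "new" or "established", as connection tracking would classify it


def is_lan(addr):
    return ip_address(addr) in LAN


def block_outbound(pkt):
    """Rule 23 as written: match packets from the LAN going to the internet."""
    return is_lan(pkt.src) and not is_lan(pkt.dst)


def block_inbound(pkt):
    """The proposed variant: match packets from the internet going to the LAN."""
    return is_lan(pkt.dst) and not is_lan(pkt.src)


def forward(pkt, drop_rule):
    # Rules 12/13: established/related traffic is accepted before any drop rule.
    if pkt.state == "established":
        return "accept"
    if drop_rule(pkt):
        return "drop"
    return "accept"  # assumed default policy for this model


def handshake(drop_rule):
    """Return the verdicts for the RM's SYN and, if it passes, the cloud's reply."""
    syn = Packet("192.168.3.89", "18.197.38.168", "new")
    verdicts = [forward(syn, drop_rule)]
    if verdicts[0] == "accept":
        # The reply belongs to a tracked connection, so it counts as established.
        reply = Packet("18.197.38.168", "192.168.3.89", "established")
        verdicts.append(forward(reply, drop_rule))
    return verdicts


print(handshake(block_outbound))  # ['drop']
print(handshake(block_inbound))   # ['accept', 'accept']
```

With the `src-address` rule the SYN is dropped and nothing else happens. With the `dst-address` variant the SYN passes as a new connection and the reply is accepted by the established/related rule before the drop rule is ever evaluated.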

felipediel commented 3 years ago

I can't see the rule in mine, I can only see the user interface. I am trying to translate the interface to rules that you can apply to yours, but I think I got confused with the term outbound (the only option I have in mine). So my configuration looks just like yours was before. I am filtering outbound connections and my devices are completely isolated from the internet.

So the only possibility now is your subnetwork. I think the device is binding the socket to the wrong interface. Try connecting these devices to 192.168.0.x.

litinoveweedle commented 3 years ago

So the only possibility now is your subnetwork. I think the device is binding the socket to the wrong interface. Try connecting these devices to 192.168.0.x

I do not even have a 192.168.0.0 network. The RMs are given IP addresses from 192.168.3.0/24 by DHCP. And I see connections from the RMs with these source addresses in my log:

12:21:45 firewall,info forward: in:vlan30 out:ether6, src-mac cX:XX:XX:XX:XX:X1, proto TCP (SYN), 192.168.3.89:49157->18.197.38.168:80, len 44 
12:21:45 firewall,info forward: in:vlan30 out:ether6, src-mac cX:XX:XX:XX:XX:Xa, proto TCP (SYN), 192.168.3.92:49157->18.197.38.168:80, len 44

(These are packets/connections logged as being dropped.) So the devices are communicating with the correct IPs. (In fact I am using VLANs to physically separate my smart devices, not just a different IP subnet in the same Ethernet segment like you.)

felipediel commented 3 years ago

The interface was correct when the device sent these packets. The problem happens after the packets are dropped, so it won't appear in the logs.

I think the problem is related to your VLANs. This is the only difference and @callifo is having the same problem.

marcan commented 3 years ago

FWIW, I'm seeing reconnects every 3 minutes here too, on two RM3 minis. I suspected the "I can't talk to the cloud so I'll restart" cause too... I'm now trying things to see if I can convince it to give up.

So far:

The RM3 tries to use the DNS server provided via DHCP, or, if none, OpenDNS
It tries to resolve 10039activ.ibroadlink.com.
If resolution times out, it reassociates after 3 minutes
If it gets an NXDOMAIN, it continues retrying every 10 seconds and restarts after 3 minutes.
If it gets 127.0.0.1, it continues retrying too.
If I give it a real IP I control, it tries to connect on port 80
If that connection times out (TCP port not open), it retries, sending SYN packets for 20 seconds, then retrying the whole thing before reconnecting to the WiFi again after 3 minutes.
If I reply with ICMP port unreachable, it just ignores those replies. It does not implement ICMP properly. Same as dropping the packets.
If I go with TCP RST rejections, it retries much more frequently, more than once a second. Still gives up and reassociates after 3m.
If I set up an HTTP server that just returns 404s, it still retries more than once a second. Also, it does not use the domain for the Host header, but the IP, so you need to set the IP as the vhost name if your web server serves multiple domains.
Returning empty 200s also does not work.

At this point I'm going to have to let them talk to their cloud service to see what they actually want, but it's clear that none of the obvious blocking solutions are working here.

litinoveweedle commented 3 years ago

@marcan Thank you for the confirmation :-) In addition, after the TCP connection is established, it seems that the RMs continue to maintain communication only over UDP. In my case (EU) the cloud seems to be hosted on Amazon EC2, so the target IPs are different.

@felipediel regarding the VLAN, it makes no sense - a VLAN is only a separate Ethernet segment - think of it as a dedicated switch connected to your router - on the L2 level only, nothing to do with >= L3.

Edit: I have raised a ticket with Broadlink via email and also tried to call them. It was a pathetic call. :-p

felipediel commented 3 years ago

@litinoveweedle All users who are experiencing this problem are using VLANs. So it makes a lot of sense to me. We need to believe what our eyes are seeing in order to understand what is going on.

Each VLAN has its own special broadcast address. If the device sends messages to 255.255.255.255 it gets no response. In your case, the message should be sent to 192.168.3.255. Perhaps this is what they are doing wrong in their recovery logic.

@marcan Thanks for the detailed information. The obvious solution is implementing an additional logic to handle ICMP packets. If the device gets reject-with=icmp-admin-prohibited, the user blocked the connection intentionally, so the device should skip the reattempts. This can only be done in the firmware. Good luck with their support, I hope they fix it soon.

litinoveweedle commented 3 years ago

@litinoveweedle All users who are experiencing this problem are using VLANs. So it makes a lot of sense to me. We need to believe what our eyes are seeing in order to understand what is going on.

Each VLAN has its own special broadcast address. If the device sends messages to 255.255.255.255 it gets no response. In your case, the message should be sent to 192.168.3.255. Perhaps this is what they are doing wrong in their recovery logic.

I am sorry, but this simply doesn't work as you describe. It actually works as if you had only one Ethernet segment with only one IP subnet. So if I had the simplest network situation - for example a Wi-Fi router with just one subnet, 192.168.3.0/24, on the LAN - then the result would be exactly the same. A Broadlink device has no way to identify whether VLANs are used to isolate a given Ethernet segment or whether it is done by a dedicated hardware device (the switch/bridge located in your Wi-Fi router). Also, a Broadlink device has no way of knowing whether there are any other Ethernet segments or IP subnets besides the one it is in. Please look up how 802.1Q and network segmentation work if you do not believe me.
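
A minimal sketch of that point, using Python's standard `ipaddress` module: the directed broadcast address follows from the host's IP address and prefix length alone, and no VLAN ID enters the calculation. The function name `broadcast_for` is made up for illustration, and the host address is taken from the firewall log earlier in the thread.

```python
from ipaddress import ip_interface


def broadcast_for(cidr):
    """Return the directed broadcast address for a host address in CIDR notation."""
    return str(ip_interface(cidr).network.broadcast_address)


print(broadcast_for("192.168.3.89/24"))  # 192.168.3.255
```

The same input gives the same answer whether the segment behind the access point is a tagged VLAN or a plain switch, since that detail is not visible to the host.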

Your logic has a gap. Paraphrasing it: people believe in ghosts because they think they see them. This doesn't mean ghosts exist and that we should try to understand them. ;-)

Honestly, I think it is just a coincidence, as it is mostly power users who have a segmented network with the intention of blocking given segments from communicating with the network. ;-)

felipediel commented 3 years ago

If there is no ghost, why are you asking me for help? The logic works exactly as I described. Each VLAN has its own special broadcast address. This is networking 101. If the device binds the socket to the wrong interface, any message sent to 255.255.255.255 will get no response. I am 99% sure that's the problem. Just copy and paste this message to their support team and they will know what to do.

marcan commented 3 years ago

@felipediel he's right, please stop making stuff up about VLANs. VLANs behave the same as any normal isolated Ethernet network. The only correlation here is that there is a big overlap between the kind of geek paranoid enough to firewall IoT devices from the Internet and the kind of geek who happens to know about VLANs, and they are an obvious solution to this problem. Yes, I use VLANs too, and I am absolutely confident they have nothing whatsoever to do with this problem. Networking is one of my jobs, I know what I'm doing here.

As far as the devices involved are concerned, the Broadlink devices and one Ethernet (sub)interface on my Home Assistant server (which also handles DHCP/DNS/routing duties for this segment) are on the same isolated network segment, and the fact that VLANs are involved is completely irrelevant to them.

marcan commented 3 years ago

If we're playing the "simplest explanation" game... since apparently all of us are having this issue except @felipediel, my Occam's Razor diagnosis is that his firewall might not be set up properly and he is, in fact, letting them talk to the broadlink cloud :-)

It's clear these things really want to talk to the Internet; @felipediel if you truly believe it works fine for you and they don't reconnect, then what we need to move forward is a complete packet log of a broadlink on wifi, from startup through 5-6 minutes, to see what it is that you're doing that the rest of us aren't that convinces it to not drop off. I've already tried everything I could think of (and have been looking at tcpdump as I did to prove I was doing what I thought I was).

Barring that, there's two things to be done here:

Make sure HA's integration has long enough cmd/response timeouts to ride out one of the reconnect phases without more impact than a delay
Reverse engineer the firmware to figure out if there is some explicit codepath to trigger it to stop doing this.
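
The first item can be sketched generically. This is a minimal illustration, assuming the send operation is wrapped in a zero-argument callable that raises `OSError` on a network failure; `send_with_retry`, `deadline_s`, and `delay_s` are hypothetical names and values, not options of the integration.

```python
import time


def send_with_retry(send, deadline_s=30.0, delay_s=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call send() until it succeeds or the deadline would be exceeded.

    The deadline should be longer than one reconnect phase of the device,
    so that a command issued mid-reconnect is delayed rather than lost.
    """
    start = clock()
    attempt = 0
    while True:
        attempt += 1
        try:
            return send()
        except OSError as err:
            # Give up only if waiting once more would pass the deadline.
            if clock() - start + delay_s > deadline_s:
                raise TimeoutError(f"gave up after {attempt} attempts") from err
            sleep(delay_s)
```

The `sleep` and `clock` parameters exist only so the helper can be exercised without real waiting. How long the deadline needs to be depends on how long a reconnect takes, which would have to be measured.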

marcan commented 3 years ago

The latest firmware can be fetched via:

https://fwversions.ibroadlink.com/getfwversion?devicetype=<device type in decimal>

E.g. for the RM3 minis I have:

https://fwversions.ibroadlink.com/getfwversion?devicetype=10039
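
For reference, a small sketch of building that URL when the device type is at hand as a hexadecimal value. The helper name `firmware_version_url` is made up, and nothing is assumed about the response format, so the example only builds the URL and does not fetch it.

```python
FW_VERSION_URL = "https://fwversions.ibroadlink.com/getfwversion?devicetype={devtype}"


def firmware_version_url(devtype):
    """Format the firmware lookup URL; the device type is rendered in decimal."""
    return FW_VERSION_URL.format(devtype=int(devtype))


# 0x2737 is 10039 in decimal, the value used in the example above.
print(firmware_version_url(0x2737))
# https://fwversions.ibroadlink.com/getfwversion?devicetype=10039
```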

The latest version is 57, but by scraping URLs I found versions 47, 48, 55, 56 are also available. v57 has this message:

[APP][%s:%d] cloud health timeout will reboot

But v56 does not. This suggests that a downgrade to v56 might fix the problem.

Silvenga commented 3 years ago

What's the issue here (lots of comments here)? Are we questioning whether the Broadlink disconnects from wifi, or just suggesting improvements in the offline detection?

I've always blocked these Broadlink devices from WAN access (physical address level, not the whole vlan). I also have always polled the devices (polls every 10 seconds, sending IR blasts several times an hour - it controls a portable AC unit with a temp sensor).

I do see the wifi disconnects, although not nearly on that time schedule (maybe once or twice every 10 minutes). I always kind of assumed it was roaming (it constantly connects to a non-optimal AP, then connects to the optimal one). It receives IPs from my local DNS server and tries to connect to a US-based AWS IP (I'm in the US).

marcan commented 3 years ago

@Silvenga it's not a roaming issue, the disconnects you're seeing are what we're seeing. They happen every 3-5 minutes.

The question is can we get them to stop doing that, and/or can we make HA robust against those resets.

felipediel commented 3 years ago

@felipediel he's right, please stop making stuff up about VLANs. VLANs behave the same as any normal isolated Ethernet network. The only correlation here is that there is a big overlap between the kind of geek paranoid enough to firewall IoT devices from the Internet and the kind of geek who happens to know about VLANs, and they are an obvious solution to this problem.

I already solved the issue. You just need to give their support team a link to this conversation. It's not that I'm stupid, I just don't have access to their firmware to fix it for you, got it?

Yes, I use VLANs too, and I am absolutely confident they have nothing whatsoever to do with this problem.

Now I am 100% sure this is the problem.

Networking is one of my jobs, I know what I'm doing here.

I respect your job, but we are never too old to learn something new.

As far as the devices involved are concerned, the Broadlink devices and one Ethernet (sub)interface on my Home Assistant server (which also handles DHCP/DNS/routing duties for this segment) are on the same isolated network segment, and the fact that VLANs are involved is completely irrelevant to them.

This is the expected behavior, but we are talking about a bug. They are binding the socket to the wrong interface.

felipediel commented 3 years ago

[Image: app] Blocked.

felipediel commented 3 years ago

[Image: history] No errors.

marcan commented 3 years ago

I already solved the issue. You just need to give their support team a link to this conversation. It's not that I'm stupid, I just don't have access to their firmware to fix it for you, got it?

You claim it works for you, yet it doesn't for us. We've already tried everything you suggested to make it work. The next step in figuring this out is for you to give us a known-good reference. That means a packet capture.

Yes, I use VLANs too, and I am absolutely confident they have nothing whatsoever to do with this problem.

Now I am 100% sure this is the problem.

Now you're just being unhelpful, and deliberately ignorant. We're telling you that's not how VLANs work. You can look it up if you want.

This is the expected behavior, but we are talking about a bug. They are binding the socket to the wrong interface.

There is no wrong interface. The device sees a single network. The device has one interface. The device does not have any idea what a VLAN is or what VLAN it's on, because all it sees is a single 802.11 WiFi network and it is the access point's job to deal with whatever is on the Ethernet wire behind it, be it plain Ethernet or 802.1q VLANs or an L2 tunnel over IP or anything else you might want to come up with. As far as the device is concerned it is on one network with one IP subnet and there is no confusion possible.

Asking support about VLANs isn't going to go anywhere, because VLANs are completely irrelevant to these devices. You have latched on to the idea that us using VLANs is the problem without understanding how VLANs work, and all you're doing now is derailing the conversation.

If you want to help us, please provide a full packet capture of everything your v57 device does on the network, so we can find out what to do to get it to stop rebooting itself after a cloud service timeout.

Silvenga commented 3 years ago

The question is can we get them to stop doing that, and/or can we make HA robust against those resets.

I see this as the standard "hardware is inherently unreliable" problem. I don't think we need to argue over if it's happening or why it's happening. It's going to happen as a function of being wifi based hardware.

We really need to figure out the scope of the problem, what it impacts, and figure out solutions.

@felipediel has spent a lot of effort and his time on this code, plus many weeks, going back and forth in reviews. I feel this thread has shifted to an argument, felipediel deserves respect at the very least, if not gratitude.

Silvenga commented 3 years ago

@felipediel I think you've fixed this issue effectively. I don't have issues, with at least my setup. So thanks a lot!

Maybe we should close this issue, and open a separate issue to gather more info on if the poll interval/method should be configured/changed?

felipediel commented 3 years ago

Thanks @Silvenga! I will create an options flow to configure polling in the future, so it will be easier for users to make adjustments without the need for a restart. After that, we can discuss what are the best values for each device and then we define better defaults.

marcan commented 3 years ago

@Silvenga People deserve credit for their work, and that is completely tangential to being told they are wrong when they are. I'm sure @felipediel has put a bunch of time into this integration, but he isn't being helpful right now by claiming the problem is something that makes no sense whatsoever.

That said, I've had improving the broadlink integration in my TODO list for a while now, in particular to specify a device-agnostic IR blasting mechanism to enable integration with complex-protocol/state-dump IR devices (e.g. aircons and my ceiling lights which work the same way), but having this kind of experience with the developers makes me lean towards just keeping it to myself rather than contributing...

@felipediel Look, I don't know what to say any more. VLANs don't have broadcast addresses. 802.1q VLANs are a way of putting multiple Ethernet networks into one physical cable. That is all they are. That is why they are called Virtual LANs. The only thing a VLAN does is make one cable behave like several separate cables. VLANs do not go over WiFi. Broadlink doesn't care about VLANs. Support doesn't care about VLANs. VLANs can't cause broadcast address confusion. Just, please, read up on the subject and drop the idea that we need to complain to Broadlink support about some broadcast address issue related to VLANs.

The problem we have is the devices reset every 3-5m when they can't hit the cloud. You claim yours does not. Please provide a packet dump if you are certain it is not doing that for you.

felipediel commented 3 years ago

@marcan I am not telling you to complain about VLANs. You can work around the problem by disabling your VLAN if you want.

I am asking you to ask them to do this:

The obvious solution is implementing an additional logic to handle ICMP packets. If the device gets reject-with=icmp-admin-prohibited, the user blocked the connection intentionally, so the device should skip the reattempts. This can only be done in the firmware. Good luck with their support, I hope they fix it soon.

This is a simple and universal solution.

marcan commented 3 years ago

@felipediel "Disabling" my VLAN isn't going to do anything because, as multiple people have told you several times, VLANs are completely transparent to WiFi devices and they can't know nor care whether VLANs are in use or not.

Having one access point with VLANs connected to a host is literally equivalent in every way to having two separate access points with no VLANs connected to two network cards on the host. The WiFi devices cannot tell the difference. That's how VLANs work. That's the whole point of VLANs.

There is literally no way, shape, or form, for the Broadlink devices to know they are on a VLAN. They transmit and receive exactly the same packets. Every single bit. The same IP addresses. The same broadcast addresses. The same MAC addresses. VLANs make no difference. The only sides that are aware of VLANs are the wired devices that are VLAN-aware and used on tagged networks (which in my case includes all my switches, my AP, and my server).

If I could push a button and "disable" VLANs I would do it just to end this silly argument and prove that it doesn't matter, but VLANs are a core part of how my home network works, and I can't magically "disable" them. It's not possible to do what I do without VLANs without literally sticking 5 ethernet cards into my server and having 5 times as many switches.

I am asking you to ask them to do this:

The obvious solution is implementing an additional logic to handle ICMP packets. If the device gets reject-with=icmp-admin-prohibited, the user blocked the connection intentionally, so the device should skip the reattempts. This can only be done in the firmware. Good luck with their support, I hope they fix it soon.

This is a simple and universal solution.

Yes, and I highly doubt Broadlink cares about users running Home Assistant and blocking their cloud service, so I'm not holding my breath that complaining to support will get us anywhere.

But what we're trying to do here is find a solution that works today. You claim it works for you on v57. But instead of helping us by providing a packet log to show exactly what is necessary to make it work on v57, you are telling us the problem is "VLANs" without understanding how VLANs work. We can't wave a magic wand and figure out what you're doing to make it work. When something works in case A and not in case B then we need to understand what is different in both cases. You saw that everyone else happens to be using VLANs and wrongly concluded that they have anything to do with this. I am certain that you are wrong about VLANs, for the reasons I explained above, and which you can confirm if you study how VLANs work, how VLAN tags work, what a VLAN really does on the wire, and the fact that VLANs on the air over WiFi aren't a thing that exists. So what I am asking of you now is, since we're back to square 1 and we don't know what works and what doesn't, to help us by providing a packet log of your broadlink device, from cold startup through ~6 minutes, to show that it indeed doesn't reboot, and figure out what data was exchanged that made it not do that.

marcan commented 3 years ago

So, I started looking at the firmware and quickly found the 3 minute cloud timeout (it's exactly 3 minutes) in v57. It should be trivial to patch this to be infinite.

However, currently my device isn't accepting any firmware updates (not even the official unmodified ones) via the app; maybe it wants a different format for manual URL updates or something. I'll keep poking at it. I also e-mailed support; I don't expect them to fix this problem for us, but if I can convince them to give me some update URL that is supposed to work in the manual update mode, that'll be good enough to have something to work off of.

marcan commented 3 years ago

I cracked it.

The heartbeat message that comes in via the cloud is message type 0x01. The RM3 doesn't actually care if this comes from the cloud, via WiFi unicast, broadcast, or whatever. So, as long as you send a packet type 0x01 at least once every 3 minutes, even via broadcast on the WiFi, the devices will think they're connected to the cloud and stop rebooting. It doesn't even care about the packet checksum.

So, pending better integration of this into python-broadlink and/or HA, the quick fix is sticking this into your /etc/crontab:

* * * * * root echo -ne '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0' | nc -bu 192.168.7.255 80 -w 0

This broadcasts the heartbeat message every minute (substitute 192.168.7.255 with the broadcast address of your IoT network, of course). I think there is an additional timeout on top of the 3 minute cloud timeout, so I'm currently checking to see if we can afford to send it less often. (Edit: nope, needs to be every ~3 minutes; 4 minutes after stopping sending the packets both of my RM3s rebooted on exactly the same second.)
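For anyone who would rather not depend on cron and netcat, the same heartbeat can be sketched in Python. This is only an illustration of the packet described above, not the python-broadlink implementation. The function names and the 60-second default interval are made up for this example, and the 48-byte length with the type byte at offset 0x26 is an assumption taken from counting the bytes in the crontab line.

```python
import socket
import time

PACKET_LEN = 48        # assumption: same length as the crontab payload
TYPE_OFFSET = 0x26     # assumption: 38 zero bytes precede the type byte
HEARTBEAT_TYPE = 0x01  # message type reported above


def build_heartbeat() -> bytes:
    """Return an all-zero packet with only the message type set."""
    packet = bytearray(PACKET_LEN)
    packet[TYPE_OFFSET] = HEARTBEAT_TYPE
    return bytes(packet)


def send_heartbeat(broadcast_addr: str, port: int = 80) -> None:
    """Broadcast one heartbeat over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Broadcast destinations are rejected unless SO_BROADCAST is set.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_heartbeat(), (broadcast_addr, port))


def keep_alive(broadcast_addr: str, interval: float = 60.0) -> None:
    """Send a heartbeat every `interval` seconds until interrupted."""
    while True:
        send_heartbeat(broadcast_addr)
        time.sleep(interval)
```

Calling `keep_alive("192.168.7.255")` would mirror the crontab entry. Substitute the broadcast address of your own IoT network and keep the interval well under 3 minutes.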

Filed mjg59/python-broadlink#458 for adding support to python-broadlink.

(as for @felipediel unless he provides a packet log to prove no reconnects/DHCP request, or otherwise show something else he's doing to trigger the device keepalive code, I'm just going to assume his devices are either successfully hitting the cloud, or rebooting like everyone else's, and he just isn't aware).

litinoveweedle commented 3 years ago

@marcan I am not telling you to complain about VLANs. You can work around the problem by disabling your VLAN if you want.

@felipediel With all due respect to your work as a skilled programmer, you are not a network engineer. What you are saying is simply not possible, but to really prove it to you, I set up a simple spare home router with only one Wi-Fi network and one IP subnet (so no VLANs). Then I connected my RM mini 3 using the Android app and set a firewall rule to block its internet access.

Unsurprisingly, the device keeps reconnecting. I know that you will not believe me, so I am ready to give you remote access to my test setup so you can see for yourself. Then I would expect you to acknowledge your mistaken statements about VLANs here. As you said, we are never too old to learn something new.

litinoveweedle commented 3 years ago

I cracked it.

@marcan YOU ARE THE KING! Thank you very much, that was a very constructive and really helpful approach.

@felipediel Could you please add this keep-alive reboot prevention for isolated Broadlink devices to the integration? Regardless of the flaming in this thread, there are many affected users without this fix. Thank you.

litinoveweedle commented 3 years ago

So, pending better integration of this into python-broadlink and/or HA, the quick fix is sticking this into your /etc/crontab:
* * * * * root echo -ne '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0
Depending on the OS, you may get this error:

invalid wait-time 0

As explained for example here, a zero timeout is not supported by all netcat implementations (I just tested the Debian Buster default netcat, netcat-traditional, and it has this issue). I quickly changed the wait timeout to 1 second, i.e.:

* * * * * root echo -ne '\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\x01\0\0\0\0\0\0\0\0\0' | nc -bu 192.168.3.255 80 -w 1

Then I restored my firewall rule blocking internet access from the Broadlink devices, but unfortunately the devices keep reconnecting. I took a network capture and I can see the UDP packet to my broadcast IP on port 80 being sent from my HA via the correct WLAN interface:

0000: ff ff ff ff ff ff 1c 69  7a 0b 8b 6d 81 00 00 1e  .......i z..m....
0010: 08 00 45 00 00 54 3c 41  40 00 40 11 76 06 c0 a8  ..E..T<A @.@.v...
0020: 03 02 c0 a8 03 ff b9 29  00 50 00 40 9e 60 2d 6e  .......) .P.@.`-n
0030: 65 20 00 00 00 00 00 00  00 00 00 00 00 00 00 00  e ...... ........
0040: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  ........ ........
0050: 00 00 00 00 00 00 00 00  5c 78 30 31 00 00 00 00  ........ \x01....
0060: 00 00 00 00 00 0a                                 ......

I also tested it with netcat-openbsd, which accepts a 0 timeout, but the result is the same: the packet is sent, but the device keeps rebooting. Any suggestions?

marcan commented 3 years ago

You're getting a literal \x01 in the packet, so the echo part is also different for you. Maybe you're using a different shell? It's supposed to be a 0x01 byte. There's also a \n at the end and -ne at the beginning, so it looks like your version of echo doesn't support the required options. I think the bash built-in echo should work, maybe this is a dash thing?
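To make that diagnosis reproducible, here is a rough Python sketch that pulls the UDP payload out of the capture above and lists how it differs from the intended packet. The `diagnose` helper and the constant names are hypothetical, and the fixed 46-byte header offset assumes an 802.1Q-tagged Ethernet frame with a 20-byte IPv4 header, which is what this particular dump shows.

```python
# Frame bytes transcribed from the capture above, one string per dump row.
FRAME = bytes.fromhex(
    "ffffffffffff1c697a0b8b6d8100001e"
    "0800450000543c41400040117606c0a8"
    "0302c0a803ffb929005000409e602d6e"
    "6520" + "00" * 14
    + "00" * 16
    + "00" * 8 + "5c783031" + "00" * 4
    + "00" * 5 + "0a"
)

# Assumption: 14 (Ethernet) + 4 (802.1Q tag) + 20 (IPv4) + 8 (UDP) bytes.
HEADER_LEN = 14 + 4 + 20 + 8

# Intended payload: 38 zero bytes, the 0x01 type byte, 9 zero bytes.
EXPECTED = b"\x00" * 38 + b"\x01" + b"\x00" * 9


def diagnose(payload):
    """Return a list of human-readable problems found in the payload."""
    problems = []
    if payload.startswith(b"-ne "):
        problems.append("echo printed its options literally")
    if b"\\x01" in payload:
        problems.append("the \\x01 escape was not interpreted")
    if payload.endswith(b"\n"):
        problems.append("a trailing newline was appended")
    if not problems and payload != EXPECTED:
        problems.append("payload differs from the expected heartbeat")
    return problems


captured = FRAME[HEADER_LEN:]
for problem in diagnose(captured):
    print(problem)
```

Run against the dump, this reports all three symptoms, which is consistent with an `echo` that supports neither `-n` nor `-e`.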
