greearb / ath10k-ct

Stand-alone ath10k driver based on Candela Technologies Linux kernel.

Ecobee thermostats appear as not responding in Apple Home app #114

Closed ed-velez closed 4 years ago

ed-velez commented 4 years ago

I have a pair of Ecobee 3 thermostats that appear as "Not Responding" in the Apple Home app after some time. This behavior goes away and returns throughout the day. The odd thing is that the thermostats always appear to respond properly when I use the native Ecobee app to interface with them. None of the other HomeKit devices I use really show the issue.

I am currently running Hnyman's OpenWrt master branch snapshot builds for the R7800 found here. I've tried your beta firmware image from 12/16/19 and still see the issue. I suspect a firmware issue because I tried the stock QCA firmware, specifically the 10.4-3.9.0.2-00070 build, and I'm not really seeing the issue there. Everything also appears OK with the vendor stock firmware.

Not sure if it's a power-save issue, because the Home app will show the device state as "Updating" and eventually error out with "Not Responding". When that occurs, the device is initially unresponsive to ping but starts responding to my ping requests after several seconds. Could be similar to issue #80.

I have a pair of R7800s so I am happy to temporarily devote one to testing.

greearb commented 4 years ago

Please see issue #50, where that bug was bisected to commit 1085. Can you see if build 1084 works better for you as well? If that is not the issue, then please try bisecting using the binaries linked earlier in issue #50.

ed-velez commented 4 years ago

Did some preliminary testing and I don't think 1084 actually fixes my issue. I eventually triggered the Home app to show "Not Responding" for the thermostats. The interesting thing I am seeing is that with some builds the devices might appear as "Not Responding", but the Home app re-establishes connectivity after I close and reopen it. With other builds, once a device goes "Not Responding", reopening the app doesn't appear to re-establish connectivity. The 1084 build was in the first category: it sometimes re-established connectivity after reopening the app. This might be anecdotal but I thought I would put it out there.

I am still in the process of testing builds but I want to say my issue crops up between builds 200-250. I should have more time tonight/tomorrow to finish the testing. Right now I am just testing the HTT-MGT builds. Would there be any value in trying to find the issue with the non HTT-MGT builds?

ed-velez commented 4 years ago

Right now I want to say that firmware-5-full-htt-mgt-community-commit-232-6fe5808d8.bin is the one where I definitely begin to see the "Not Responding" flakiness in the Home app. Some builds seem to take longer to trigger the issue, but swapping from 231 to 232 seems to cause the issue I am experiencing almost immediately. I'll use 231 for a little longer to make sure it really is good and isn't just a build that takes longer to trigger the issue. If that's the case I'll report back.
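For anyone following along, the build-number bisect I'm doing by hand can be sketched roughly like this in Python. This is only an illustration: `first_bad_build` and the `is_bad` callback are made-up names, and it assumes the builds are numbered consecutively and that the behavior flips from good to bad exactly once. In practice `is_bad` is me flashing a build and watching the Home app for a while, and a build that only takes longer to fail can give a wrong "good" answer.

```python
from typing import Callable


def first_bad_build(known_good: int, known_bad: int,
                    is_bad: Callable[[int], bool]) -> int:
    """Return the lowest build number that tests bad.

    Assumes known_good really is good, known_bad really is bad, and the
    behaviour flips exactly once somewhere in between.
    """
    if known_good >= known_bad:
        raise ValueError("known_good must be lower than known_bad")
    lo, hi = known_good, known_bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(mid):
            hi = mid  # regression is at mid or earlier
        else:
            lo = mid  # regression is after mid
    return hi
```

With a stand-in test such as `lambda build: build >= 232`, calling `first_bad_build(200, 250, ...)` needs only five flashes instead of fifty.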

With that said what changed in 232? Also is the commit log for these changes posted anywhere?

ed-velez commented 4 years ago

Also, just to elaborate on my setup: I use both my R7800s in access point mode and nothing else. All my IoT devices (including these thermostats) connect to the 2.4GHz band. Currently one device has only the 5GHz band enabled and the other has only the 2.4GHz band enabled. I'll enable both bands on both devices once I can figure out the cause of this particular issue. It's just set up this way to minimize the complexity for troubleshooting.

timkgh commented 4 years ago

I have very similar issues with my Nest thermostat and other lower powered devices with CT firmware. I found that the latest OpenWRT builds still work with the "old" official firmware, so that's how I tend to run my R7800 (also as a dumb AP, but with multiple VLANs and SSIDs -- though I tested that in the past by reverting to a default config and the WiFi problems were still there). I also have other issues with CT that I detailed in my issue #80. Sounds like @greearb needs to upgrade his home to a "smart" thermostat? :)

greearb commented 4 years ago

6fe5808d8 is a relatively minor (tm) change to rate-ctrl. Let me try to track down bug #50, which may be similar and seems easy to reproduce, and then I can take a closer look at 6fe5808d8 if the problem is not already fixed.

I have a smart thermostat, but use an ancient ath9k AP because my DSL is so crappy that even /n is overkill :)

ed-velez commented 4 years ago

@timkgh Does 231 work better for you than 232? I wonder if it's the same issue.

timkgh commented 4 years ago

With 232 I cannot get any 2.4GHz devices to connect at all.

231 looked better, I could ping the Nest sometimes but not always. But the 5GHz radio just froze with it, my MBP showed a 6Mbps link speed and could not pass any data anymore though it stayed connected.

Just like in your setup, I keep the smart/IoT devices on 2.4GHz and laptops, tablets, phones on 5GHz, I use different SSIDs so that they don't get to connect to whatever band they feel like it.

timkgh commented 4 years ago

@ed-velez fwiw, "old" firmware firmware-5.bin_10.4-3.10-00047 works very well for me and it still works with the latest OpenWRT builds, if you are looking for a solution until the CT issues are worked out. https://github.com/kvalo/ath10k-firmware/tree/master/QCA9984/hw1.0/3.10

ed-velez commented 4 years ago

@timkgh You just made me realize that I wasn't using the latest legacy firmware version. But yeah I'm sticking to that version for now until the ct build gets worked out. Still happy to do any debugging and testing as needed.

greearb commented 4 years ago

If you can bisect, here is a new series for 9984:

http://www.candelatech.com/downloads/ath10k-9984-10-4b/bisect/all_builds-9984-H-jan-27-2020.tar.gz

ed-velez commented 4 years ago

@greearb Same results with that series. I don't see issues till 232. One thing of note is that 232 seems to break fairly drastically and the thermostats don't seem to connect at all. I wanna say other devices appear to connect (at least it appears that way in the gui). But the logs suggest that there was a firmware crash.

greearb commented 4 years ago

Please send me or attach the dmesg output if there was a FW crash. The 232 patch adds debugging that would cause an assert when some bad things happen, so maybe your setup is triggering those bad things.

greearb commented 4 years ago

Also, can you test build 241? The patches between 232 and 241 fix some nasty bugs.

ed-velez commented 4 years ago

> Please send me or attach the dmesg output if there was a FW crash. The 232 patch adds debugging that would cause an assert when some bad things happen, so maybe your setup is triggering those bad things.

https://gist.github.com/ed-velez/98b3cbd822e8055b7d3e81763e331487

greearb commented 4 years ago

Ok, that was the assert I added in 232. If 241 still acts weird, let me know, and attach dmesg if it crashes.

ed-velez commented 4 years ago

@greearb 241 similarly appears to crash:

https://gist.github.com/ed-velez/75886e8200c5349cca78ee4639a69563

greearb commented 4 years ago

Ok, I moved the assert to the top of the tree. Hopefully this will let you bisect whatever remaining problems happen. Before I rebuild all the images again, please check to see if this one functions at least as well as 231 (i.e., the one before the bad commit). firmware-5-full-htt-mgt-community.bin.gz

ed-velez commented 4 years ago

@greearb Yep that image seems ok thus far.

greearb commented 4 years ago

Here is a new series to bisect: http://www.candelatech.com/downloads/ath10k-9984-10-4b/bisect/all_builds-9984-H-jan-28-2020.tar.gz

ed-velez commented 4 years ago

It looks like up till 561 is ok but then 562 fails and causes a firmware crash:

https://gist.github.com/ed-velez/817071b077ec983c27c647ef26bb9bf1

(I'm curious about #103 as I was originally thinking that the issue had to do with multicast since the problem was related to HomeKit.)

I'll try to skip a few and see if things start working at some point.

ed-velez commented 4 years ago

Actually will need more time to test. The bug I am experiencing is a bit harder for me to trigger myself now. I just experienced it with 561 so I will do more testing.

ed-velez commented 4 years ago

@greearb Currently testing the latest firmware you posted in #103 as I'm pretty certain the multicast issue debugged there is the same issue that I'm experiencing with my thermostats and Homekit. Will update and confirm.

ed-velez commented 4 years ago

@greearb Definitely seems like my devices aren't really experiencing the issue with disconnecting with that latest firmware you posted in #103.

With that said, it seems like there is now a performance issue with that build. Performing consecutive pings on the access point shows bursts of significant latency. Additionally, an app I use to perform mDNS service discovery is way way slower. And FYI I don't see this latency at all when using 210 as noted as being the most stable and free of the multicast issue in the most recent bisect series.

greearb commented 4 years ago

Change 562 should be a complete non-operational change from the previous commit, so would be strange if it triggered a bug. Is the crash with it reproducible? Anyway, I'm doing a new series to bisect with one of the problematic changes in #103 moved to top of tree in hopes that will make the bisect for other problems easier.

ed-velez commented 4 years ago

@greearb So testing with the new series - I don't think the firmware crash is specific to 562. I probably suggested 562 since the crash happened right after reboot. (For my testing - I'm just copying the firmware into place and rebooting.) But this particular crash doesn't always appear to happen just after reboot. So far the earliest build where I see the crash in the logs is 513: https://gist.github.com/ed-velez/af9969258652062cc063bd80b309afc3

Now with that said - I am still seeing the random "Not responding" behavior that I originally mentioned. I'll spend some time trying to determine if there is a specific build that the issue occurs with using the latest series.

ed-velez commented 4 years ago

Ooof looks like it actually happens earlier...just experienced it running 500:

[ 2188.397324] ath10k_pci 0001:01:00.0: firmware crashed! (guid n/a)
[ 2188.397370] ath10k_pci 0001:01:00.0: qca9984/qca9994 hw1.0 target 0x01000000 chip_id 0x00000000 sub 168c:cafe
[ 2188.402402] ath10k_pci 0001:01:00.0: kconfig debug 0 debugfs 1 tracing 0 dfs 1 testmode 0
[ 2188.414440] ath10k_pci 0001:01:00.0: firmware ver 10.4b-ct-9984-fH-009-76277be15 api 5 features mfp,peer-flow-ctrl,txstatus-noack,wmi-10.x-CT,ratemask-CT,regdump-CT,txrate-CT,flush-all-CT,pingpong-CT,ch-regs-CT,nop-CT,htt-mgt-CT,set-special-CT crc32 2730faeb
[ 2188.421547] ath10k_pci 0001:01:00.0: board_file api 2 bmi_id 0:2 crc32 85498734
[ 2188.443209] ath10k_pci 0001:01:00.0: htt-ver 2.2 wmi-op 6 htt-op 4 cal pre-cal-file max-sta 32 raw 0 hwcrypto 1

will do more waiting and testing...

ed-velez commented 4 years ago

@greearb Right now I see the thermostat HomeKit disconnects in 409-bdded3bbe with everything appearing ok with 408. Again, this is separate from the above firmware crashes that I have found. Next I'll see if I can find the commit where the firmware crashes begin to appear.

greearb commented 4 years ago

Hello,

Probably best to just debug the 409 issue first, as long as the crash doesn't happen in top-of-tree then probably it is already fixed.

ed-velez commented 4 years ago

@greearb Sure, I'll double check and make sure I don't see it on top of the tree. The crash could perhaps even be related to what I might be seeing in 409. Thus far I've seen the firmware crash issue as low as 425. Similarly, it's not something that I can trigger on a whim - I've had to wait some time for something to appear in the log.

greearb commented 4 years ago

How reliable (repeatable) is the 408/409 bisect? Patch 409 is mostly whitespace and debugging, but there are a few functional changes that I guess could cause problems. Please confirm you see good repeatability in good/bad on those two commits and if so, I will break up 409 to narrow down the problem.

ed-velez commented 4 years ago

@greearb Sure thing. It may take a few days of testing to be certain.

ed-velez commented 4 years ago

@greearb I left 408 on my router overnight and didn't really experience any issues in the logs or when accessing the Home app. I had another separate device (a HomePod) appearing as "Not Responding" with 408, but that seemed to happen exclusively on my MacBook and not on my phone (so probably a separate issue that might not appear with the top of the tree).

But regardless, I put 409 back on and fairly immediately began to see the issues I noted. I will also note that the firmware crash I previously detailed also occurred with 409 - so something definitely seems amiss with a change in 409 in my testing. I ran with 408 for like 16 hours and didn't see any evidence of a firmware crash.

greearb commented 4 years ago

I split 409 into two patches. -a contains some code that looks questionable. -b should be same as 409. Please test with -a and see if it reproduces the problem, and if not, please test with -b to make sure it still causes the problem. firmware-5-full-htt-mgt-community.bin-a.gz firmware-5-full-htt-mgt-community.bin-b.gz If you see crashes, please attach dmesg.

ed-velez commented 4 years ago

Yeah definitely looks like that -a is reproducing the issue.

greearb commented 4 years ago

I think I see the problem, I had deleted some confusing (but likely correct) code, probably after mis-reading it. I think it would cause frames to be transmitted when they should not be in some cases.

Please test this, it is 409 with the code un-deleted. firmware-5-full-htt-mgt-community.bin-b2.gz

ed-velez commented 4 years ago

Causes the firmware crash much more quickly: https://gist.github.com/ed-velez/47c3cbb4d950ce861b272a059dda727e

greearb commented 4 years ago

Ok, please try this -a3 image. Will put a -b3 up in a bit. firmware-5-full-htt-mgt-community.bin-a3.gz

greearb commented 4 years ago

Here is a new -b3. firmware-5-full-htt-mgt-community.bin-b3.gz

greearb commented 4 years ago

y3 is top-of-tree minus patch previously shown to be questionable, z3 is top-of-tree. Please test these as well assuming the ones above work. firmware-5-full-htt-mgt-community.bin-z3.gz firmware-5-full-htt-mgt-community.bin-y3.gz

ed-velez commented 4 years ago

So not sure if any of this will make sense:

a3 - bad
b3 - bad
z3 - bad
y3 - seems ok

And by "bad" I'm strictly referring to the thermostat "Not Responding" issue. I'm sort of assuming the firmware crashes might be related but I didn't explicitly verify in the logs (I can if you need me to). y3 does seem ok; I'll keep it on for the day and report back if I see anything strange.

greearb commented 4 years ago

Thanks for testing, I thought that a3 would have been OK and have zero functional changes. I need to re-check the patch, and next time I post an a4, please attach dmesg firmware load info that shows the git commit-id so I can make sure I didn't screw up the build/upload somehow.
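To make that check less error-prone, the commit-id can be pulled out of the `firmware ver` field that ath10k prints in dmesg (as in the crash log posted earlier in this thread). The Python below is a rough sketch, not part of the driver or any CT tooling: `firmware_commit` is a made-up helper, and it assumes the CT version string ends in `-<abbreviated git hash>`, which matches the one log line shown here but is not a documented format.

```python
import re
from typing import Optional

# Matches the "firmware ver <version>" field in an ath10k dmesg line.
_FW_VER = re.compile(r"firmware ver (\S+)")


def firmware_commit(dmesg_line: str) -> Optional[str]:
    """Return the trailing commit-id of the firmware version, if any.

    Assumption: the version string ends in "-<abbreviated git hash>".
    Returns None when the line has no version or no hash-like suffix.
    """
    match = _FW_VER.search(dmesg_line)
    if match is None:
        return None
    tail = match.group(1).rsplit("-", 1)[-1]
    # Accept only lowercase hex of plausible abbreviated-hash length.
    return tail if re.fullmatch(r"[0-9a-f]{7,40}", tail) else None
```

Comparing the returned value against the hash in the image's file name would show whether the build that was uploaded is the one that was actually loaded.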

greearb commented 4 years ago

This commit should be same as 408, please test to make sure it still works. In case it crashes, attach dmesg. firmware-5-full-htt-mgt-community.bin-408.gz

greearb commented 4 years ago

Ok, here is -a4. It backports the fix for the assert/crash that the original 409 commit caused. Please let me know how it works, and if it crashes, add dmesg. firmware-5-full-htt-mgt-community.bin-a4.gz

ed-velez commented 4 years ago

The -408 appears to work fine. The -a4 causes the not responding issue in the Home app but doesn't seem to be crashing with 45min of uptime.

greearb commented 4 years ago

Ok, I split up the 'a' patch into 3 smaller ones. Please let me know where breakage starts. firmware-5-full-htt-mgt-community.bin-a5.gz firmware-5-full-htt-mgt-community.bin-b5.gz firmware-5-full-htt-mgt-community.bin-c5.gz

ed-velez commented 4 years ago

@greearb Ok looks like I start seeing the issue with -b5.

ed-velez commented 4 years ago

I will also note that the issue seems waaaay harder to trigger when that commit is broken up like this. From the app's perspective a device may show as "Not Responding" but then eventually connects. However, what I am using to assess this is an app that lets me perform mDNS discovery. The entries that should be broadcast by the thermostats eventually go missing with -b5.
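For reference, a similar mDNS check can be approximated without a dedicated app. The Python sketch below sends a one-shot PTR query to the standard mDNS group (224.0.0.251, port 5353) and collects the addresses that answer. The helper names are made up, and `_hap._tcp.local` is an assumption: it is the service type HomeKit IP accessories normally advertise, but I have not confirmed it is what my discovery app browses for. It also does not parse the answers, so it only shows who replied.

```python
import socket
import struct
import time

MDNS_GROUP = ("224.0.0.251", 5353)  # standard mDNS multicast address and port


def build_ptr_query(service: str = "_hap._tcp.local") -> bytes:
    """Build a one-question DNS PTR query packet for `service`."""
    # Header: ID 0, flags 0, QDCOUNT 1, ANCOUNT/NSCOUNT/ARCOUNT 0.
    header = struct.pack("!6H", 0, 0, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in service.strip(".").split(".")
    ) + b"\x00"
    # QTYPE 12 = PTR, QCLASS 1 = IN.
    return header + qname + struct.pack("!2H", 12, 1)


def discover(service: str = "_hap._tcp.local", timeout: float = 3.0) -> set:
    """Send one query and return the source addresses of all replies."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(0.5)
    responders = set()
    try:
        sock.sendto(build_ptr_query(service), MDNS_GROUP)
        deadline = time.monotonic() + timeout
        while time.monotonic() < deadline:
            try:
                _data, (addr, _port) = sock.recvfrom(9000)
            except socket.timeout:
                continue
            responders.add(addr)
    finally:
        sock.close()
    return responders
```

Running `discover()` every minute or so and noting when a thermostat's address stops showing up would give a timestamp to line up against the dmesg log.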

timkgh commented 4 years ago

FWIW the way I can quickly tell things are not right with my Nest: I let it turn its screen off, which I assume puts it in some power save mode. I start a continuous ping to it from my Mac and I expect it to not respond to pings with good or bad firmware while its screen is off. Then I wave close to it to wake it up, the screen comes on and I expect it to start responding to pings. It works reliably with the "old" firmware but with "ct", depending on version, it either never responds or it's erratic: sometimes it responds but more often than not it does not. Something is not right when it transitions from power save to awake.

With "ct" firmware I also notice other problems between different wifi clients, e.g. ssh connections between 2 Macs on the same WLAN are very laggy; typing feels like it happens over a very slow link with high packet loss. I also have a client based on MediaTek LinkIt Smart 7688 that, when pinged, consistently drops every 4-5th packet with "ct".
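The wake-up ping check described at the top of this comment could be scored with something like the Python below, so that different builds can be compared by numbers rather than by feel. It is only a sketch: `summarize_wake` and the `(timestamp, replied)` sample format are made up for illustration, and collecting the samples (for example by logging each ping result with a timestamp) is left out.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, bool]  # (seconds since start, got a ping reply?)


def summarize_wake(samples: Sequence[Sample],
                   wake_time: float) -> Tuple[Optional[float], float]:
    """Return (delay until first reply after wake, loss ratio after wake).

    The delay is None if the device never answered after wake_time.
    The loss ratio is 0.0 if there are no samples after wake_time.
    """
    after = [(t, ok) for t, ok in samples if t >= wake_time]
    if not after:
        return None, 0.0
    first_reply = next((t for t, ok in after if ok), None)
    delay = None if first_reply is None else first_reply - wake_time
    lost = sum(1 for _, ok in after if not ok)
    return delay, lost / len(after)
```

A healthy run should give a short delay and a loss ratio near zero; the erratic behavior shows up as a high loss ratio, and a device that never comes back shows up as a delay of `None`.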

greearb commented 4 years ago

timkgh, can you test to see if a5 (or maybe the 408 build) works well for you, and see if b5 does not?

And, earlier, the y3 was reported as OK. If b5 (or c5) is noticeably worse than y3, then somewhere in the 700 commits between those, I must have fixed the problem. If ed-velez is willing, I can do a new series of builds so he can bisect where the fix came in. I hope that would let me better understand what is the problem in b5 since the code looks fine to me.