@allenporter thanks for all your help!
At least there's an interim solution that works: when you do the setup, you can choose which devices you want to import. I just excluded my cameras and doorbell, so the thermostats could still be imported. It's the camera feeds/events bringing mine down.
@crenus I tried the same by only importing the Nest v3 thermostat, but the CPU usage and temperature still rise like crazy after that. Thanks for the tip though.
@tcc0 which Pi are you running, and are you running a lot of other containers or other services on it?
I moved from a Pi 3B+ to a Pi 4 and from an SD card to an SSD during all this, and it helped drastically overall regardless. My CPU and memory utilization were high before, running about 7 containers, daily backups, and a few other services. Great to start out with, but not the best for the long term.
@allenporter would you mind posting this on the grpc github issue tracker too?
Edit: this issue seems possibly related. I wonder if downgrading grpcio to 1.44.0 could fix the issue?
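If anyone wants to try that downgrade, here's a rough sketch for a Container install; the container name "homeassistant" is an assumption, and a pre-built grpcio 1.44.0 wheel may not exist for every architecture:
# Force-reinstall the older grpcio inside the running container (assumed name), then restart it.
docker exec -it homeassistant pip install --force-reinstall grpcio==1.44.0
docker restart homeassistant
Note that the next Home Assistant update will likely reinstall the pinned grpcio version again.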
Looks like the same issue
I experienced heavy CPU load too on a Pi 4, but it's with the Google Cloud Platform integration for TTS. I assume that one uses the grpc library too, but I didn't look into it. Should I open a new issue for what I think is the same problem, or is this comment sufficient?
They also closed the issue in the grpc project a few minutes ago, since they are "working to introduce a better model".
Hi. I think the underlying issue is the same, so it's fine to just track it here. Yes, it's good that the grpc developers are aware of this and working on a solution.
@allenporter I noticed the latest grpc release has the fix below incorporated, so perhaps it's time to bump to 1.51.1?
Python Fix lack of cooldown between poll attempts. (https://github.com/grpc/grpc/pull/31550)
Cython was previously ignoring the value in the pxd file and zero-initializing the period. Meanwhile, an issue with our continuous benchmarks meant that this was not noticed until now. This should result in a significant decrease in CPU usage.
Fantastic. That matches the description of the issue I was seeing when debugging.
Last time I bumped grpc there were some other dependency issues around protobuf 4. If those aren't resolved yet it may not be a trivial bump, but either way I'll manage it.
This is great news. Thank you so much for staying on top of this issue.
grpc 1.51.1 was integrated in https://github.com/home-assistant/core/pull/83420
Perhaps folks can confirm this helps on a dev build, and then we can consider this closed.
(I no longer have an arm device available for testing)
I'm afraid upgrading to 1.51.1 wasn't the solution after all :( I've upgraded to HA dev and confirmed grpc 1.51.1 is installed, but the load isn't dropping just yet (it's been a couple of hours since I upgraded). I have also disabled and re-enabled the project on the Device Access Console at Google, hoping to clear any backlog or cache there (if that's even a thing).
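In case it helps anyone else double-check, one way to confirm which grpcio is actually installed in a Container install (the container name "homeassistant" is an assumption):
# Print the grpcio version Python actually imports inside the container.
docker exec homeassistant python3 -c "import grpc; print(grpc.__version__)"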
top - 14:12:28 up 2 days, 23:16, 2 users, load average: 1.70, 1.24, 1.09
Tasks: 198 total, 1 running, 197 sleeping, 0 stopped, 0 zombie
%Cpu(s): 12.2 us, 11.8 sy, 0.0 ni, 75.6 id, 0.4 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 3844.3 total, 835.1 free, 1202.9 used, 1806.3 buff/cache
MiB Swap: 100.0 total, 98.7 free, 1.3 used. 2269.9 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
23750 root 20 0 491588 236772 48348 S 77.7 6.0 192:12.00 python3 -m homeassistant --config /config
Also this shows up in my log (has been there for quite some time, I think it's related to our issue):
2022-12-08 14:10:20.755 INFO (MainThread) [google_nest_sdm.event_media] Checking cache size 1010
2022-12-08 14:10:28.124 INFO (MainThread) [google_nest_sdm.event_media] Checking cache size 1010
2022-12-08 14:11:10.428 INFO (MainThread) [google_nest_sdm.event_media] Checking cache size 1011
Thanks, I'll remove the logging; that's a mistake.
Sorry, why do you think this log is related? I already diagnosed the root cause as being grpc overhead and this isn't it. It may just be a different grpc issue.
I upgraded my 2022.12.3 instance to the 1.51 musl wheels and sadly see a substantial increase in both short- and longer-term load. The logs do not show anything strange and all seems to work fine. Any logging I can provide to help out?
Hi, you need to use a profiler. See updates from me on this bug or other related bugs where I've posted profiler results as examples.
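As a minimal sketch of what that can look like with py-spy (assuming py-spy is installable on your platform; pre-built wheels may not exist for every architecture, and it may need to run as root):
# Install the sampling profiler and point it at the Home Assistant process.
pip install py-spy
py-spy dump --pid <homeassistant PID>                                  # one-off dump of all thread stacks
py-spy record --pid <homeassistant PID> -o profile.svg --duration 60   # 60-second flame graph
By default py-spy only samples Python frames, so time spent inside native grpc code shows up attributed to whichever Python call entered it.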
Seeing the same issue. Bit of a noob question - how do I apply the fix?
@DrMikeyS @dynasticorpheus the new version of grpc has not yet been integrated into the Home Assistant production build.
Ah - understood. What are the typical time frames for such a bugfix being pushed into prod?
@allenporter I upgraded grpc manually using pre-built wheels on my prod instance as per my previous comment. Unsure if this is a proper test of course.
Ah - understood. What are the typical time frames for such a bugfix being pushed into prod?
Home assistant releases happen monthly. (I don't believe this was tagged for a patch release)
Thanks for the update. And thanks for the work on the project.
As a workaround until the update is pushed out, I have adjusted the scheduling priority and capped the CPU usage of the Python process on the Pi. This has had no detrimental effect, but it keeps a bit of slack free for other processes and prevents it from getting too hot or using more power than needed.
top
sudo cpulimit -p [Python PID] --limit 30 --background
renice -n -12 -p [Python PID]
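One way to find the PID to substitute above, assuming a single "python3 -m homeassistant" process (otherwise pick it out of top manually):
pgrep -f "python3 -m homeassistant"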
@allenporter 2023.1.0b1 is showing even higher CPU load despite 1.51. To make things worse, I can't be of much help, as there are no py-spy musl wheels I can install on my armv7 box to provide logs :(
2023.1 also introduces a new Google Assistant SDK integration. Since it's another Google Cloud API thing, maybe it'll also suffer from the same problem.
@dynasticorpheus Thanks for the update. Python tracing won't really show the issue; it needs to be observed at a lower level. The good news is that grpc is rewriting their event engine, so they are working on this, but it's not a fast fix.
@allenporter This is just simple htop-level feedback, but I still wanted to share that the official 2023.1.0 release has calmed down CPU-usage-wise on my end (compared to the initial betas)!
2023.1.0 has also knocked the power consumption on my (Intel) server back down to similar levels as when the Nest integration was disabled. Thanks for all the time and effort spent on this!
Hi, I just updated HA (Docker installation) to 2023.1.0, but my RPi 3 CPU usage has increased from ~50% to ~100%... from what I can see in htop, there are three HA threads causing all the CPU load, with one of them constantly using 50% of the CPU and the other two, at alternating times, using the remaining 50%.
(PIDs 48, 53 and 64)
The Python profiler still shows nothing, so the issue is still in some native code:
I tried to retrieve some information for those threads by adding gdb to the Docker image and running a "thread apply all bt" command against HA's Python process. All three threads report the library "/usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so" in their stacks, so there is still some other issue in the grpc library:
(PID 53)
Thread 44 (LWP 53 "python3"):
#0 0x76f6fa6e in syscall () from /lib/ld-musl-armhf.so.1
#1 0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(PID 48 - captured while it was using the CPU)
Thread 39 (LWP 48 "python3"):
#0 0x76f6fa6e in syscall () from /lib/ld-musl-armhf.so.1
#1 0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(PID 64 - captured while it was using the CPU)
Thread 53 (LWP 64 "python3"):
#0 0x76f6fa6c in syscall () from /lib/ld-musl-armhf.so.1
#1 0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
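For reference, roughly how backtraces like these can be captured inside the container; the Alpine package name is an assumption based on the musl-based image, so adjust for your base image:
# Install gdb in the (Alpine-based) container, attach to the HA process and dump all thread stacks.
apk add gdb
gdb -p <homeassistant PID> -batch -ex "thread apply all bt"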
I also tried to enable grpc logs (GRPC_TRACE=api and GRPC_VERBOSITY=info environment variables), but it seems that the polling problem reported in https://github.com/googleapis/python-pubsub/issues/728 has been fixed, as the container's grpc library is 1.51.1 and the logs report a poll message every 200 ms, as they should:
I0105 11:25:21.678587881 83 completion_queue.cc:948] grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914321, tv_nsec: 878563767, clock_type: 1 }, reserved=0)
I0105 11:25:21.879211785 83 completion_queue.cc:948] grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 79186160, clock_type: 1 }, reserved=0)
I0105 11:25:22.080915739 83 completion_queue.cc:948] grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 280890114, clock_type: 1 }, reserved=0)
I0105 11:25:22.281308965 83 completion_queue.cc:948] grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 481280945, clock_type: 1 }, reserved=0)
I0105 11:25:22.483128597 83 completion_queue.cc:948] grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 683102347, clock_type: 1 }, reserved=0)
So there should still be something wrong in the grpc implementation...
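For anyone who wants to reproduce the tracing above on a Container install, one option is to pass the variables at container start; the container name and config path below are placeholders, and the usual networking options are omitted:
docker run -d --name homeassistant \
  -e GRPC_TRACE=api -e GRPC_VERBOSITY=info \
  -v /PATH/TO/CONFIG:/config \
  ghcr.io/home-assistant/home-assistant:stable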
In addition, I tried to profile both the main HA process and the threads using the CPU with "strace -Tcfp
(from main process)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.69 38.438107 2645 14527 13052 futex
11.76 7.838140 712558 11 3 restart_syscall
10.05 6.697177 148826 45 _newselect
8.33 5.549870 168177 33 poll
8.13 5.415075 13272 408 epoll_pwait
3.76 2.506349 47 53049 clock_gettime64
0.07 0.046078 88 521 write
0.05 0.035340 66 533 _llseek
[...]
(from the thread constantly using the CPU)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
46.67 34.166324 2457 13904 12848 futex
16.63 12.172873 1106624 11 4 restart_syscall
16.35 11.971184 244309 49 poll
8.70 6.367453 219567 29 _newselect
7.78 5.698243 11086 514 epoll_pwait
3.40 2.487209 47 52787 clock_gettime64
0.20 0.145667 446 326 read
[...]
(from the thread using CPU intermittently)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
51.00 22.481324 2182 10299 9492 futex
20.65 9.102114 827464 11 4 restart_syscall
14.25 6.281863 224352 28 _newselect
8.63 3.802922 9320 408 epoll_pwait
4.09 1.803817 46 39178 clock_gettime64
1.12 0.491646 8779 56 poll
0.04 0.019802 119 166 23 recvfrom
[...]
So, if I'm interpreting this correctly, it should confirm that there are still some issues with threading/polling management in the grpc library on ARM.
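For anyone wanting to reproduce these measurements, the summaries above correspond roughly to an invocation like the following (the PID is a placeholder; interrupt with Ctrl+C after a minute or so to print the table):
# -T: time per syscall, -c: summary table, -f: follow child threads, -p: attach to an existing PID
strace -T -c -f -p <homeassistant PID>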
Well, this is much worse for me now. Previously I was at 5% CPU without Nest, jumping to 25% with it enabled. I re-enabled Nest this morning and CPU jumps to more than 40%! This is on a Raspberry Pi 4.
I have moved to an Intel NUC, and since the update the baseline CPU usage has dropped from 10% to 7%, which is very welcome.
Jumped from 20% to 35-40% here a while ago.
Too bad my Raspberry Pi 3 has been running at 65-70 degrees 24/7 for the last couple of months, just so I can control my thermostat once or twice a day. This should be working better.
I have been seeing a jump from 5% CPU without the Nest integration to 40% with the integration.
I had been running 32-bit Home Assistant OS on a Raspberry Pi 4. The other day I migrated my HAOS to the 64-bit version, and my CPU usage stays at 5% while running the Nest integration. Yay!
I don't know if this is already known but I thought it may be helpful to someone.
@lukakama would you be up for filing another issue with grpc? You have pretty good profile data and may be able to give them the detail they need. (I don't currently have an RPi in my possession and haven't been able to get one due to the chip shortage.)
Done. I'm also sharing another strace log showing an issue in futex usage (some sort of thread synchronization locking, but with already-expired absolute timeouts), which should help them better trace the source of the issue.
However, I think I will move to a 64-bit OS as soon as possible (thanks @Photoexploration!), in order to work around the problem and reduce my RPi's power usage.
I switched to Raspberry Pi OS (64-bit) (thanks @Photoexploration) and now the Nest integration is working fine! The CPU load is back to normal.
Did the same 64-bit swap as rgerbranda; my CPU usage finally dropped to around 5% and temperatures to the low 50s, for the first time since summer last year.
There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.
Still an issue - I am running a new install on a Raspberry Pi 4, hosted via Docker Compose.
I spotted high CPU last night and narrowed it down to the Nest integration. CPU is around 30% with Nest enabled, 1-2% with it disabled.
I assume we need to wait for a fix further upstream based on the comment history, but I just wanted to confirm that it's still an issue on the latest HA version.
This impacts me since I was using the old Works with Nest mode until support for that mode was removed in HA 2023.6.1. Of course we only had until Sept 29 to migrate anyway.
I am running HA 2023.6.2 on a Raspberry Pi 2B / armv7 hosted via Docker Compose, and CPU has jumped from 0.5% to 20% with this new mode of the same Nest integration. As noted above, it appears we are waiting for an upstream fix.
@allenporter @bdraco I just manually applied #100314 (Bump grpcio to 1.58.0) on 2023.9.2 and the CPU load (and therefore temperature) has come down dramatically on my armv7 device.
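For anyone wanting to try the same ahead of the official release, a rough sketch for a Container install; the container name is an assumption, and a pre-built 1.58.0 wheel may not exist for every architecture:
docker exec -it homeassistant pip install --upgrade grpcio==1.58.0
docker restart homeassistant
(As with any manual override, the next Home Assistant update will likely revert it.)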
That's great news. The lib bump will come out with 2023.10.x
Let's call this fixed as of 2023.10.x -- great that the bump fixed it.
The problem
When I enable the Google Nest integration, I see a continuous CPU load from python3 of about 65%. Without this integration, the CPU load from python3 is just about 2%.
Any recommendation to optimize the load of the Google Nest integration?
What version of Home Assistant Core has the issue?
2022.2.9
What was the last working version of Home Assistant Core?
No response
What type of installation are you running?
Home Assistant Container
Integration causing the issue
nest
Link to integration documentation on our website
https://www.home-assistant.io/integrations/nest
Diagnostics information
Example YAML snippet
No response
Anything in the logs that might be useful for us?
No response
Additional information
No response