home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
73.58k stars · 30.75k forks

Nest integration high CPU usage on armv7 / raspberry pi in pubsub subscriber native code (fixed in 2023.10.x) #66983

Closed rgerbranda closed 10 months ago

rgerbranda commented 2 years ago

The problem

When I enable the Google Nest integration, I see a continuous python3 CPU load of about 65%. Without this integration, the python3 CPU load is only about 2%.

Any recommendation to optimize the load of the Google Nest integration?
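For triage, a useful first step is to check whether the load comes from one thread or many (a sketch; assumes GNU procps `ps`, as shipped on Raspberry Pi OS):

```shell
# List the ten busiest threads system-wide so a 65% python3 load can be
# tied to specific threads rather than to the process as a whole.
ps -eLo pid,tid,pcpu,comm --sort=-pcpu | head -n 10
```

If a small number of threads dominate, that points at a busy loop in one component (here, grpc's pubsub polling) rather than general interpreter overhead.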

What version of Home Assistant Core has the issue?

2022.2.9

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

nest

Link to integration documentation on our website

https://www.home-assistant.io/integrations/nest

Diagnostics information

{
  "home_assistant": {
    "installation_type": "Home Assistant Container",
    "version": "2022.2.9",
    "dev": false,
    "hassio": false,
    "virtualenv": false,
    "python_version": "3.9.7",
    "docker": true,
    "arch": "armv7l",
    "timezone": "Europe/Brussels",
    "os_name": "Linux",
    "os_version": "5.10.63-v7l+",
    "run_as_root": true
  },
  "custom_components": {
    "config_editor": {
      "version": "3.0",
      "requirements": []
    },
    "hacs": {
      "version": "1.22.0",
      "requirements": [
        "aiogithubapi>=21.11.0"
      ]
    },
    "bosch": {
      "version": "0.17.3",
      "requirements": [
        "bosch-thermostat-client==0.17.3"
      ]
    },
    "afvalbeheer": {
      "version": "4.9.2",
      "requirements": [
        "rsa",
        "pycryptodome"
      ]
    },
    "authenticated": {
      "version": "21.9.0",
      "requirements": []
    }
  },
  "integration_manifest": {
    "domain": "nest",
    "name": "Nest",
    "config_flow": true,
    "dependencies": [
      "ffmpeg",
      "http",
      "media_source"
    ],
    "documentation": "https://www.home-assistant.io/integrations/nest",
    "requirements": [
      "python-nest==4.2.0",
      "google-nest-sdm==1.7.1"
    ],
    "codeowners": [
      "@allenporter"
    ],
    "quality_scale": "platinum",
    "dhcp": [
      {
        "macaddress": "18B430*"
      },
      {
        "macaddress": "641666*"
      },
      {
        "macaddress": "D8EB46*"
      },
      {
        "macaddress": "1C53F9*"
      }
    ],
    "iot_class": "cloud_push",
    "is_built_in": true
  },
  "data": {
    "subscriber": {
      "start": 1,
      "message_received": 7,
      "message_acked": 7
    },
    "devices": [
      {
        "data": {
          "name": "**REDACTED**",
          "type": "sdm.devices.types.DOORBELL",
          "assignee": "**REDACTED**",
          "traits": {
            "sdm.devices.traits.Info": {
              "customName": "**REDACTED**"
            },
            "sdm.devices.traits.CameraLiveStream": {
              "maxVideoResolution": {
                "width": 640,
                "height": 480
              },
              "videoCodecs": [
                "H264"
              ],
              "audioCodecs": [
                "AAC"
              ],
              "supportedProtocols": [
                "RTSP"
              ]
            },
            "sdm.devices.traits.CameraImage": {
              "maxImageResolution": {
                "width": 1920,
                "height": 1200
              }
            },
            "sdm.devices.traits.CameraPerson": {},
            "sdm.devices.traits.CameraSound": {},
            "sdm.devices.traits.CameraMotion": {},
            "sdm.devices.traits.CameraEventImage": {},
            "sdm.devices.traits.DoorbellChime": {}
          },
          "parentRelations": [
            {
              "parent": "**REDACTED**",
              "displayName": "**REDACTED**"
            }
          ]
        },
        "command": {
          "sdm.devices.commands.CameraLiveStream.GenerateRtspStream": 1
        },
        "event_media": {
          "event": 2,
          "event.new": 2,
          "event.fetch": 2,
          "fetch_image": 2,
          "fetch_image.skip": 2,
          "event.notify": 2
        }
      }
    ]
  }
}

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

crenus commented 2 years ago

@allenporter thanks for all your help!

At least there's an interim solution that works: during setup, you can choose which devices you want to import. I just excluded my cameras and doorbell, so the thermostats could still be imported. It's the camera feeds/events bringing mine down.

tcc0 commented 2 years ago

@crenus tried the same by only importing the Nest v3 thermostat, but the CPU load and temperature still rise like crazy after that. Thanks for the tip though.

crenus commented 2 years ago

@tcc0 which Pi are you running, and are you running a lot of other containers or services on it?

I moved from a Pi 3B+ to a Pi 4, and from SD card to SSD, all during this, and it helped drastically overall regardless. My CPU and memory utilization were high before, running about 7 containers, daily backups, and a few other services. Great to start out, but not the best for the long term.

iliketoprogram14 commented 2 years ago

@allenporter would you mind posting this on the grpc github issue tracker too?

Edit: this issue seems possibly related. I wonder if downgrading grpcio to 1.44.0 could fix the issue?

allenporter commented 2 years ago

Looks like the same issue

Lackmake commented 1 year ago

I experienced heavy CPU load too on a Pi 4, but with the Google Cloud Platform integration for TTS. I assume that one uses the grpc library too, but I didn't look into it. Should I open a new issue for what I think is the same problem, or would this comment be sufficient?

They also closed the issue in the grpc project a few minutes ago, since they are "working to introduce a better model".

allenporter commented 1 year ago

Hi. I think the underlying issue is the same, so it's fine to just track it here. Yes, it's good that the grpc developers are aware of this and working on a solution.

dynasticorpheus commented 1 year ago

@allenporter Noticed the latest grpc release has the below fix incorporated, so perhaps it's time to bump to 1.51.1?

Python Fix lack of cooldown between poll attempts. (https://github.com/grpc/grpc/pull/31550)

Cython was previously ignoring the value in the pxd file and zero-initializing the period. Meanwhile, an issue with our continuous benchmarks meant that this was not noticed until now. This should result in a significant decrease in CPU usage.

allenporter commented 1 year ago

Fantastic. That matches the description of the issue I was seeing when debugging.

Last time I bumped grpc there were some other dependency issues around protobuf 4. If those aren't resolved yet, it may not be a trivial bump, but either way I'll manage it.

Photoexploration commented 1 year ago

This is great news. Thank you so much for staying on top of this issue.

allenporter commented 1 year ago

grpc 1.51.1 was integrated in https://github.com/home-assistant/core/pull/83420

Perhaps we can confirm that folks see this helps on a dev build, and then we can consider this closed.

allenporter commented 1 year ago

(I no longer have an arm device available for testing)

daxy01 commented 1 year ago

I'm afraid upgrading to 1.51.1 wasn't the solution after all :( I've upgraded to the HA dev build and confirmed grpc 1.51.1 is installed, but the load isn't dropping yet (it's been a couple of hours since I upgraded). I have also disabled and re-enabled the project in the Device Access Console at Google, hoping to clear any backlog or cache there (if that's even a thing).

top - 14:12:28 up 2 days, 23:16,  2 users,  load average: 1.70, 1.24, 1.09
Tasks: 198 total,   1 running, 197 sleeping,   0 stopped,   0 zombie
%Cpu(s): 12.2 us, 11.8 sy,  0.0 ni, 75.6 id,  0.4 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3844.3 total,    835.1 free,   1202.9 used,   1806.3 buff/cache
MiB Swap:    100.0 total,     98.7 free,      1.3 used.   2269.9 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
23750 root      20   0  491588 236772  48348 S  77.7   6.0 192:12.00 python3 -m homeassistant --config /config

Also, this shows up in my log (it has been there for quite some time; I think it's related to our issue):

2022-12-08 14:10:20.755 INFO (MainThread) [google_nest_sdm.event_media] Checking cache size 1010
2022-12-08 14:10:28.124 INFO (MainThread) [google_nest_sdm.event_media] Checking cache size 1010
2022-12-08 14:11:10.428 INFO (MainThread) [google_nest_sdm.event_media] Checking cache size 1011
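To double-check the installed grpc version claim above, a small sketch (assumes shell access inside the container, e.g. via `docker exec`):

```shell
# Print the grpcio version that the running Python environment
# actually imports (or a fallback message if it isn't installed).
python3 - <<'EOF'
try:
    import grpc
    print("grpcio", grpc.__version__)
except ImportError:
    print("grpcio not installed")
EOF
```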
allenporter commented 1 year ago

Thanks I'll remove the logging, that's a mistake.

Sorry, why do you think this log is related? I already diagnosed the root cause as being grpc overhead and this isn't it. It may just be a different grpc issue.

dynasticorpheus commented 1 year ago

I upgraded my 2022.12.3 instance to the 1.51 musl wheels and sadly see a substantial increase in short- and long-term load. Logs do not show anything strange, and all seems to work fine. Is there any logging I can provide to help out?

allenporter commented 1 year ago

Hi, you need to use a profiler. See updates from me on this bug or other related bugs where I've posted profiler results as examples.
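A minimal py-spy sketch along those lines (py-spy is installable via `pip install py-spy`; the pgrep pattern matches the `python3 -m homeassistant` command line shown earlier in this thread). Note that, as discussed here, time spent inside grpc's native code will mostly show up as opaque cygrpc frames:

```shell
# Attach py-spy to the running Home Assistant process, if both exist.
HA_PID=$(pgrep -f "python3 -m homeassistant" 2>/dev/null | head -n 1)
if [ -n "$HA_PID" ] && command -v py-spy >/dev/null 2>&1; then
    py-spy dump --pid "$HA_PID"                          # one-off stack dump
    py-spy record -d 30 -o profile.svg --pid "$HA_PID"   # 30 s flame graph
else
    echo "py-spy or a running homeassistant process not found"
fi
```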

DrMikeyS commented 1 year ago

Seeing the same issue. Bit of a noob question - how do I apply the fix?

allenporter commented 1 year ago

@DrMikeyS @dynasticorpheus the new version of grpc has not yet been integrated into home assistant production build.

DrMikeyS commented 1 year ago

Ah - understood. What are the typical time frames for such a bugfix being pushed into prod?


dynasticorpheus commented 1 year ago

@allenporter I upgraded grpc manually using pre-built wheels on my prod instance, as per my previous comment. Unsure if this is a proper test, of course.

allenporter commented 1 year ago

Ah - understood. What are the typical time frames for such a bugfix being pushed into prod?

Home assistant releases happen monthly. (I don't believe this was tagged for a patch release)

DrMikeyS commented 1 year ago

Ah - understood. What are the typical time frames for such a bugfix being pushed into prod?

Home assistant releases happen monthly. (I don't believe this was tagged for a patch release)

Thanks for the update. And thanks for the work on the project.

As a workaround until the update is pushed out, I have limited the priority and capped the CPU usage of python on the Pi. This has had no detrimental effect, but it keeps a bit of slack free for other processes and prevents the Pi getting too hot or using more power than needed:

top  # find the Python PID
sudo cpulimit -p [Python PID] --limit 30 --background
renice -n -12 -p [Python PID]
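The workaround above can be sketched end-to-end like this (assumes the cpulimit package is installed, e.g. `apt install cpulimit`; note that a positive nice value such as 12 lowers priority, whereas the negative value quoted above would raise it):

```shell
# Find the Home Assistant PID and throttle it, guarding against
# the process or the tool being absent.
HA_PID=$(pgrep -f "python3 -m homeassistant" 2>/dev/null | head -n 1)
if [ -n "$HA_PID" ] && command -v cpulimit >/dev/null 2>&1; then
    sudo cpulimit -p "$HA_PID" --limit 30 --background  # cap at ~30% of one core
    sudo renice -n 12 -p "$HA_PID"                      # lower scheduling priority
else
    echo "cpulimit or a running homeassistant process not found"
fi
```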

dynasticorpheus commented 1 year ago

@allenporter 2023.1.0b1 is showing even higher CPU load despite 1.51. To make things worse, I can't be of any help, as there are no py-spy musl wheels I can install on my armv7 box to provide logs :(

Lackmake commented 1 year ago

2023.01 also introduces a new Google Assistant SDK integration. Since it's another Google Cloud API integration, maybe it'll also suffer from the same problem.

allenporter commented 1 year ago

@dynasticorpheus Thanks for the update. Python tracing won't really show the issue; it needs to be observed at a lower level. The good news is that grpc is rewriting their event engine, so they are working on this, but it's not a fast fix.

dynasticorpheus commented 1 year ago

@allenporter Simple htop-level feedback, but I still wanted to share that the official 2023.1.0 release has calmed down CPU-usage-wise on my end (compared to the initial betas)!

jamie01 commented 1 year ago

2023.1.0 has also knocked power consumption on my (Intel) server back down to similar levels when Nest integration was disabled. Thanks for all the time and effort spent on this!


lukakama commented 1 year ago

Hi, I just updated HA (Docker installation) to 2023.1.0, but my RPi 3 CPU usage has increased from ~50% to ~100%... From what I can see in htop, there are three HA threads causing all the CPU load, with one of them constantly using 50% of the CPU and the other two, at alternating times, using the remaining 50%.

[htop screenshots showing the three busy threads]

(PIDs 48, 53 and 64)

The Python profiler still shows nothing, so the issue is still in some native code: [profiler screenshot]

I tried to retrieve some information for such threads adding "gdb" to the docker image and performing a "thread apply all bt" command on the HA's python process, and all three threads report the library "/usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so" in their stacks, so there are still some other issues in the grpc library:

(PID 53)
Thread 44 (LWP 53 "python3"):
#0  0x76f6fa6e in syscall () from /lib/ld-musl-armhf.so.1
#1  0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(PID 48 - taken while using CPU)
Thread 39 (LWP 48 "python3"):
#0  0x76f6fa6e in syscall () from /lib/ld-musl-armhf.so.1
#1  0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

(PID 64 - taken while using CPU)
Thread 53 (LWP 64 "python3"):
#0  0x76f6fa6c in syscall () from /lib/ld-musl-armhf.so.1
#1  0x57a075e2 in ?? () from /usr/local/lib/python3.10/site-packages/grpc/_cython/cygrpc.cpython-310-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

I also tried enabling grpc logs (GRPC_TRACE=api and GRPC_VERBOSITY=info environment variables), but it seems the polling problem reported in https://github.com/googleapis/python-pubsub/issues/728 has been fixed, as the container's grpc library is 1.51.1 and the logs report a poll message every 200 ms, as they should:

I0105 11:25:21.678587881      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914321, tv_nsec: 878563767, clock_type: 1 }, reserved=0)
I0105 11:25:21.879211785      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 79186160, clock_type: 1 }, reserved=0)
I0105 11:25:22.080915739      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 280890114, clock_type: 1 }, reserved=0)
I0105 11:25:22.281308965      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 481280945, clock_type: 1 }, reserved=0)
I0105 11:25:22.483128597      83 completion_queue.cc:948]              grpc_completion_queue_next(cq=0x5c00a0e0, deadline=gpr_timespec { tv_sec: 1672914322, tv_nsec: 683102347, clock_type: 1 }, reserved=0)

so there should still be something wrong in the grpc implementation...
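For reference, a sketch of how the grpc tracing shown above is enabled. The variables are grpc's standard GRPC_TRACE/GRPC_VERBOSITY settings; the docker invocation and image name are assumptions to adapt to your setup:

```shell
# Enable verbose grpc API tracing before starting Home Assistant.
export GRPC_VERBOSITY=info
export GRPC_TRACE=api
# In docker, pass the same variables with -e instead (image name assumed):
#   docker run -e GRPC_VERBOSITY=info -e GRPC_TRACE=api \
#       ghcr.io/home-assistant/home-assistant:stable
```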

lukakama commented 1 year ago

As an addition, I tried to profile both the main HA process and the threads using the CPU with "strace -Tcfp ", and they always report a high "% time" for the futex syscall, along with a lot of errors:

(from main process)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 57.69   38.438107        2645     14527     13052 futex
 11.76    7.838140      712558        11         3 restart_syscall
 10.05    6.697177      148826        45           _newselect
  8.33    5.549870      168177        33           poll
  8.13    5.415075       13272       408           epoll_pwait
  3.76    2.506349          47     53049           clock_gettime64
  0.07    0.046078          88       521           write
  0.05    0.035340          66       533           _llseek
[...]

(from the thread constantly using the CPU)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.67   34.166324        2457     13904     12848 futex
 16.63   12.172873     1106624        11         4 restart_syscall
 16.35   11.971184      244309        49           poll
  8.70    6.367453      219567        29           _newselect
  7.78    5.698243       11086       514           epoll_pwait
  3.40    2.487209          47     52787           clock_gettime64
  0.20    0.145667         446       326           read
[...]

(from the thread using CPU intermittently)

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 51.00   22.481324        2182     10299      9492 futex
 20.65    9.102114      827464        11         4 restart_syscall
 14.25    6.281863      224352        28           _newselect
  8.63    3.802922        9320       408           epoll_pwait
  4.09    1.803817          46     39178           clock_gettime64
  1.12    0.491646        8779        56           poll
  0.04    0.019802         119       166        23 recvfrom
[...]

So, if I'm interpreting it correctly, this should confirm that there are still some issues in the grpc library's threading/polling management on ARM.
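The strace run above can be sketched like this (the thread id is hypothetical; take the real one from htop, e.g. 48, 53 or 64 in the screenshots earlier, and strace availability is assumed):

```shell
# Attach to one busy thread for 30 seconds and print the syscall
# time summary (-c), timing each call (-T) and following children (-f).
TID=53   # hypothetical: substitute a real thread id from htop
if command -v strace >/dev/null 2>&1; then
    timeout 30 strace -T -c -f -p "$TID" 2>&1 | tail -n 25
else
    echo "strace not installed"
fi
```

A futex-dominated summary with many errors, as shown above, usually means threads are spinning on a lock or repeatedly waking on already-expired timeouts rather than doing useful work.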

Photoexploration commented 1 year ago

Well, this is much worse for me now. Previously I was at 5% CPU without Nest, jumping to 25% with it enabled. I re-enabled Nest this morning and CPU jumps to more than 40%! On a raspberry pi 4

DrMikeyS commented 1 year ago

I have moved to an Intel NUC, and after the update the baseline CPU usage has dropped from 10% to 7%, which is very welcome.


tcc0 commented 1 year ago

Jumped from 20% to 35-40% here a while ago.

Too bad my Raspberry Pi 3 has been running at 65-70 degrees 24/7 for the last couple of months, just so I can control my thermostat once or twice a day; it should be working better.

Photoexploration commented 1 year ago

I have been seeing a jump from 5% CPU without the Nest integration to 40% with the integration.

I have been running HomeAssistantOS 32 bit on a raspberry pi 4. The other day I migrated my HAOS to the 64 bit version and my CPU usage stays at 5% while running the Nest integration. Yay!

I don't know if this is already known but I thought it may be helpful to someone.
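Since moving from a 32-bit to a 64-bit OS is what resolved it here, a quick sketch for checking which userland you are actually running:

```shell
# armv7l => 32-bit userland (affected); aarch64 => 64-bit
uname -m
# 32 or 64, for the C library the processes are built against
getconf LONG_BIT
# what Python itself reports
python3 -c 'import platform; print(platform.machine())'
```

Note that a 64-bit kernel can still run a 32-bit userland, so `getconf LONG_BIT` is the more telling check for this particular grpc build issue.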

allenporter commented 1 year ago

@lukakama would you be up for filing another issue with grpc? You have pretty good profile data, and may be able to give them the detail they need. (I don't currently have an RPi in my possession and haven't been able to get one due to the chip shortage)

lukakama commented 1 year ago

@lukakama would you be up for filing another issue with grpc? You have pretty good profile data, and may be able to give them the detail they need. (I don't currently have an RPi in my possession and haven't been able to get one due to the chip shortage)

Done. I'm also sharing another strace log showing an issue in futex usage (some sort of thread-synchronization locking, but with already-expired absolute timeouts), which should help them trace the source of the issue.

However, I think I will move to a 64-bit OS as soon as possible (thanks @Photoexploration!), to work around the problem and reduce my RPi's power usage.

rgerbranda commented 1 year ago

I switched to Raspberry Pi OS (64-bit) (thanks @Photoexploration) and now the Nest integration is working fine! The CPU load is back to normal.

tcc0 commented 1 year ago

Did the same 64-bit swap as rgerbranda; my CPU finally dropped to around 5% usage and the low 50s in temperature, for the first time since last summer.

issue-triage-workflows[bot] commented 1 year ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.

rjosborne commented 1 year ago

Still an issue - I am running a new install on a Raspberry Pi 4, hosted via Docker Compose.

Spotted high CPU last night and narrowed it down to the Nest integration. CPU is around 30% with Nest enabled, 1-2% with it disabled.

I assume we need to wait for a fix further upstream based on the comment history, but I just wanted to confirm that it's still an issue on the latest HA version.

gambalaya commented 1 year ago

This impacts me since I was using the old Works with Nest mode until support for it was removed in HA 2023.6.1. Of course, we only had until Sept 29 to migrate anyway.

I am running HA 2023.6.2 on a Raspberry Pi 2B / armv7 hosted via Docker Compose, and CPU has jumped from 0.5% to 20% in the new mode with the same Nest integration. As noted above, it appears we are waiting for an upstream fix.

dynasticorpheus commented 1 year ago

@allenporter @bdraco I just manually applied #100314 (Bump grpcio to 1.58.0) on 2023.9.2 and the CPU load (and therefore temperature) has come down dramatically on my armv7 device.

[screenshot: CPU load/temperature graph]

bdraco commented 1 year ago

That's great news. The lib bump will come out with 2023.10.x.

allenporter commented 10 months ago

Let's call this fixed as of 2023.10.x -- great that the bump fixed it.