home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.64k stars 29.94k forks source link

Nest integration high CPU usage on armv7 / raspberry pi in pubsub subscriber native code (fixed in 2023.10.x) #66983

Closed rgerbranda closed 9 months ago

rgerbranda commented 2 years ago

The problem

When I enable the Google Nest integration, I see a continous CPU load of python3 of about 65% Without this integration CPU load of python3 is just about 2%

Any recommendation to optimize the load of the Google Nest integration?

What version of Home Assistant Core has the issue?

2022.2.9

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant Container

Integration causing the issue

nest

Link to integration documentation on our website

https://www.home-assistant.io/integrations/nest

Diagnostics information

{
  "home_assistant": {
    "installation_type": "Home Assistant Container",
    "version": "2022.2.9",
    "dev": false,
    "hassio": false,
    "virtualenv": false,
    "python_version": "3.9.7",
    "docker": true,
    "arch": "armv7l",
    "timezone": "Europe/Brussels",
    "os_name": "Linux",
    "os_version": "5.10.63-v7l+",
    "run_as_root": true
  },
  "custom_components": {
    "config_editor": {
      "version": "3.0",
      "requirements": []
    },
    "hacs": {
      "version": "1.22.0",
      "requirements": [
        "aiogithubapi>=21.11.0"
      ]
    },
    "bosch": {
      "version": "0.17.3",
      "requirements": [
        "bosch-thermostat-client==0.17.3"
      ]
    },
    "afvalbeheer": {
      "version": "4.9.2",
      "requirements": [
        "rsa",
        "pycryptodome"
      ]
    },
    "authenticated": {
      "version": "21.9.0",
      "requirements": []
    }
  },
  "integration_manifest": {
    "domain": "nest",
    "name": "Nest",
    "config_flow": true,
    "dependencies": [
      "ffmpeg",
      "http",
      "media_source"
    ],
    "documentation": "https://www.home-assistant.io/integrations/nest",
    "requirements": [
      "python-nest==4.2.0",
      "google-nest-sdm==1.7.1"
    ],
    "codeowners": [
      "@allenporter"
    ],
    "quality_scale": "platinum",
    "dhcp": [
      {
        "macaddress": "18B430*"
      },
      {
        "macaddress": "641666*"
      },
      {
        "macaddress": "D8EB46*"
      },
      {
        "macaddress": "1C53F9*"
      }
    ],
    "iot_class": "cloud_push",
    "is_built_in": true
  },
  "data": {
    "subscriber": {
      "start": 1,
      "message_received": 7,
      "message_acked": 7
    },
    "devices": [
      {
        "data": {
          "name": "**REDACTED**",
          "type": "sdm.devices.types.DOORBELL",
          "assignee": "**REDACTED**",
          "traits": {
            "sdm.devices.traits.Info": {
              "customName": "**REDACTED**"
            },
            "sdm.devices.traits.CameraLiveStream": {
              "maxVideoResolution": {
                "width": 640,
                "height": 480
              },
              "videoCodecs": [
                "H264"
              ],
              "audioCodecs": [
                "AAC"
              ],
              "supportedProtocols": [
                "RTSP"
              ]
            },
            "sdm.devices.traits.CameraImage": {
              "maxImageResolution": {
                "width": 1920,
                "height": 1200
              }
            },
            "sdm.devices.traits.CameraPerson": {},
            "sdm.devices.traits.CameraSound": {},
            "sdm.devices.traits.CameraMotion": {},
            "sdm.devices.traits.CameraEventImage": {},
            "sdm.devices.traits.DoorbellChime": {}
          },
          "parentRelations": [
            {
              "parent": "**REDACTED**",
              "displayName": "**REDACTED**"
            }
          ]
        },
        "command": {
          "sdm.devices.commands.CameraLiveStream.GenerateRtspStream": 1
        },
        "event_media": {
          "event": 2,
          "event.new": 2,
          "event.fetch": 2,
          "fetch_image": 2,
          "fetch_image.skip": 2,
          "event.notify": 2
        }
      }
    ]
  }
}

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

probot-home-assistant[bot] commented 2 years ago

nest documentation nest source (message by IssueLinks)

probot-home-assistant[bot] commented 2 years ago

Hey there @allenporter, mind taking a look at this issue as it has been labeled with an integration (nest) you are listed as a code owner for? Thanks! (message by CodeOwnersMention)

allenporter commented 2 years ago

Hello, thank you for the report, and thank for you including the diagnostics, it is extremely helpful.

Quick question, how long was you home assistant instance active when the diagnostics data was downloaded? I'm trying to determine if this was a few seconds with of data/counters or many minutes to hours? (The activity is low, and does not explain the CPU usage)

allenporter commented 2 years ago

Also one other question, if you open the camera stream view in the frontend user interface is the Preload stream box checked?

rgerbranda commented 2 years ago

Hi @allenporter

The Home assistant instance was active for many minutes. Today I created a new instance with only the Nest integration. For this instance I see the same. CPU usage of about 65 to 70%

The Preload stream box is not checked.

Attached I have a zipped py-spy report of the instance that has only the Nest integration.

py-spy-2min.zip

allenporter commented 2 years ago

OK. what do you make of that report? It looks fairly empty to me:

So _worker is some generic unnamed thread, and we don't see any actual work happening, maybe since it is a future.

From what I can see there is one call to GenerateRtspStream which makes me think that the live stream is loaded and its processing the stream, or at least it loaded once for serving an image thumbnail.

halpatraz commented 2 years ago

Hi @allenporter -- I just wanted to add that I am seeing the same behavior described by @rgerbranda. CPU loads are sitting at 70-80% with the Nest integration enabled, but return to normal (<10%) if I disable it. I also have a doorbell.

I don't see anything obvious in the logs.

allenporter commented 2 years ago

Thank you. Can you confirm: Anything with debug logs turned on? Anything about streams at all? Can you also share you diagnostics information?

ibddot commented 2 years ago

I'm also seeing something similar. For me when the Nest integration is enabled, CPU usage is ~20% and with it disabled CPU usage is ~4%. My Nest account only has a thermostat though. I don't have a doorbell.

rgerbranda commented 2 years ago

Hi @allenporter

Attached I have the debug log

nest debug log.txt

allenporter commented 2 years ago

Thank you @rgerbranda -- I'm not seeing anything out of the ordinary here, however, the camera does appear reasonably busy with events (e.g. the camera has received >1k events in the last few days). I'm assuming you're using the home assistant frontend during this time since there are things like loading the media player or some other type of read of the events from disk, but again nothing beyond what I would expect.

Additional questions:

@ibddot and @halpatraz thank you for confirming you also see the issue i need additional information in the form of a much more detailed report including home assistant version, the architecture, any more details about what is being used, diagnostics information and logs if possible, as well as answering all of the questions about what is actually being used. I appreciate your time on this as i'm not able to reproduce this.

ibddot commented 2 years ago

Here is some more information on my system. I haven't found any errors or references in logs to Nest integration. Let me know what logs you are looking for and I'll try to get them. I had to use trial and error disabling things to find the source of the high CPU usage being the nest integration.

System Health

version core-2022.2.9
installation_type Home Assistant OS
dev false
hassio true
docker true
user root
virtualenv false
python_version 3.9.7
os_name Linux
os_version 5.10.92-v7
arch armv7l
timezone America/Los_Angeles
Home Assistant Cloud logged_in | false -- | -- can_reach_cert_server | ok can_reach_cloud_auth | ok can_reach_cloud | ok
Home Assistant Supervisor host_os | Home Assistant OS 7.4 -- | -- update_channel | stable supervisor_version | supervisor-2022.01.1 docker_version | 20.10.9 disk_total | 28.6 GB disk_used | 13.9 GB healthy | true supported | true board | rpi3 supervisor_api | ok version_api | ok installed_addons | File editor (5.3.3), Terminal & SSH (9.3.0), Samba share (9.5.1), Check Home Assistant configuration (3.10.0), NGINX Home Assistant SSL proxy (3.1.1), ESPHome (2022.2.5), Z-Wave JS to MQTT (0.35.1), Duck DNS (1.14.0)
Lovelace dashboards | 1 -- | -- resources | 0 views | 9 mode | storage
Nest Integration Diagnostics ```json { "home_assistant": { "installation_type": "Home Assistant OS", "version": "2022.2.9", "dev": false, "hassio": true, "virtualenv": false, "python_version": "3.9.7", "docker": true, "arch": "armv7l", "timezone": "America/Los_Angeles", "os_name": "Linux", "os_version": "5.10.92-v7", "supervisor": "2022.01.1", "host_os": "Home Assistant OS 7.4", "docker_version": "20.10.9", "chassis": "embedded", "run_as_root": true }, "custom_components": { "alexa_media": { "version": "3.10.15", "requirements": [ "alexapy==1.25.3", "packaging>=20.3", "wrapt>=1.12.1" ] } }, "integration_manifest": { "domain": "nest", "name": "Nest", "config_flow": true, "dependencies": [ "ffmpeg", "http", "media_source" ], "documentation": "https://www.home-assistant.io/integrations/nest", "requirements": [ "python-nest==4.2.0", "google-nest-sdm==1.7.1" ], "codeowners": [ "@allenporter" ], "quality_scale": "platinum", "dhcp": [ { "macaddress": "18B430*" }, { "macaddress": "641666*" }, { "macaddress": "D8EB46*" }, { "macaddress": "1C53F9*" } ], "iot_class": "cloud_push", "is_built_in": true }, "data": { "subscriber": { "start": 1, "message_received": 6, "message_acked": 6 }, "devices": [ { "data": { "name": "**REDACTED**", "type": "sdm.devices.types.THERMOSTAT", "assignee": "**REDACTED**", "traits": { "sdm.devices.traits.Info": { "customName": "**REDACTED**" }, "sdm.devices.traits.Humidity": { "ambientHumidityPercent": 43 }, "sdm.devices.traits.Connectivity": { "status": "ONLINE" }, "sdm.devices.traits.Fan": { "timerMode": "OFF" }, "sdm.devices.traits.ThermostatMode": { "mode": "HEAT", "availableModes": [ "HEAT", "OFF" ] }, "sdm.devices.traits.ThermostatEco": { "availableModes": [ "OFF", "MANUAL_ECO" ], "mode": "OFF", "heatCelsius": 15.646317, "coolCelsius": 24.444443 }, "sdm.devices.traits.ThermostatHvac": { "status": "HEATING" }, "sdm.devices.traits.Settings": { "temperatureScale": "CELSIUS" }, "sdm.devices.traits.ThermostatTemperatureSetpoint": { "heatCelsius": 19.847763 }, "sdm.devices.traits.Temperature": { "ambientTemperatureCelsius": 19.199997 } }, "parentRelations": [ { "parent": "**REDACTED**", "displayName": "**REDACTED**" } ] }, "event_media": { "event": 6 } } ] } } ```
allenporter commented 2 years ago

Is everyone seeing this using a raspberry pi? I'm wondering if there is something arm specific in grpc or cloud pubsub going off the rails, give those were upgraded in the last release and otherwise nest doesn't appear to be doing much. I can set up a dev environment to try to reproduce.

halpatraz commented 2 years ago

Yes, RPi 3B here. Using the Container, latest stable release.

rgerbranda commented 2 years ago

Also Raspberry PI here.

To compare I installed Home Assistant on an Intel device. CPU is almost 0 %

PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
192   189 root     S     396m   2%  11   0% python3 -m homeassistant --config /config
allenporter commented 2 years ago

I was able to reproduce this on armv7.

W/o nest:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                       
16573 root      20   0  225024 132616  28392 S  11.3   6.9   0:19.18 python3                       

nest enabled:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                       
17095 root      20   0  245936 138836  30724 S  61.5   7.2   0:49.52 python3 

I had to do some restarts in between. Thanks folks, that is helpful, now I think I have enough to investigate further.

allenporter commented 2 years ago

When I had a closer look and broke down by threads, they are just idle threads.

And really there is minimal difference between nest and not nest except for another idle subscriber thread or maybe two:

py-spy-no-nest

py-spy-nest

So i think the CPU usage may not be happening in python at all. My only thought is that something different is happening in cython with the upgrade of grpc. Will look at that next.

allenporter commented 2 years ago

I ran https://www.home-assistant.io/integrations/profiler/ which does not yet seem to produce anything super interesting. 90% of the cpu usage is in select/poll. profile.zip

allenporter commented 2 years ago

I think the next steps here are to investigate and profile the cython side of things, since the python side just has idle threads.

rgerbranda commented 2 years ago

I played with heartrate a bit, but it gave not much insight

allenporter commented 2 years ago

I saw some fixes in grpc 1.45 about default polling strategy, so as soon as that is released on pypi we can give that a try and see if it helps

allenporter commented 2 years ago

As of home assistant 2022.5.4 I see much lower cpu usage:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                       
 5457 root      20   0  234352 126264  30252 S  27.8   6.6   0:26.76 python3  

Curious if folks are also seeing an improvement in 2022.5.X releases that includes grpc 1.45.0?

There are also some polling strategy changes in grpc 1.45.1 which I am curious to see the impact of. I've sent a PR to update to 1.46.1 to include those changes in the next home assistant release.

rgerbranda commented 2 years ago

I'm also running 2022.5.4 now, the CPU usage of python3 is around 40% with only the Nest integration enabled.

2022-05-16 09_56_28-Clipboard .

Photoexploration commented 2 years ago

I'm still seeing high CPU usage of around 20% when it is usually 1% without the Nest integration. Maybe it's a bit better on 2022.5.4 for me as it seemed to be around 30% CPU use previously.

allenporter commented 2 years ago

Curious to see now how this looks for you now in Home Assistant 2022.6 which includes grpcio==1.46.1?

I'll do some additional testing but wanted to get your take.

Photoexploration commented 2 years ago

Still at around 20-25% CPU usage...

rgerbranda commented 2 years ago

Same here, around 25%

allenporter commented 2 years ago

I did a bunch of further profiling and anecdotally looking at threads and what I see I believe is related to underlying details of grpc polling.

When i have an example python3 program that starts a python pubsub subscriber it starts 12 threads that are busy

$ GRPC_VERBOSITY=debug python3 ./example.py --project_id=XXX --subscription_id=YYY
I0701 07:36:43.961390731    2330 ev_epoll1_linux.cc:121]     grpc epoll fd: 3
D0701 07:36:43.961481969    2330 ev_posix.cc:171]            Using polling engine: epoll1
...

Then cpu usage goes up and it seems like there are a lot of busy threads doing stuff, e.g. they should be just sitting waiting for epoll but are very busy.

$ sudo strace -fp 32379
....
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out)
[pid 32393] syscall_0x193(0x1, 0xb3e739c0, 0x1, 0xb3e739c0, 0x193, 0xb3e73b40) = 0
[pid 32393] syscall_0x193(0, 0xb3e739b0, 0, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739b0, 0x1, 0xb3e739b0, 0x193, 0) = 0
[pid 32393] syscall_0x193(0x1, 0xb3e739f0, 0x1, 0xb3e739f0, 0x193, 0xb644f424) = 0
[pid 32393] futex(0xb46f7c30, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1656686216, tv_nsec=0}, FUTEX_BITSET_MATCH_ANY
allenporter commented 2 years ago

When in run the example with grpc debug turned on i'm seeing a call to grpc_completion_queue_next every few milliseconds?

$ GRPC_VERBOSITY=debug GRPC_TRACE=api python3 ./example.py --project_id=woodgreen-local-developer --subscription_id=home-assistant-Q8P60ivNwM
...
I0701 07:47:57.186653970    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 186645877, clock_type: 1 }, reserved=0)
I0701 07:47:57.187771076    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 187762947, clock_type: 1 }, reserved=0)
I0701 07:47:57.188889998    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 188882331, clock_type: 1 }, reserved=0)
I0701 07:47:57.190008679    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 190000271, clock_type: 1 }, reserved=0)
I0701 07:47:57.191126822    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 191118508, clock_type: 1 }, reserved=0)
I0701 07:47:57.192245762    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 192237596, clock_type: 1 }, reserved=0)
I0701 07:47:57.193362258    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 193354165, clock_type: 1 }, reserved=0)
I0701 07:47:57.194478827    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 194471068, clock_type: 1 }, reserved=0)
I0701 07:47:57.195596101    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 195587804, clock_type: 1 }, reserved=0)
I0701 07:47:57.196711522    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 196703485, clock_type: 1 }, reserved=0)
I0701 07:47:57.197829092    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 197820814, clock_type: 1 }, reserved=0)
I0701 07:47:57.198944161    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 198936698, clock_type: 1 }, reserved=0)
I0701 07:47:57.200061953    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 200053138, clock_type: 1 }, reserved=0)
I0701 07:47:57.201180023    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 201172134, clock_type: 1 }, reserved=0)
I0701 07:47:57.202298833    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 202290611, clock_type: 1 }, reserved=0)
I0701 07:47:57.203415569    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 203407532, clock_type: 1 }, reserved=0)
I0701 07:47:57.204532194    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 204523750, clock_type: 1 }, reserved=0)
I0701 07:47:57.205648449    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 205640709, clock_type: 1 }, reserved=0)
I0701 07:47:57.206764926    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 206756630, clock_type: 1 }, reserved=0)
I0701 07:47:57.207881996    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 207874033, clock_type: 1 }, reserved=0)
I0701 07:47:57.208998732    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 208990769, clock_type: 1 }, reserved=0)
I0701 07:47:57.210118191    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 210109468, clock_type: 1 }, reserved=0)
I0701 07:47:57.211237149    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 211228353, clock_type: 1 }, reserved=0)
I0701 07:47:57.212356441    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 212348348, clock_type: 1 }, reserved=0)
I0701 07:47:57.213474251    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 213465899, clock_type: 1 }, reserved=0)
I0701 07:47:57.214591043    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 214582950, clock_type: 1 }, reserved=0)
I0701 07:47:57.215708446    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 215700076, clock_type: 1 }, reserved=0)
I0701 07:47:57.216826997    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 216819330, clock_type: 1 }, reserved=0)
I0701 07:47:57.217945974    2482 completion_queue.cc:961]    grpc_completion_queue_next(cq=0xb48a2bf0, deadline=gpr_timespec { tv_sec: 1656686877, tv_nsec: 217936974, clock_type: 1 }, reserved=0)
allenporter commented 2 years ago

When I run the sample example on an intel machine, i only see this every tens/hundreds of milliseconds:

I0701 08:07:52.176668681  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688072, tv_nsec: 376658335, clock_type: 1 }, reserved=(nil))
I0701 08:07:52.378082235  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688072, tv_nsec: 578073562, clock_type: 1 }, reserved=(nil))
I0701 08:07:52.579491408  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688072, tv_nsec: 779483197, clock_type: 1 }, reserved=(nil))
I0701 08:07:52.780896218  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688072, tv_nsec: 980887103, clock_type: 1 }, reserved=(nil))
I0701 08:07:52.982330129  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688073, tv_nsec: 182321812, clock_type: 1 }, reserved=(nil))
I0701 08:07:53.183738957  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688073, tv_nsec: 383732918, clock_type: 1 }, reserved=(nil))
I0701 08:07:53.385128496  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688073, tv_nsec: 585119089, clock_type: 1 }, reserved=(nil))
I0701 08:07:53.585531261  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688073, tv_nsec: 785522039, clock_type: 1 }, reserved=(nil))
I0701 08:07:53.786987125  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688073, tv_nsec: 986978495, clock_type: 1 }, reserved=(nil))
I0701 08:07:53.988393263  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688074, tv_nsec: 188383389, clock_type: 1 }, reserved=(nil))
I0701 08:07:54.189824787  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688074, tv_nsec: 389816478, clock_type: 1 }, reserved=(nil))
I0701 08:07:54.391226093  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688074, tv_nsec: 591217373, clock_type: 1 }, reserved=(nil))
I0701 08:07:54.591705781  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688074, tv_nsec: 791696467, clock_type: 1 }, reserved=(nil))
I0701 08:07:54.793094829  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688074, tv_nsec: 993086806, clock_type: 1 }, reserved=(nil))
I0701 08:07:54.994482102  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688075, tv_nsec: 194472856, clock_type: 1 }, reserved=(nil))
I0701 08:07:55.195892478  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688075, tv_nsec: 395885849, clock_type: 1 }, reserved=(nil))
I0701 08:07:55.397302794  266838 completion_queue.cc:962]    grpc_completion_queue_next(cq=0x2d063d0, deadline=gpr_timespec { tv_sec: 1656688075, tv_nsec: 597293498, clock_type: 1 }, reserved=(nil))
^CI0701 08:07:55.545513133  266830 call.cc:648]                grpc_call_cancel_with_status(c=0x7f73b4001ab0, status=1, description=Channel closed!, reserved=(nil))
...
allenporter commented 2 years ago

So i think this indicates to me this is either grpc or streaming pubsub armv7 issue

allenporter commented 2 years ago

I have an example repro case based on the python pubsub doc example:

"""Example reproducing grpc cpu issues."""

import argparse
import logging

from concurrent.futures import TimeoutError
from google.cloud import pubsub_v1

parser = argparse.ArgumentParser(
    description="Command line tool for Google Nest SDM API"
)
parser.add_argument("--project_id", required=True, help="Project id")
parser.add_argument("--subscription_id", required=True, help="Subscription id")
parser.add_argument("--timeout", default=90, help="Duration to listen for messages")
parser.add_argument( '-log',
                     '--loglevel',
                     default='warning',
                     help='Provide logging level. Example --loglevel debug, default=warning' )

TIMEOUT = 90

def main() -> None:
    """Nest command line tool."""
    args: argparse.Namespace = parser.parse_args()
    logging.basicConfig( level=args.loglevel.upper() )

    subscriber = pubsub_v1.SubscriberClient()
    # The `subscription_path` method creates a fully qualified identifier
    # in the form `projects/{args.project_id}/subscriptions/{args.subscription_id}`
    subscription_path = subscriber.subscription_path(args.project_id, args.subscription_id)

    def callback(message: pubsub_v1.subscriber.message.Message) -> None:
        print(f"Received {message}.")
        message.ack()

    streaming_pull_future = subscriber.subscribe(subscription_path, callback=callback)
    print(f"Listening for messages on {subscription_path}..\n")

    # Wrap subscriber in a 'with' block to automatically call close() when done.
    with subscriber:
        try:
            # When `timeout` is not set, result() will block indefinitely,
            # unless an exception is encountered first.
            streaming_pull_future.result(timeout=args.timeout)
        except TimeoutError:
            streaming_pull_future.cancel()  # Trigger the shutdown.
            streaming_pull_future.result()  # Block until the shutdown is complete.

Run with:

$ export GOOGLE_APPLICATION_CREDENTIALS=/tmp/service-account.json
$ GRPC_VERBOSITY=debug GRPC_TRACE=api python3 /tmp/example.py --project_id=XXXXX--subscription_id=YYYY --loglevel=debug 
allenporter commented 2 years ago

I've filed https://github.com/googleapis/python-pubsub/issues/728 upstream with pubsub though i suspect this is really a grpc issue.

rgerbranda commented 2 years ago

Thanks for the good work.

tcc0 commented 2 years ago

Any updates on this ? My pi cpu is still running at 70g celcius all day doing nothing

allenporter commented 2 years ago

See my last update above for the latest status

tcc0 commented 2 years ago

They ignore you I see.

Anyway, when restarting HA the cpu usage is lower than 20%, after a while it's generating more cpu, still very annoying and doesn't look like the problem's gonna be resolved anytime soon

tcc0 commented 2 years ago

Can you bump some more ? My CPU is running 75,8c doing nothing 24/7 (!) :(

jamie01 commented 2 years ago

Adding to this thread just as an additional data point, although the effects I notice are far less obvious.

I am using HAOS as a VM on an Intel machine, through KVM. So no ARM involved. I notice that the CPU usage of the VM as reported by the host is consistently between 5-10% when the Nest integration is enabled, while it sits around 1% with it disabled.

I tracked this down simply because the power usage of the server is increased somewhat due to a significantly higher number of wake cycles on the VM when Nest is enabled, so although the increased CPU usage sounds minimal it has a meaningful impact above and beyond the dozen or so other integrations I have enabled.

My profile data is similarly empty like that posted earlier, consisting mainly of epoll events and no clear difference between the Nest integration being enabled/disabled.

tomhampshire commented 2 years ago

Hi, I am having a very similar issue (high CPU usage when the nest integration is on). I'm running the homeassistant docker image on a 32-bit rPi4. I've noticed we both have the nest doorbell - perhaps this is causing the issue? I've attached my diagnostics: EDIT: as an aside, my doorbell (‎GWX3T) is on battery and seems to be "streaming" an empty, black RTC video stream, which seems to affect the battery life.

{
  "home_assistant": {
    "installation_type": "Home Assistant Container",
    "version": "2022.8.7",
    "dev": false,
    "hassio": false,
    "virtualenv": false,
    "python_version": "3.10.5",
    "docker": true,
    "arch": "armv7l",
    "timezone": "Europe/London",
    "os_name": "Linux",
    "os_version": "5.10.103-v7l+",
    "run_as_root": true
  },
  "custom_components": {},
  "integration_manifest": {
    "domain": "nest",
    "name": "Nest",
    "config_flow": true,
    "dependencies": [
      "ffmpeg",
      "http",
      "application_credentials",
      "repairs"
    ],
    "after_dependencies": [
      "media_source"
    ],
    "documentation": "https://www.home-assistant.io/integrations/nest",
    "requirements": [
      "python-nest==4.2.0",
      "google-nest-sdm==2.0.0"
    ],
    "codeowners": [
      "@allenporter"
    ],
    "quality_scale": "platinum",
    "dhcp": [
      {
        "macaddress": "18B430*"
      },
      {
        "macaddress": "641666*"
      },
      {
        "macaddress": "D8EB46*"
      }
    ],
    "iot_class": "cloud_push",
    "loggers": [
      "google_nest_sdm",
      "nest"
    ],
    "is_built_in": true
  },
  "data": {
    "subscriber": {
      "start": 1
    },
    "devices": [
      {
        "data": {
          "name": "**REDACTED**",
          "type": "sdm.devices.types.DOORBELL",
          "assignee": "**REDACTED**",
          "traits": {
            "sdm.devices.traits.Info": {
              "customName": "**REDACTED**"
            },
            "sdm.devices.traits.CameraLiveStream": {
              "videoCodecs": [
                "H264"
              ],
              "audioCodecs": [
                "OPUS"
              ],
              "supportedProtocols": [
                "WEB_RTC"
              ]
            },
            "sdm.devices.traits.CameraImage": {
              "maxImageResolution": {
                "width": 1920,
                "height": 1200
              }
            },
            "sdm.devices.traits.CameraPerson": {},
            "sdm.devices.traits.CameraMotion": {},
            "sdm.devices.traits.DoorbellChime": {},
            "sdm.devices.traits.CameraClipPreview": {}
          },
          "parentRelations": [
            {
              "parent": "**REDACTED**",
              "displayName": "**REDACTED**"
            }
          ]
        }
      }
    ],
    "camera": {
      "camera.doorbell": {}
    }
  }
}
tcc0 commented 2 years ago

I don't have a doorbell, but do have a thermostat and cam iq outdoor

allenporter commented 2 years ago

Hi folks, we have enough information here as mentioned above. Underlying issue is: https://github.com/googleapis/python-pubsub/issues/728

tcc0 commented 1 year ago

Is this ever going to be fixed ?

"To be frank, I haven't really looked into this issue in much detail" doesn't sound very hopefull

Thnx

michaeldpj commented 1 year ago

I am also experiencing this as I have 4 Nest cameras and my Pi runs with a high cpu python3 process (80%-100%). It's unfortunate to see little to no interest in trying to fix this from Google.

cmrenus commented 1 year ago

+1 to keep alive I have a few hubs, cameres, 2 thermostats and doorbell and my load average is in the 50s on an rpi 4.

Previously was running on a 3b+ and things ran fine. So not sure why the hardware change made a difference but only started happening after then

allenporter commented 1 year ago

It seems like a low level bug related to timers specific to the instruction set. Likely not the biggest use case for the python library, and likely requires expertise, interest, and ownership of one of these devices to be able to reproduce. It's an open source project after all.

tcc0 commented 1 year ago

@frenck help?

allenporter commented 1 year ago

@tcc0 I don't think that's the right answer

tcc0 commented 1 year ago

The right answer is I disabled the addon for now, running 100% cpu for months on a row is unacceptable, hoped a little more effort from the devs

allenporter commented 1 year ago

Hey! Remember this is a volunteer project. I bunch a pretty significant amount of effort into diagnosing and profiling this, finding a potential cause, and filed the issue in the upstream project.