home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0
71.14k stars 29.81k forks source link

Roborock Integration `call_service` causes HomeAssistent stacktraces and AppDaemon "Internal Error" #124193

Open GaryBoone opened 3 weeks ago

GaryBoone commented 3 weeks ago

The problem

In an AppDaemon app, I have lines like:

    try:
      self.call_service("time/set_value", entity_id = "time.roborock_s7_maxv_do_not_disturb_begin", time = "20:00:00")
      self.call_service("time/set_value", entity_id = "time.roborock_s7_maxv_do_not_disturb_end", time = "08:00:00")
      self.call_service("switch/turn_on", entity_id = "switch.roborock_s7_maxv_do_not_disturb")
    except Exception as e:
      self.logger.error(f"Vacuum command failed: {str(e)}")

Although these commands often work, they often fail silently, meaning that they fail to raise an exception that the above try/except catches.

Instead, the appdaemon.log shows:

2024-08-18 15:31:05.300690 WARNING HASS: Error calling Home Assistant service default/time/set_value
2024-08-18 15:31:05.304254 WARNING HASS: Code: 500, error: 500 Internal Server Error

Server got itself in trouble

In the homeassistant.log, I see full stacktraces ending in errors like:

roborock.exceptions.RoborockTimeout: id=574409 Timeout after 4 seconds

[See the complete stacktrace below.]

There are at least 3 bugs here:

1) The Roborock API has a problem with the calls, or their inputs, or takes too long to respond, leading to the stacktrace shown below. If there's a problem with inputs, it fails to validate them.

2) The Roborock API, the HomeAssistant core, or the AppDaemon AddOn doesn't handle timeouts without bombing.

3) The problem appears in the Roborock AddIn code, but doesn't propagate to my AppDaemon code where I can handle it.

What version of Home Assistant Core has the issue?

core-2024.8.2

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Roborock

Link to integration documentation on our website

https://www.home-assistant.io/integrations/roborock/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

homeassistant.log:

2024-08-18 15:31:05.290 ERROR (MainThread) [aiohttp.server] Error handling request
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/site-packages/aiohttp/web_protocol.py", line 456, in _handle_request
    resp = await request_handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/web_app.py", line 537, in _handle
    resp = await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/aiohttp/web_middlewares.py", line 114, in impl
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/security_filter.py", line 92, in security_filter_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/forwarded.py", line 83, in forwarded_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/request_context.py", line 26, in request_context_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/ban.py", line 85, in ban_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/auth.py", line 242, in auth_middleware
    return await handler(request)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/http/headers.py", line 32, in headers_middleware
    response = await handler(request)
               ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/http.py", line 73, in handle
    result = await handler(request, **request.match_info)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/api/__init__.py", line 430, in post
    response = await shield(
               ^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/core.py", line 2763, in async_call
    response_data = await coro
                    ^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/core.py", line 2806, in _execute_service
    return await target(service_call)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 999, in entity_service_call
    single_response = await _handle_entity_call(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/helpers/service.py", line 1071, in _handle_entity_call
    result = await task
             ^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/time/__init__.py", line 35, in _async_set_value
    return await entity.async_set_value(service_call.data[ATTR_TIME])
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/src/homeassistant/homeassistant/components/roborock/time.py", line 175, in async_set_value
    await self.entity_description.update_value(
  File "/usr/local/lib/python3.12/site-packages/roborock/version_1_apis/roborock_client_v1.py", line 111, in update_value
    response = await self.api._send_command(self.attribute.set_command, params)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/roborock/version_1_apis/roborock_local_client_v1.py", line 35, in _send_command
    return await self.send_message(roborock_message)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/roborock/version_1_apis/roborock_local_client_v1.py", line 57, in send_message
    (response, err) = await async_response
                      ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/roborock/api.py", line 108, in _wait_response
    raise RoborockTimeout(f"id={request_id} Timeout after {self.queue_timeout} seconds") from None
roborock.exceptions.RoborockTimeout: id=574409 Timeout after 4 seconds

Additional information

As a workaround, I have code to retry the call_service calls, checking that the updated values are actually set. Until the issues reported above are fixed, others my find this helpful in creating reliable code:

  # Send `do not disturb` commands to the robot vacuum, checking that they are actually set 
  # to the new values. Works _only_ with the `do not disturb` commands. Requires that a variable 
  # `self.max_retries` is set.
  # Uses the RoboRock API function `call_service`:
  # https://appdaemon.readthedocs.io/en/latest/AD_API_REFERENCE.html#appdaemon.entity.Entity.call_service
  def command_robot(self, command, **kwargs):
    try:
      attempts = 0;
      while attempts < self.max_retries:
        self.logger.info("Sending vacuum command '{command}' with kwargs '{kwargs}'...")
        self.call_service(command, **kwargs)

        if 'entity_id' not in kwargs:
          self.logger.error(f"Vacuum command '{command}' with kwargs '{kwargs}' doesn't have an entity_id")
        entity_id = kwargs.get('entity_id', 'none')

        if entity_id == "time.roborock_s7_maxv_do_not_disturb_begin" or entity_id == "time.roborock_s7_maxv_do_not_disturb_end":
          if 'time' not in kwargs:
            self.logger.error(f"Vacuum command '{command}' with kwargs '{kwargs}' doesn't have a time")
          time_value = kwargs.get('time', 'none')
          if self.read_state(entity_id) == time_value:
            return

        if command == "switch/turn_on":
          if self.read_state("vacuum_mgr.roborock_dnd_enabled") == "on":
            return
        elif command == "switch/turn_off":
          if self.read_state("vacuum_mgr.roborock_dnd_enabled") == "off":
            return

        attempts = attempts + 1
        self.logger.error(f"Vacuum command attempt {attempts}' will be retried: {command}' with kwargs '{kwargs}'.")
        time.sleep(5)
    except Exception as e:
      self.logger.error(f"Vacuum command '{command}' with kwargs '{kwargs}' failed: {str(e)}")

    if attempts >= self.num_retries:
      self.logger.error(f"Vacuum command failed after {attempts} attempts: '{command}' with kwargs '{kwargs}' failed: {str(e)}")
home-assistant[bot] commented 3 weeks ago

Hey there @lash-l, mind taking a look at this issue as it has been labeled with an integration (roborock) you are listed as a code owner for? Thanks!

Code owner commands Code owners of `roborock` can trigger bot actions by commenting: - `@home-assistant close` Closes the issue. - `@home-assistant rename Awesome new title` Renames the issue. - `@home-assistant reopen` Reopen the issue. - `@home-assistant unassign roborock` Removes the current integration label and assignees on the issue, add the integration domain after the command. - `@home-assistant add-label needs-more-information` Add a label (needs-more-information, problem in dependency, problem in custom component) to the issue. - `@home-assistant remove-label needs-more-information` Remove a label (needs-more-information, problem in dependency, problem in custom component) on the issue.

(message by CodeOwnersMention)


roborock documentation roborock source (message by IssueLinks)

allenporter commented 3 weeks ago

Thanks for the detailed report. Regarding your 3 problems:

(1) the error appears as a timeout of 4 seconds. It looks like the local client sets a 4 second connection and the cloud client sets a 10 second connection. I don't have an opinion about whether or not that is too short in practice, but i'm include to leave this as is. Perhaps it could be increased. (Would rather set a higher timeout if we are willing to wait longer rather than retrying)

2 and 3 are the same issue where its not raising an error that home assistant can handle. That said, I don't have an y opinion on app daemon specifics since its not part of home assistant core.

I'm going to address 2/3 by sending a PR to improve error handling for some exception cases for these entities/services to match the behavior in some of the other entities/services.

Lash-L commented 3 weeks ago

Hey Allen, thanks for taking a look at this - as context - local requests always go through in under 2 seconds for me, hence the 4s timeout. Typically, the problem I have run into in the past isn't due to timeout but rather another error along the way. The python library has some shortcomings here, but if I'm remembering correctly, you can't really know if the request failed as it doesn't send a response.( I could be misremembering it has been a while since I've looked at that bit of the code and the requests)

As well, if one requests fails while others are also waiting for a response, they will all fail iirc.

Both of those things could be wrong and could potentially be something I've improved and forgotten about but just thought I'd give the context

allenporter commented 3 weeks ago

While i was testing with a local home assistant i saw one from the update coordinator fail taking longer than 4 seconds.

2024-08-20 03:55:07.528 ERROR (MainThread) [homeassistant.components.roborock.coordinator] Error fetching roborock data: id=25747 Timeout after 4 seconds
2024-08-20 03:55:38.675 INFO (MainThread) [homeassistant.components.roborock.coordinator] Fetching roborock data recovered

However maybe the wifi was flaky...

Lash-L commented 3 weeks ago

Yeah interesting. It's been my intention for a while to keep track of how long calls are taking and return that in diagnostic info. Potentially even dynamically change timeout and add in re attempting api calls. There is even a flag with commands along the lines of retry iirc but I have not fully reverse engineered it.

allenporter commented 3 weeks ago

If we should think about this more like a publish/subscribe system then maybe longer term the model needs to change a bit. Send the command and don't necessarily block on a result, but update state later once the message is received. (That said, i don't really understand the internals here yet)

allenporter commented 3 weeks ago

The issue about common/frequent timeouts we can track in the other existing issue #98013 so we'll treat this is about error handling.