home-assistant / supervisor

:house_with_garden: Home Assistant Supervisor
https://home-assistant.io/hassio/
Apache License 2.0

Multiple simultaneous updates fail, system locks up, reboot required #4341

Closed lutusp closed 1 year ago

lutusp commented 1 year ago

Describe the issue you are experiencing

If an attempt is made to perform more than one update simultaneously, the host system crashes and requires a reboot.

I've edited this report to include more detail. But the remedy for this bug is simple -- never allow more than one simultaneous update. At the beginning of any update, set a flag that prevents any additional updates until the single update is complete.
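The flag idea can be sketched as follows. This is a minimal illustration assuming an asyncio-based updater; `UpdateGuard`, `UpdateInProgressError`, and `fake_update` are hypothetical names and not part of the Supervisor code base.

```python
import asyncio


class UpdateInProgressError(RuntimeError):
    """Raised when an update is requested while another one is running."""


class UpdateGuard:
    """Hypothetical guard that allows at most one update at a time."""

    def __init__(self) -> None:
        self._busy = False  # the "flag" described above

    async def run(self, name: str, update) -> str:
        # There is no await between the check and the set, so no other
        # coroutine on the same event loop can slip in between them.
        if self._busy:
            raise UpdateInProgressError(f"{name}: another update is running")
        self._busy = True
        try:
            await update()
            return f"{name}: done"
        finally:
            self._busy = False  # always clear the flag, even on failure


async def demo() -> list[str]:
    guard = UpdateGuard()

    async def fake_update() -> None:
        await asyncio.sleep(0.01)  # stand-in for real update work

    results = await asyncio.gather(
        guard.run("os", fake_update),
        guard.run("zwave", fake_update),
        return_exceptions=True,
    )
    return [str(r) for r in results]
```

In this sketch the second request is rejected rather than delayed, so the caller would have to retry it once the first update has finished.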

What operating system image do you use?

generic-x86-64 (Generic UEFI capable x86-64 systems)

What version of Home Assistant Operating System is installed?

Home Assistant OS 10.2

Did you upgrade the Operating System?

Yes

Steps to reproduce the issue

  1. Try to perform more than one update at the same time.
  2. Kick yourself for making an unwarranted assumption about an action that HA (a) doesn't prevent but (b) cannot tolerate.

Anything in the Supervisor logs that might be useful for us?

Checked the Supervisor log: nothing relevant, possibly because the entries from the time of the event had already expired.

Anything in the Host logs that might be useful for us?

Checked the Host log: nothing relevant, possibly because the entries from the time of the event had already expired.

System information


version | core-2023.5.4
-- | --
installation_type | Home Assistant OS
dev | false
hassio | true
docker | true
user | root
virtualenv | false
python_version | 3.10.11
os_name | Linux
os_version | 6.1.29
arch | x86_64
timezone | America/Los_Angeles
config_dir | /config

<details><summary>Home Assistant Cloud</summary>

logged_in | false
-- | --
can_reach_cert_server | ok
can_reach_cloud_auth | ok
can_reach_cloud | ok

</details>

<details><summary>Home Assistant Supervisor</summary>

host_os | Home Assistant OS 10.2
-- | --
update_channel | stable
supervisor_version | supervisor-2023.04.1
agent_version | 1.5.1
docker_version | 23.0.6
disk_total | 3666.9 GB
disk_used | 6.5 GB
healthy | true
supported | true
board | generic-x86-64
supervisor_api | ok
version_api | ok
installed_addons | Terminal & SSH (9.7.1), Duck DNS (1.15.0), Mosquitto broker (6.2.1), Z-Wave JS (0.1.83)

</details>

<details><summary>Dashboards</summary>

dashboards | 1
-- | --
resources | 0
views | 5
mode | storage

</details>

<details><summary>Recorder</summary>

oldest_recorder_run | May 26, 2023 at 16:05
-- | --
current_recorder_run | June 3, 2023 at 09:38
estimated_db_size | 76.26 MiB
database_engine | sqlite
database_version | 3.40.1

</details>

Additional information

I am now on-site and have edited this bug report. Unfortunately, that is days after the described event, and the available logs no longer cover the time it occurred.

agners commented 1 year ago

As with every system, there are resource constraints obviously. I quite often update multiple add-ons at once without problems (on a Yellow with 8GB of RAM). From my view, disallowing simultaneous updates generally would be a bad solution. It would be interesting to understand how/why the system crashed (I assume out of memory, but it could be something else). How much memory does your system have?

In any case, updates are coordinated by the Supervisor, so I've transferred this to the Supervisor repository.

lutusp commented 1 year ago

> As with every system, there are resource constraints obviously. I quite often update multiple add-ons at once without problems (on a Yellow with 8GB of RAM). From my view, disallowing simultaneous updates generally would be a bad solution.

On the contrary, if this action can cause system lockups under any circumstances whatsoever, then preventing parallel updates is an obvious solution. An alternative would be to queue pending updates and only allow one to proceed at a time. In this case, all three updates proceeded in parallel, which in retrospect seems like a recipe for disaster.

I assumed that elementary precautions were in place, such as described above, but apparently not.
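The queueing alternative could look roughly like this: a single worker takes pending updates off a queue and runs them strictly one at a time. It is only a sketch under assumed names (`update_worker`, `fake_update`) and does not reflect how Supervisor actually schedules updates.

```python
import asyncio


async def update_worker(queue: asyncio.Queue, log: list[str]) -> None:
    """Run queued updates one at a time, in the order they were requested."""
    while True:
        name, update = await queue.get()
        log.append(f"start {name}")
        try:
            await update()
            log.append(f"end {name}")
        except Exception as err:
            # A failed update must not stop the remaining ones.
            log.append(f"failed {name}: {err}")
        finally:
            queue.task_done()


async def demo() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    log: list[str] = []

    async def fake_update() -> None:
        await asyncio.sleep(0.01)  # stand-in for real update work

    worker = asyncio.create_task(update_worker(queue, log))
    for name in ("os", "supervisor", "zwave"):
        queue.put_nowait((name, fake_update))

    await queue.join()  # wait until every queued update has been processed
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return log
```

Because only one worker consumes the queue, each update starts only after the previous one has ended, regardless of how many were requested at once.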

> It would be interesting to understand how/why the system crashed (I assume out of memory, but it could be something else). How much memory does your system have?

The system is an Intel NUC NUC8i7BEH with 32 GB of RAM. HA is the only installed system.

> In any case, updates are coordinated by the Supervisor, so I've transferred this to the Supervisor repository.

I assumed updates were performed at a more basic level, since not all HA installs have a supervisor. But thanks for moving this to a more appropriate domain.

mdegat01 commented 1 year ago

I also update multiple add-ons at the same time regularly and I've never seen this. What were you trying to update at the same time?

> I assumed updates were performed at a more basic level, since not all HA installs have a supervisor.

Only installs with supervisor are capable of updating themselves. On installs without supervisor users must manage updates of Home Assistant on their own. And there are no addons to update.

mdegat01 commented 1 year ago

> An alternative would be to queue pending updates and only allow one to proceed at a time. In this case, all three updates proceeded in parallel, which in retrospect seems like a recipe for disaster.

Just to note - yes this is correct. I mean as a bugfix I'd probably just reject update requests if another one was already pending. But given more time to implement a proper enhancement, an update queue makes sense.

That being said, I would prefer to allow updates to proceed in parallel, since I was under the impression they already could (and I do use that feature). So steps to reproduce would really help.

lutusp commented 1 year ago

On Mon, Jun 5, 2023 at 9:00 AM Mike Degatano @.***> wrote:

> I also update multiple add-ons at the same time regularly and I've never seen this. What were you trying to update at the same time?

HAOS, Supervisor, and the Z-Wave driver. I had been traveling and some time had passed since I had examined the state of the system. They all needed updating, so I made an assumption about the system's ability to manage multiple updates. Within the hour, the system went offline. After I returned and had physical access, I tried shutting the system down with a long power-button press, but even that didn't work. I had to remove power from the system.

Then, to my great annoyance, no relevant log entries.


-- Paul Lutus http://arachnoid.com

lutusp commented 1 year ago

On Mon, Jun 5, 2023 at 11:31 AM Mike Degatano @.***> wrote:

> > An alternative would be to queue pending updates and only allow one to proceed at a time. In this case, all three updates proceeded in parallel, which in retrospect seems like a recipe for disaster.

> Just to note - yes this is correct. I mean as a bugfix I'd probably just reject update requests if another one was already pending. But given more time to implement a proper enhancement, an update queue makes sense.
>
> That being said, I would prefer to allow updates to proceed in parallel. Since I was under the impression they already could (and do use that feature). So steps to reproduce would really help.

Yes, true, but as described elsewhere, no relevant log entries, so no way to trace the failure. This is why I think a prevention scheme is the best remedy under the circumstances -- some way to prevent multiple simultaneous updates.



agners commented 1 year ago

@mdegat01 updating Supervisor and HAOS at the same time could be an interesting combination I guess, as we don't know when exactly the OS starts rebooting after update :cold_face: It typically happens quite quickly, so the other updates might be in the middle of things...

That said, all updates should actually be resilient to interruptions, I'd say :thinking:

mdegat01 commented 1 year ago

> updating Supervisor and HAOS at the same time could be an interesting combination I guess

@agners Not possible, at least not anymore. It used to be (I don't know how out of date the OP's system was), but now if Supervisor is out of date, all other updates, including HAOS, are blocked until it is up to date.

But that could be a problem for other things. We don't currently block updates of addons and plugins while OS is updating or vice versa. I thought OS had a timeout to give supervisor a chance to cleanly stop before killing it though? Supervisor won't come to a clean stop until those tasks have finished.

If we think that is a problem I can certainly put in something to prevent that.
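One way such a block might be structured, purely as a sketch: add-on and plugin updates may still overlap with each other, but an OS update waits for them to finish and holds off new ones while it runs. `UpdateGate` and its methods are hypothetical and not taken from the Supervisor source.

```python
import asyncio
from contextlib import asynccontextmanager


class UpdateGate:
    """Hypothetical gate: add-on updates may overlap; an OS update runs alone."""

    def __init__(self) -> None:
        self._cond = asyncio.Condition()
        self._addons_running = 0
        self._os_running = False

    @asynccontextmanager
    async def addon_update(self):
        async with self._cond:
            # Wait until no OS update is in progress.
            await self._cond.wait_for(lambda: not self._os_running)
            self._addons_running += 1
        try:
            yield
        finally:
            async with self._cond:
                self._addons_running -= 1
                self._cond.notify_all()

    @asynccontextmanager
    async def os_update(self):
        async with self._cond:
            # Wait until every add-on update has finished.
            await self._cond.wait_for(
                lambda: not self._os_running and self._addons_running == 0
            )
            self._os_running = True
        try:
            yield
        finally:
            async with self._cond:
                self._os_running = False
                self._cond.notify_all()


async def demo() -> list[str]:
    gate = UpdateGate()
    log: list[str] = []

    async def addon(name: str) -> None:
        async with gate.addon_update():
            log.append(f"start {name}")
            await asyncio.sleep(0.01)  # stand-in for real update work
            log.append(f"end {name}")

    async def os_upgrade() -> None:
        async with gate.os_update():
            log.append("start os")
            await asyncio.sleep(0.01)
            log.append("end os")

    await asyncio.gather(addon("zwave"), addon("mqtt"), os_upgrade())
    return log
```

This keeps parallel add-on updates working as they do today. The sketch does not give the OS update priority, so a steady stream of add-on updates could delay it indefinitely.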

github-actions[bot] commented 1 year ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.