home-assistant / core

:house_with_garden: Open source home automation that puts local control and privacy first.
https://www.home-assistant.io
Apache License 2.0

Automations stay in "still running" mode #103702

Closed: derjoerg closed this issue 4 months ago

derjoerg commented 10 months ago

The problem

Hi, after upgrading from 2023.10 to 2023.11 I have the problem that all my automations which use signal_messenger are stuck in "still running" mode after the first run. I need to manually deactivate and reactivate the automation to get it working again, but only for one more run. When I change the service call from signal_messenger to e.g. notify.mobile_app, the automations work as expected.
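
For illustration, a minimal automation of the affected kind could look like this (trigger, entity IDs, and the notify service name are placeholders, not taken from my actual configuration):

```yaml
# Hypothetical example of an affected automation (all names are placeholders).
automation:
  - alias: "Notify when everyone leaves"
    mode: single                  # a run stuck in "still running" blocks any further runs
    trigger:
      - platform: state
        entity_id: group.family
        to: "not_home"
    action:
      - service: notify.signal    # the notify service created by the signal_messenger integration
        data:
          message: "Everyone left the house"
```

Swapping the `notify.signal` line for a mobile_app notify service makes the same automation run fine.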

Below is the trace of the first run of the automation, when the Signal message goes through: 01_away.json

And here is the trace of a second run of the automation, when nothing works anymore: 02_return.json

What else can I provide to help troubleshoot this?

What version of Home Assistant Core has the issue?

core-2023.11.1

What was the last working version of Home Assistant Core?

No response

What type of installation are you running?

Home Assistant OS

Integration causing the issue

Signal Messenger

Link to integration documentation on our website

https://www.home-assistant.io/integrations/signal_messenger/

Diagnostics information

No response

Example YAML snippet

No response

Anything in the logs that might be useful for us?

No response

Additional information

No response

home-assistant[bot] commented 10 months ago

Hey there @bbernhard, mind taking a look at this issue as it has been labeled with an integration (signal_messenger) you are listed as a code owner for? Thanks!

Code owner commands

Code owners of `signal_messenger` can trigger bot actions by commenting:

- `@home-assistant close` Closes the issue.
- `@home-assistant rename Awesome new title` Renames the issue.
- `@home-assistant reopen` Reopen the issue.
- `@home-assistant unassign signal_messenger` Removes the current integration label and assignees on the issue, add the integration domain after the command.

(message by CodeOwnersMention)


- signal_messenger documentation
- signal_messenger source

(message by IssueLinks)

derjoerg commented 10 months ago

I saw similar issues reported here and also in the forum, but they were all related to zha or zwave.

Additionally someone mentioned a change to a timeout parameter for service-calls, but I don't know if this is related.

derjoerg commented 10 months ago

I'll just add as much info as I can gather; hopefully it helps: I also have some alerts set up which send out notifications over signal_messenger, and they are working as expected. Adding "continue_on_error" to the service call doesn't help (example below).
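
For reference, that flag sits on the action itself; the attempt looked roughly like this (service name and message are placeholders):

```yaml
action:
  - service: notify.signal        # placeholder for the signal_messenger notify service
    continue_on_error: true       # did not prevent the automation from getting stuck
    data:
      message: "test"
```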

derjoerg commented 10 months ago

A possible workaround would be to create a script which calls the signal_messenger notify service, and to start this script from an automation with script.turn_on (sketched below). Then the automations do not hang.

But it is only a workaround and does not address the source of the issue.
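
A sketch of that workaround, with placeholder names throughout:

```yaml
# Hypothetical workaround: wrap the notify call in a script and fire it asynchronously.
script:
  send_signal:
    sequence:
      - service: notify.signal          # placeholder for the signal_messenger notify service
        data:
          message: "{{ message }}"

automation:
  - alias: "Notify on return"
    trigger:
      - platform: state
        entity_id: group.family
        to: "home"
    action:
      - service: script.turn_on         # returns immediately; the automation does not wait for the script
        target:
          entity_id: script.send_signal
        data:
          variables:
            message: "Somebody is back home"
```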

derjoerg commented 10 months ago

I tested the above-mentioned workaround over the weekend, but it is also not working. It is true that the automations do not hang when such a script is called, but the script itself never finishes; it looks like it fills up an internal queue, and after a certain number of "running" scripts HA becomes unresponsive.

Additionally, I have now tested the same with Telegram, and that works as expected.

renini commented 9 months ago

I have the exact same thing. Are you also using mode: json-rpc in the signal addon? Not sure if this would make a difference, but I am going to try mode: native.

Although, reading the other issues, this sounds more like something related to HA core. I tried to debug a bit but couldn't really put a finger on the issue...

derjoerg commented 9 months ago

I have the exact same thing. Are you also using mode: json-rpc in the signal addon? Not sure if this would make a difference, but I am going to try mode: native.

Hi @renini, thanks for this tip. I have now also changed the mode from "json-rpc" to "native", and so far it looks promising. I will keep an eye on it over the next few days.
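
For anyone else trying this: the change boils down to switching the add-on's mode option (I'm assuming the key is literally mode, as quoted above; check the add-on documentation for the full schema):

```yaml
# Signal Messenger add-on configuration (schema assumed from the comments above)
mode: native   # previously: json-rpc
```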

viperk1 commented 9 months ago

I have the exact same problem. Automations stuck in 'still running' state after sending a signal message. I just changed to native mode and will test over the next little while.

GatoPharaoh commented 7 months ago

What is the status of this issue? I just started noticing that my automations containing Signal notifications get stuck.

Did native mode "fix" this for you? How can we get a real fix going?

derjoerg commented 7 months ago

Hi @GatoPharaoh, after changing the add-on config to "native" mode the automations are now running normally, and I have not had a single "stuck" one.

Regarding a possible fix ... as you can see in the comments above, not a single reaction from the code owner :(

bbernhard commented 7 months ago

If the automations work after switching to the native mode, then the json-rpc mode is/was to blame here.

Do you use the Home Assistant addon or the signal-cli-rest-api container?

Most of the time it is the signal-cli-rest-api add-on or Docker container that is to blame here. (The Signal service changes quite often, and sometimes they introduce changes which break the add-on/Docker container.)

There were some problems with the json-rpc mode in the past, but with the latest rework it should be pretty stable.

derjoerg commented 7 months ago

Hi, thanks for the feedback. I'm using the HA add-on, version 0.80.0.

bbernhard commented 7 months ago

And it still hangs in json-rpc mode but not in native mode? If so, please try to use curl to send a message (see the example below). Does it also hang there?
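
Something along these lines should work for a quick manual test (number and recipient are placeholders; please double-check the endpoint against the signal-cli-rest-api docs for your version). The --max-time flag is only there so the terminal doesn't hang forever if the container never answers:

```sh
# Send a test message directly to the signal-cli-rest-api container (placeholder numbers).
curl --max-time 30 -X POST \
  -H "Content-Type: application/json" \
  -d '{"message": "test from curl", "number": "+431234567", "recipients": ["+437654321"]}' \
  'http://127.0.0.1:8080/v2/send'
```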

GatoPharaoh commented 7 months ago

Just noticed I was still on 0.7; I just updated and will test whether this fixes the problem.

As for the issue itself: I tried using curl, and there were no issues with sending messages.

In fact, Home Assistant itself was sending the message inside an automation without a problem, but just ONCE. After Home Assistant successfully sent the automated message, the automation got stuck and said "still running", implying it was waiting for some kind of response from signal_messenger that it never got. Having those automations in single mode means that the automation won't run again until "still running" gets "unstuck".

Reloading the automations or restarting Home Assistant got the automations unstuck. I also just noticed that they got "unstuck" because I updated the signal-cli-rest-api container, which Home Assistant saw as an error (see picture below). The message in the picture got sent, and the automation was stuck from Jan 22 until now (due to me updating the container).

Hope this helps.

[screenshot omitted]

GatoPharaoh commented 7 months ago

Also, see Petro's comment on the forums; it looks like they know where the problem is, but the integrations have to implement a fix:

https://community.home-assistant.io/t/2023-8-update-breaks-automations/598675/221

bbernhard commented 7 months ago

Do you see the same behavior as @derjoerg that changing the mode to native fixes the issue? Because actually there shouldn't be any differences between the json-rpc mode and the native mode - at least regarding the outward facing interface. They should behave exactly the same in that regard.

Also, see Petro's comment on the forums; it looks like they know where the problem is, but the integrations have to implement a fix

I am not sure if I completely understand what they were discussing there. But to me it seems that the "automation gets stuck" issue should only happen when the service takes more than 10 seconds to respond, which shouldn't happen with either the json-rpc or the native mode.

The only thing I can think of where the request might take longer than 10 seconds is if you restart the signal-cli-rest-api Docker container/add-on and immediately afterwards try to send a message. If you run the signal-cli-rest-api service in json-rpc mode, the Java application is used instead of the natively compiled binary (which is used in the native mode). The Java application takes quite a while to start up and could easily exceed the 10-second max timeout (especially on systems like the Raspberry Pi).

In order to try that, you could do the following: restart the signal-cli-rest-api container/add-on and immediately afterwards trigger the automation (or send a message), then check whether it hangs.
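
A rough way to check that from the command line (container name and phone numbers are placeholders; with the HA add-on you would restart the add-on from the UI instead of using docker):

```sh
# Restart the container, then immediately send a message and measure how long the request takes.
docker restart signal-cli-rest-api
time curl --max-time 60 -X POST \
  -H "Content-Type: application/json" \
  -d '{"message": "startup timing test", "number": "+431234567", "recipients": ["+437654321"]}' \
  'http://127.0.0.1:8080/v2/send'
```

If that request regularly takes more than 10 seconds right after a restart, it would fit the theory above.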

GatoPharaoh commented 7 months ago

Do you see the same behavior as @derjoerg that changing the mode to native fixes the issue? Because actually there shouldn't be any differences between the json-rpc mode and the native mode - at least regarding the outward facing interface. They should behave exactly the same in that regard.

I have not tried native mode yet. Thing is, I can't even reproduce it consistently; it just happens at some point. I have updated the container to 0.80 and will test over the coming days whether the issue returns.

I am not sure if I completely understand what they were discussing there. But to me it seems that the "automation gets stuck" issue should only happen when the service takes more than 10 seconds to respond, which shouldn't happen with either the json-rpc or the native mode.

Yeah, it's a little weird how the whole discussion around the "still running" issue is handled. After some searching I found the changes to the zwave integration they mentioned, which was affected as well; I don't know if this gives you some insight: https://github.com/home-assistant-libs/zwave-js-server-python/pull/758/commits

The only thing I can think of where the request might take longer than 10 seconds is if you restart the signal-cli-rest-api Docker container/add-on and immediately afterwards try to send a message. If you run the signal-cli-rest-api service in json-rpc mode, the Java application is used instead of the natively compiled binary (which is used in the native mode). The Java application takes quite a while to start up and could easily exceed the 10-second max timeout (especially on systems like the Raspberry Pi).

Thing is, my signal-cli-rest-api Docker container has been running for 3 months straight; I haven't restarted it since the last update. Also, even though automations get stuck when using signal_messenger, sending a message via the services tab in Home Assistant still works without issues. Other automations with signal_messenger work as well, until they just get stuck for unknown reasons. So the container itself seems to work as expected.

I'll keep an eye on my automations in the coming days and will report back. If the issue returns, I'll try native mode as well.

bbernhard commented 7 months ago

Really strange. Unfortunately I do not know anything about how Home Assistant automations are implemented. So it is kinda hard to figure out what needs to be changed in order to fix the issue in the integration :/.

I also had a quick look at the zwave diff, but couldn't spot anything useful. If anyone has some additional knowledge, please let me know.

GatoPharaoh commented 7 months ago

Well, it happened again, so container version 0.80 in json-rpc mode still has the issue. I just checked the logs of the container, but it doesn't look like they have much to offer.

[screenshot omitted]

In /var/log/signal-cli-json-rpc-1/out.log:

WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  SignalWebSocketHealthMonitor - Missed keep alives, identified last: 1706005190976 unidentified last: 1706005190976 needed by: 1706005220840
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ManagerImpl - No profile name set. When sending a message it's recommended to set a profile name with the updateProfile command. This may become mandatory in the future.
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms
WARN  ReceiveHelper - Connection closed unexpectedly, reconnecting in 100 ms

In /var/log/supervisor/supervisord.log (the last entry here is from before the issue reappeared):

2024-01-25 19:55:13,025 CRIT Supervisor is running as root.  Privileges were not dropped because no user is specified in the config file.  If you intend to run as root, you can set user=root in the config file to avoid this message.
2024-01-25 19:55:13,026 WARN For [program:signal-cli-json-rpc-1], redirect_stderr=true but stderr_logfile has also been set to a filename, the filename has been ignored
2024-01-25 19:55:13,026 INFO Included extra file "/etc/supervisor/conf.d/signal-cli-json-rpc-1.conf" during parsing
2024-01-25 19:55:13,335 INFO RPC interface 'supervisor' initialized
2024-01-25 19:55:13,335 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-01-25 19:55:13,338 INFO daemonizing the supervisord process
2024-01-25 19:55:13,339 INFO supervisord started with pid 32
2024-01-25 19:55:13,682 INFO spawned: 'signal-cli-json-rpc-1' with pid 39
2024-01-25 19:55:14,692 INFO success: signal-cli-json-rpc-1 entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

Apart from changing to native mode, is there anything you want me to try so we can find the cause of the issue?

GatoPharaoh commented 7 months ago

Mhhh, I just tried sending a message over curl after it got stuck in Home Assistant: the curl message is sent and received successfully, but the container doesn't return any response, so the terminal gets "stuck" until I kill it with SIGINT. Other curl calls don't respond either; I just tried to update the profile name via curl.

[screenshot omitted]

After restarting the signal-cli-rest-api container and sending a curl call, I get a response with a timestamp. So it does look like the container gets kind of stuck at some point.

Response after restarting the container: [screenshot omitted]

bbernhard commented 7 months ago

Really strange. If not already done, please try to set a profile name via the Update Profile PUT request.
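
Setting the profile name can be done with a request along these lines (number and name are placeholders; see the signal-cli-rest-api API docs for the exact endpoint):

```sh
# Set the Signal profile name via the REST API (placeholder number).
curl -X PUT \
  -H "Content-Type: application/json" \
  -d '{"name": "Home Assistant"}' \
  'http://127.0.0.1:8080/v1/profiles/+431234567'
```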

I think we need to find out whether the REST API wrapper or the signal-cli process gets stuck. Could you please try to change the log level to debug with `curl -X POST -H "Content-Type: application/json" -d '{"logging": {"level": "debug"}}' 'http://127.0.0.1:8080/v1/configuration'` and then post the (obfuscated!) log output from the Docker container here. Hopefully that gives us more information.

Please make sure not to restart the Docker container after you've set the log level to debug, as the log level setting isn't persistent. Every restart of the Docker container resets the log level to its default again.

GatoPharaoh commented 7 months ago

please try to set a profile name via the Update Profile PUT request.

Done that after I saw it in the logs today; the entries have stopped appearing since.

I think we need to find out, whether the REST API wrapper or the signal-cli process gets stuck. Could you please try to change the log level to debug with curl -X POST -H "Content-Type: application/json" -d '{"logging": {"level": "debug"}}' 'http://127.0.0.1:8080/v1/configuration' and then post the (obfuscated!) log output from the docker container log here. Hopefully that gives us more information.

Got it, enabled debugging; will report back.

GatoPharaoh commented 7 months ago

Five days and more than a dozen messages later, and not a single automation has gotten stuck since. The signal-cli-rest-api logs are clean as well. I'm wondering if it really was just the profile name that needed to be set, or whether it is a coincidence?

@bbernhard After setting the log level to debug, there hasn't been a single entry in the logs. Is this expected behavior, or should the log be full of entries even when just sending a message successfully?

bbernhard commented 7 months ago

It could be the profile name. I think those warning messages are also added to the JSON payload that is returned from signal-cli. Maybe there is a case where we do not parse those messages properly and then return a JSON payload that can't be handled by the HA integration, so the integration gets stuck and waits for a response forever.

There should be a lot of messages written. However, those log entries aren't written to the log files you've mentioned above; they are written to the docker-compose log file. Maybe you've disabled logging in the docker-compose.yml file?
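
For reference, whether the container output ends up in that log is controlled by the logging section of the compose service; a rough sketch with logging left at its default (image tag, port, volume path, and MODE value are illustrative, not a verified configuration):

```yaml
# Sketch of a compose service for signal-cli-rest-api with default logging (illustrative values).
services:
  signal-cli-rest-api:
    image: bbernhard/signal-cli-rest-api:latest
    environment:
      - MODE=json-rpc              # or "native"
    ports:
      - "8080:8080"
    volumes:
      - "./signal-cli-config:/home/.local/share/signal-cli"
    logging:
      driver: json-file            # the default; "none" would suppress the container logs entirely
```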

GatoPharaoh commented 7 months ago

My bad, I forgot about the docker-compose log file. I just looked into the log; the only thing that constantly reappears is the following:

{"log":"time=\"2024-01-29T12:16:08Z\" level=debug msg=\"Couldn't send message to golang channel, as there's no receiver\"\n","stream":"stderr","time":"2024-01-29T12:16:08.620433913Z"}

That message appears after every message that is sent, sometimes even four times in a row:

{"log":"time=\"2024-01-29T12:16:21Z\" level=debug msg=\"Couldn't send message to golang channel, as there's no receiver\"\n","stream":"stderr","time":"2024-01-29T12:16:21.129796078Z"}
{"log":"time=\"2024-01-29T12:16:21Z\" level=debug msg=\"Couldn't send message to golang channel, as there's no receiver\"\n","stream":"stderr","time":"2024-01-29T12:16:21.677083043Z"}
{"log":"time=\"2024-01-29T12:16:22Z\" level=debug msg=\"Couldn't send message to golang channel, as there's no receiver\"\n","stream":"stderr","time":"2024-01-29T12:16:22.155938329Z"}
{"log":"time=\"2024-01-29T12:16:25Z\" level=debug msg=\"Couldn't send message to golang channel, as there's no receiver\"\n","stream":"stderr","time":"2024-01-29T12:16:25.065024662Z"}

But, as I said, the issue hasn't come up since. I will continue to monitor.

bbernhard commented 7 months ago

I guess you do not have a websocket connection established to the receive endpoint? In that case it is fine; it is just some verbose debug logging to troubleshoot websocket issues with the receive endpoint.

GatoPharaoh commented 7 months ago

Ah okay, I run a small websocket script to hand over any received messages to Home Assistant, but I haven't had any issues with that.

GatoPharaoh commented 7 months ago

After testing for more than 2 weeks, the issue has not come back. The only thing I did was set the profile name.

bbernhard commented 7 months ago

Thanks for the feedback - glad it is working now!

viperk1 commented 7 months ago

Just as an additional data point - I've been following this issue with the same problem and the 'set profile' resolution also worked for me.

issue-triage-workflows[bot] commented 4 months ago

There hasn't been any activity on this issue recently. Due to the high number of incoming GitHub notifications, we have to clean some of the old issues, as many of them have already been resolved with the latest updates. Please make sure to update to the latest Home Assistant version and check if that solves the issue. Let us know if that works for you by adding a comment 👍 This issue has now been marked as stale and will be closed if no further activity occurs. Thank you for your contributions.