Hydrus doesn't exit gracefully when stopping the docker container

TempestoGuy commented 1 month ago

Hydrus version

v591

Qt major version

Qt 6

Operating system

Linux (specify distro and version in comments)

Install method

Third party (AUR, Docker, Chocolatey, etc. Specify in comments)

Install and OS comments

Host is a Debian 12 headless server Using the docker image of hydrusclient

Bug description and reproduction

When stopping Hydrus from docker it always exits incorrectly, and on the next Hydrus boot it says Found and deleted the durable temporary database on boot. The last exit was probably not clean. in the logs.

And the client does that popup that says do you want to restore last session

So to exit correctly i have to exit the client from the GUI and then stop it from docker quickly before the supervisor restarts Hydrus again.

I also tried increasing the grace time in docker compose stop_grace_period: 120s, but that didn't work.

My server is pretty weak so at first i thought it just didn't save the db and close the client fast enough, but I don't think the client is even getting a signal to start turning off.

Also thank you for your amazing work, Hydrus is amazing

Log output

logs when shutting down without exiting the client from the gui first:

hydrusclient  | 2024-09-26 17:41:54,378 WARN received SIGTERM indicating exit request
hydrusclient  | 2024-09-26 17:41:54,379 INFO waiting for fvwm, hydrus, novnc, vnc, xvfb to die
hydrusclient  | 2024-09-26 17:41:54,408 INFO stopped: xvfb (exit status 0)
hydrusclient  | caught signal: 15
hydrusclient  | 26/09/2024 17:41:54 deleted 43 tile_row polling images.
hydrusclient  | 2024-09-26 17:41:54,443 WARN exited: fvwm (exit status 1; not expected)
hydrusclient  | 2024-09-26 17:41:54,443 INFO reaped unknown pid 44 (exit status 1)
hydrusclient  | 2024-09-26 17:41:54,443 INFO reaped unknown pid 46 (exit status 1)
hydrusclient  | 2024-09-26 17:41:54,452 WARN stopped: vnc (exit status 2)
hydrusclient  |
hydrusclient  | Terminating WebSockets proxy (30)
hydrusclient  | 2024-09-26 17:41:54,453 WARN stopped: novnc (exit status 143)
hydrusclient  | 2024-09-26 17:41:55,456 INFO reaped unknown pid 30 (exit status 120)
hydrusclient  | 2024-09-26 17:41:56,512 WARN stopped: hydrus (exit status 1)
hydrusclient  | 2024-09-26 17:41:56,512 INFO reaped unknown pid 48 (exit status 1)
hydrusclient exited with code 0

Relevant logs when client booted after stopping with docker:

hydrusclient  | 2024-09-26 17:44:22,132 INFO success: fvwm entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
hydrusclient  | 2024-09-26 17:44:22,133 INFO success: hydrus entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
hydrusclient  | 2024-09-26 17:44:22,134 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
hydrusclient  | 2024-09-26 17:44:22,134 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
hydrusclient  | 2024-09-26 17:44:22,134 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
hydrusclient  | v591, 2024-09-26 17:44:22: hydrus client started
hydrusclient  | v591, 2024-09-26 17:44:24: booting controller…
hydrusclient  | v591, 2024-09-26 17:44:24: booting db…
hydrusclient  | v591, 2024-09-26 17:44:24: Found and deleted the durable temporary database on boot. The last exit was probably not clean.
hydrusclient  | v591, 2024-09-26 17:44:24: checking database
hydrusclient  | v591, 2024-09-26 17:44:25: initialising managers
hydrusclient  | v591, 2024-09-26 17:44:25: booting gui…
hydrusclient  | v591, 2024-09-26 17:44:26: starting services…
hydrusclient  | v591, 2024-09-26 17:44:26: Running "client api" on port 45869.
hydrusclient  | v591, 2024-09-26 17:44:26: services started

hydrusnetwork commented 1 month ago

Thank you for this report. I am sorry I am late getting to it.

I have had a good look at my shutdown code, and I might have fixed this for v594. I am very ignorant about how Docker works, and how the hydrus Docker package works, but it looks like Docker tries to shut the client process down with SIGTERM, and that is coming back two seconds later with exit status 1 (i.e. something went wrong during the shutdown). I suspect my SIGTERM exit was touching something from the wrong thread. I don't have a Docker instance (or expertise) to test this properly, and my dev machine is Windows, where SIGTERM is a whole different kettle of fish, so we may need to do a couple of rounds of back-and-forth to fully nail this down.

In any case, I have cleaned up how I handle a SIGTERM to ensure it tries to shut things down as quickly as possible while still saving and cleaning everything it can, and handling the different thread interactions properly, and I cleaned some of my shutdown code generally. Please let me know how v594 goes for you, and if it does not, please check out the client log in install_dir?/db_dir/client - date.log and see if there is a nice healthy traceback error in the shutdown section. A nice clean SIGTERM shutdown should just be three lines:

v593, 2024-10-14 15:59:16: doing fast shutdown…
v593, 2024-10-14 15:59:16: shutting down controller…
v593, 2024-10-14 15:59:16: hydrus client shut down

There are also some new 'self-sigterm' calls under help->debug->tests that you might also like to play with, although I guess if you just sigterm the hydrus process, the Docker 'hydrusclient'(?) guy, who I assume is PID 1, will restart it.

I didn't know you couldn't nicely just go file->exit on the program in the Docker package. I guess it the whole question of running hydrus in Docker is so hacky that it isn't a huge deal, but do you have any opinion on how better this could work? Or how it works in other Docker programs? Should I be trying to kill PID 1 or whatever? Should I remove the 'exit' menu entirely if we are in the Docker environment?

TempestoGuy commented 3 weeks ago

Thanks for getting back to me. I am sorry for my late response as well.

I tried v594 and v596 and it does shutdown correctly when I SIGTERM from the container's shell with kill -15 {hydrusclient's pid} or with the new debug tests. it also doesn't restart hydrusclient anymore.

v596, 2024-10-31 23:51:51: doing fast shutdown…
v596, 2024-10-31 23:51:51: shutting down controller…
v596, 2024-10-31 23:51:51: hydrus client shut down

and it doesn't restart hydrusclient when I use file->exit anymore

hydrusclient  | v596, 2024-11-01 02:37:24: saving and hiding gui…
hydrusclient  | v596, 2024-11-01 02:37:25: shutting down gui…
hydrusclient  | v596, 2024-11-01 02:37:25: waiting for managers to exit
hydrusclient  | v596, 2024-11-01 02:37:26: waiting for workers to exit
hydrusclient  | v596, 2024-11-01 02:37:26: waiting for idle shutdown work
hydrusclient  | v596, 2024-11-01 02:37:26: waiting for services to exit
hydrusclient  | v596, 2024-11-01 02:37:26: stopping services…
hydrusclient  | v596, 2024-11-01 02:37:26: shutting down db…
hydrusclient  | v596, 2024-11-01 02:37:26: saving and exiting objects
hydrusclient  | v596, 2024-11-01 02:37:26: cleaning up…
hydrusclient  | v596, 2024-11-01 02:37:26: shutting down controller…
hydrusclient  | v596, 2024-11-01 02:37:26: hydrus client shut down
hydrusclient  | 2024-11-01 02:37:32,571 INFO exited: hydrus (exit status 0; expected)

but it doesn't work when I shutdown from docker docker stop hydrusclient although is says doing fast shutdown… Did it just need more time? maybe adding stopwaitsecs=30 to supervisord.conf could fix it.

hydrusclient  | 2024-11-01 00:55:19,405 WARN received SIGTERM indicating exit request
hydrusclient  | 2024-11-01 00:55:19,406 INFO waiting for fvwm, hydrus, novnc, vnc, xvfb to die
hydrusclient  | caught XIO error:
hydrusclient  | 01/11/2024 00:55:19 deleted 43 tile_row polling images.
hydrusclient  | 2024-11-01 00:55:19,545 INFO stopped: xvfb (exit status 0)
hydrusclient  | extra[1] signal: 15
hydrusclient  | 2024-11-01 00:55:19,641 WARN exited: fvwm (exit status 1; not expected)
hydrusclient  | 2024-11-01 00:55:19,641 INFO reaped unknown pid 39 (exit status 1)
hydrusclient  | 2024-11-01 00:55:19,641 INFO reaped unknown pid 41 (exit status 0)
hydrusclient  | 2024-11-01 00:55:19,641 INFO reaped unknown pid 43 (exit status 1)
hydrusclient  | 2024-11-01 00:55:19,817 WARN stopped: vnc (exit status 3)
hydrusclient  |
hydrusclient  | Terminating WebSockets proxy (30)
hydrusclient  | In exit
hydrusclient  | 2024-11-01 00:55:19,819 WARN stopped: novnc (exit status 143)
hydrusclient  | v596, 2024-11-01 00:55:20: doing fast shutdown…
hydrusclient  | 2024-11-01 00:55:20,314 INFO reaped unknown pid 30 (exit status 0)
hydrusclient  | 2024-11-01 00:55:22,879 WARN stopped: hydrus (exit status 1)
hydrusclient exited with code 0

docker docs say The main process inside the container will receive SIGTERM, and after a grace period, SIGKILL and the main process PID 1 seems to be supervisord checked with top. and supervisord doesn't pass the signal unless specified in the supervisord.conf with stopsignal=Term although it said here that the default signal is TERM I don't think it passes it. Or maybe supervisord somehow ignores the docker SIGTERM or maybe it isn't always PID 1.

I don't know if switching to another minimal init system will fix it, but some of them have signal forwarding to their children processes. here is an article I found with some pros and cons for different mini-init systems for docker containers.

the Docker 'hydrusclient'(?) guy

hydrusclient | in my logs is just the container name. its just how docker compose displays logs since it could be multiple containers per docker-compose.yml

do you have any opinion on how better this could work? Or how it works in other Docker programs? Should I be trying to kill PID 1 or whatever?

I think this is special case since most containers I checked only had a single process that could accept the SIGTERM directly from docker since it was PID1 (although some processes may ignore signals if they are PID 1, I think its because linux treats PID 1 differently, but I am not sure what it does different),

Or they used tini init, from the quick reading I did it seems it can only control one process directly (this is probably wrong), but I think it could be configured to run a process group, and pass down the signal to the process group as a whole, but I haven't looked into it much.

I think using tini as PID 1 and supervisord as the child process of tini (since supervisord doesn't behave well when its PID 1 also not sure about this but I read it somewhere), this will also be the easiest to implement since it doesn't need changing much, but I don't know if it would work.

Its also implemented in docker so you can use it with just docker run --init or docker compose init: true and I tried it but it didn't work as expected and I don't know what is wrong maybe its because supervisord wasn't directly under it as PID 2 or something , or maybe it has to be added to the dockerfile and entrypoint.sh, but I don't know.

hydrusnetwork / hydrus