jeflem / ananke

Jupyter distribution with LMS integration via LTI
GNU Affero General Public License v3.0

Container start sometimes fails on host reboot #28

Open jeflem opened 3 months ago

jeflem commented 3 months ago

When systemd stops an Ananke container, the default timeout is 10 seconds. That is often too short to shut down all JupyterLab sessions and JupyterHub gracefully, with the result that containers are not restarted automatically after a system reboot. Adding --stop-timeout=30 to the podman generate systemd line in run.sh should solve this problem (not tested).
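A minimal dry-run sketch of what the proposed change could look like. It only prints the command instead of running it. The container name and the `--name` flag are assumptions for illustration; the actual `podman generate systemd` line in run.sh may use different options.

```shell
#!/bin/sh
# Dry-run sketch: print the podman command instead of executing it.
# generate_unit_cmd CONTAINER [STOP_TIMEOUT]
# CONTAINER is a hypothetical placeholder, not the name used in run.sh.
generate_unit_cmd() {
    container="$1"
    stop_timeout="${2:-30}"   # seconds; 30 is the value proposed in this issue
    # --stop-timeout overrides the container's default stop timeout (10 s).
    printf 'podman generate systemd --name --stop-timeout=%s %s\n' \
        "$stop_timeout" "$container"
}

generate_unit_cmd ananke-container 30
```

To apply it for real, the printed flag would be added to the existing line in run.sh and the unit file regenerated.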

jeflem commented 2 months ago

It seems the stop timeout isn't the problem (Podman sets it to 70 seconds). The start timeout, which Podman does not set, looks like the culprit. It could be set to infinity, see https://www.freedesktop.org/software/systemd/man/latest/systemd.service.html#TimeoutStartSec=
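One way to try this without editing the generated unit file is a systemd drop-in. The sketch below is an assumption-laden illustration: the unit name is hypothetical, the file name `timeout.conf` is arbitrary, and the default directory is the usual location for per-user unit overrides.

```shell
#!/bin/sh
# write_start_timeout_dropin UNIT [DROPIN_ROOT]
# Writes a drop-in that sets TimeoutStartSec=infinity for UNIT.
# UNIT is a hypothetical unit name; DROPIN_ROOT defaults to the
# per-user unit directory (assumption: the container runs as a user service).
write_start_timeout_dropin() {
    unit="$1"
    root="${2:-$HOME/.config/systemd/user}"
    dir="$root/$unit.d"
    mkdir -p "$dir" || return 1
    # "infinity" disables the start timeout for this unit.
    printf '[Service]\nTimeoutStartSec=infinity\n' > "$dir/timeout.conf"
    # Print the path of the file that was written.
    printf '%s\n' "$dir/timeout.conf"
}

# Example (not executed here):
#   write_start_timeout_dropin container-ananke.service
#   systemctl --user daemon-reload
```

A drop-in survives regeneration of the unit file by run.sh, which is why it is used here rather than patching the generated file.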

jeflem commented 2 months ago

It's not an issue of start or stop timeouts after all. Both values are set to 60 seconds on dev/Ananke 0.5. The core issue seems to be that nvidia-persistenced.service comes up too slowly. In principle, the Ananke container's systemd unit could wait for nvidia-persistenced.service (via the --after and --requires arguments to podman generate systemd). But the NVIDIA service runs as root, whereas Ananke runs as a regular user. It seems that user services are not allowed to depend on root (system) services (see the discussion in systemd issue 3312).
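For reference, the dependency arguments mentioned above would be passed roughly as in this dry-run sketch, which only prints the command. The container name and the `--name` flag are assumptions for illustration. As noted in the comment, this ordering is not expected to take effect when the container unit is a user service and the dependency is a system service.

```shell
#!/bin/sh
# Dry-run sketch: print the podman command instead of executing it.
# generate_unit_cmd_with_deps CONTAINER DEPENDENCY
# CONTAINER is a hypothetical placeholder, not the name used in run.sh.
generate_unit_cmd_with_deps() {
    container="$1"
    dep="$2"
    # --after adds an After= ordering dependency to the generated unit,
    # --requires adds a Requires= dependency.
    printf 'podman generate systemd --name --after=%s --requires=%s %s\n' \
        "$dep" "$dep" "$container"
}

generate_unit_cmd_with_deps ananke-container nvidia-persistenced.service
```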