Closed: klemens-st closed this issue 3 years ago
I've seen this once or twice myself but never captured what is causing it to fail. I don't suppose you have any logs or anything that we might be able to use to figure out why it didn't restart?
Without anything to go on, the only thing I can think of would be some kind of until loop (provided this shell supports it) that does a restart until `unifi-os | grep 'is already running'` succeeds.
In my case unfortunately it happens every time udm-pro reboots.
I have ssh access to udm-pro, I didn't paste any logs in my original message because I'm not very familiar yet with that system and don't know which log files are relevant to this problem. Is it just /mnt/data/log/messages or some other files too? In /mnt/data/log/messages there's this at the exact moment udm-le script runs:
Mar 8 14:50:43 bryza user.notice unifi-os: Stopping unifi-os
but then unifi-os never starts.
This might do the trick, I'll see about working it into the master branch:
# cat foo
until [ "$(podman inspect -f '{{.State.Running}}' "unifi-os" 2>&1)" = "true" ];
do
unifi-os restart
sleep 5
done
# unifi-os stop; sh -x foo
unifi-os: Stopping unifi-os
unifi-os: 200dace39f31e83cfbb1ee947d6016c76c588d2864c652f69769f261db4f777e
unifi-os: Stopping unifi-os SSH daemon... OK
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' false '=' true ]
+ unifi-os restart
unifi-os: unifi-os is not running
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os SSH daemon... OK
unifi-os: unifi-os
+ sleep 5
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' true '=' true ]
#
Here's /mnt/data/log/messages from the moment udm-pro rebooted to the moment I ran unifi-os restart
messages1.log
Ok, I pushed up a possible fix. It'll try to restart the container, and if that fails retry 5 times before it bails out. This might catch whatever edge case we're seeing.
https://github.com/kchristensen/udm-le/commit/83d7be173c2fe36b48e7a5bf827efc4535787f70
Try that on for size and if it looks good I'll cut a new release at some point soon.
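For reference, a minimal sketch of the retry-and-bail approach described above (this is my assumed shape of the fix, not the actual commit; container_running and restart_container here are illustrative wrappers around the real podman/unifi-os commands):

```shell
#!/bin/sh
# Sketch only: see the linked commit for the real implementation.

# Wrapper around the container state check (same podman call used
# elsewhere in this thread).
container_running() {
  [ "$(podman inspect -f '{{.State.Running}}' unifi-os 2>/dev/null)" = "true" ]
}

# Wrapper around the restart command itself.
restart_container() {
  unifi-os restart
}

# Retry the restart until the container reports running, bailing out
# after 5 failed attempts so a broken container can't loop forever.
retry_restart() {
  attempts=0
  until container_running; do
    attempts=$((attempts + 1))
    if [ "$attempts" -gt 5 ]; then
      echo "unifi-os failed to start after 5 attempts" >&2
      return 1
    fi
    restart_container
    sleep 5
  done
  return 0
}
```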
I tried it, updated the file and rebooted udm-pro, no luck. It's exactly the same as before: unifi-os: Stopping unifi-os 5 minutes after reboot, and it never starts back again.
Maybe I will just add another script to on-boot.d that restarts the container 6 or 7 minutes after reboot. Will give it a try tomorrow.
That's really odd. You can try increasing the sleep in restart_unifi_os to be longer than 5 seconds and it'll do what you're after.
Another thing to try, edit this line: https://github.com/kchristensen/udm-le/blob/master/on_boot.d/99-udm-le.sh#L10
Change it to something like: sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log. It should dump a verbose run of what it is trying to do to /var/log/udm-le.log, which might shed some light on what is going on.
Another thing that stands out to me: on reboot, if you don't need to reissue the certificate because it still has more than 60 days of validity left, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate). If you have a valid certificate and you reboot, it effectively does nothing but exit.
So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.
The output of sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log
was:
The certificate expires in 89 days, the number of days defined to perform the renewal is 60: no renewal. No new certificate was found, exiting without restart.
unless you're using a captive portal certificate
Yes, I was trying to set that up, will try without the captive portal and see if it works.
Ok that's a clue, see if it works w/o the captive portal and we can go from there. I don't personally use it, so there might be some bug there I haven't run into.
I tried without the captive portal. Still same problem.
So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.
It's not that it isn't starting at all: when the UDM-PRO reboots, unifi-os starts properly but then stops after 5 minutes, when udm-le.sh bootrenew is run.
On reboot, if you don't need to reissue the certificate because it still has more than 60 days of validity left, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate).
Not really. udm-le.sh renew includes this check:
https://github.com/kchristensen/udm-le/blob/a3f2b922e4fa5d7862511e73c74584330dd6873c/udm-le.sh#L104-L106
but bootrenew doesn't, it runs add_captive and restart_unifi_os regardless:
https://github.com/kchristensen/udm-le/blob/a3f2b922e4fa5d7862511e73c74584330dd6873c/udm-le.sh#L110
I looked it up and it seems it was discussed and agreed on in #8. So it seems it fails on the unifi-os restart bit.
Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os to see if we can't capture more of the output or something. It's weird that you only see the Stopping unifi-os bit and nothing further, as that's part of the output of unifi-os restart, indicating that it did execute it but then is failing somewhere in the unifi-os code.
Actual: The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart; everything is back to normal only after manually executing unifi-os restart.
I'm facing this problem as well. My UDMP also refuses to work after five minutes and has to be revived with a "unifi-os restart".
UDM-PRO firmware 1.9.1, on-boot-script 1.0.4, udm-le 1.0.7
Is there anything I can provide beyond that to help solve this?
Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os
Yes, I was going to try but haven't found the time recently. Will report here once I have tested more.
I am also going to test a simple on-boot script containing only unifi-os restart, because my suspicion is that this has something to do with how on-boot scripts are executed on my udm-pro.
I'm on 1.9.1 / on-boot-script 1.0.4 and oddly I haven't run into this lately, so figuring out how to get more verbose logs here is going to be key. unifi-os is just a shell script, so if you modify the restart_unifi_os command to call sh -x /usr/sbin/unifi-os restart instead of just unifi-os restart and pipe that output somewhere, as we discussed in this thread previously, we might get something of use out of there.
I changed the udm-le.sh script as follows:
After rebooting the UDMP and waiting for 5 minutes, it creates the udm-le.log file under /var/log/, but it stays empty:
What did I do wrong?
Oh, I think unifi-os does something wonky with stdout, try sh -x /usr/sbin/unifi-os restart 2>&1 > /var/log/udm-le.log maybe?
Sadly, that didn't help...
Yeah, that was never going to work.
From /usr/sbin/unifi-os
case "$cmd" in
    stop)
        stop 2>&1 | logger -st unifi-os
        ;;
    start)
        start 2>&1 | logger -st unifi-os
        ;;
    reset)
        reset 2>&1 | logger -st unifi-os
        ;;
    restart)
        restart 2>&1 | logger -st unifi-os
        ;;
    running)
        container_is_running "${CONTAINER_NAME}"
        ;;
    shell)
        podman exec -ti "${CONTAINER_NAME}" bash
        ;;
    update)
        update "${2}" 2>&1 | logger -st unifi-os
        ;;
    *)
        echo "Usage: $0 [stop start restart shell 'update url']"
        ;;
esac
Go ahead and modify /usr/sbin/unifi-os directly, just make a solid backup first.
Or you can just have a look at /var/log/messages, which is where logger is outputting stuff.
Okay, after extensive testing, I came to the following conclusion:
Editing unifi-os directly works - as long as you do not reboot. After rebooting, those changes made are reverted and nothing gets logged anymore.
Yeah, that was never going to work.
Not really, I was wrong there. The -x switch does work, but the output is written to syslog at /var/log/messages and not to the file we specified; the file remains empty.
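That matches how sh -x behaves: the '+' trace lines go to stderr, while the script's own output is piped into logger and ends up in syslog. A tiny stand-in demonstration (using a throwaway /tmp/demo.sh in place of /usr/sbin/unifi-os); note that to capture the trace in a file you have to redirect stderr, and 2>&1 only helps if it comes after the > file redirection:

```shell
#!/bin/sh
# Stand-in demonstration: /tmp/demo.sh plays the role of /usr/sbin/unifi-os.
printf 'echo "Stopping demo"\n' > /tmp/demo.sh

# 'sh -x' writes its '+ ...' trace lines to STDERR, so '> file' alone
# captures only the script's stdout; the trace must be redirected separately.
sh -x /tmp/demo.sh > /tmp/stdout.log 2> /tmp/trace.log

# /tmp/stdout.log now contains "Stopping demo"; /tmp/trace.log contains
# the '+ echo ...' trace line.
```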
I did some further testing. I created an on-boot script that contained only:
sleep 600
unifi-os restart
It failed the same way udm-le fails. So it's clearly something with on boot scripts execution on my udm-pro.
I narrowed down the problem further by modifying my script to sh -x /usr/sbin/unifi-os restart and inspecting /var/log/messages.
unifi-os restart calls the stop function first, which is as follows:
stop() {
    if ! container_is_running "${CONTAINER_NAME}"; then
        echo "${CONTAINER_NAME} is not running"
    else
        echo "Stopping ${CONTAINER_NAME}"
        podman stop "${CONTAINER_NAME}"
    fi
    stop_dropbear_daemon
}
According to the logs I saw, podman stop is the last command that executes; stop_dropbear_daemon and the restart function that should be called later are never executed. This is because the script receives a SIGTERM.
Mar 15 21:52:09 unifi user.notice unifi-os: + stop
Mar 15 21:52:09 unifi user.notice unifi-os: + container_is_running unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman inspect -f '{{.State.Running}}' unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + '[' true '=' true ]
Mar 15 21:52:09 unifi user.notice unifi-os: + echo 'Stopping unifi-os'
Mar 15 21:52:09 unifi user.notice unifi-os: Stopping unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman stop unifi-os
Mar 15 21:52:15 unifi syslog.info kernel: [ 333.980562] systemd-journald[26]: Received SIGTERM from PID 1 (systemd-shutdow).
Mar 15 21:52:15 unifi user.emerg kernel: [ 333.982681] reboot: cmd=cdef0123
Mar 15 21:52:15 unifi user.emerg kernel: [ 333.986192] reboot: called by: [ 2122]/lib/systemd/systemd-shutdown halt --timeout 90000000us --log-level 6 --log-target console --log-color
Mar 15 21:52:15 unifi user.emerg kernel: [ 333.999053] reboot: called by: [ 2108]/usr/libexec/podman/conmon --api-version 1 -c 4c2e68e3a60876d796608aa35e9c4e9e2ccf9417127022a8c4ca0b96e5512619 -u 4c2e68e3a6087
Mar 15 21:52:15 unifi user.emerg kernel: [ 334.014347] reboot: called by: [ 1]init
Mar 15 21:52:15 unifi user.emerg kernel: [ 334.018642] reboot: called by: [ 0]
You can see that just after podman stop unifi-os the script receives a SIGTERM; it then finishes shutting down unifi-os and exits. This never happens when unifi-os restart is run from the command line. I have absolutely no idea where this is coming from; I think I'm going to ask on the udm-utilities issues.
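One hedged way to confirm the SIGTERM theory without touching /usr/sbin/unifi-os is to wrap the restart in a small script that traps TERM and logs it. Everything below is a hypothetical debugging sketch (not part of udm-le) using /tmp paths for illustration; on the UDM you would log via logger or a path on persistent storage, and swap the sleep stand-in for the real restart:

```shell
#!/bin/sh
# Write a hypothetical debug wrapper to /tmp; on the UDM you would drop
# something like this into on_boot.d in place of the plain restart.
cat > /tmp/debug-restart.sh <<'EOF'
#!/bin/sh
LOG=/tmp/udm-le-debug.log
# Record it if something (e.g. the boot process) sends us SIGTERM
# mid-restart, then clean up the stand-in child and exit.
trap 'echo "caught SIGTERM, restart was interrupted" >> "$LOG"; kill "$child" 2>/dev/null; exit 143' TERM
echo "starting restart" >> "$LOG"
sleep 5 &                       # stand-in for the slow podman stop/start
child=$!
wait "$child"
echo "restart completed" >> "$LOG"
EOF
chmod +x /tmp/debug-restart.sh
```

If the boot process really is killing the script, the log will show "starting restart" followed by "caught SIGTERM" and never "restart completed", matching the truncated trace seen in /var/log/messages.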
I have been experiencing unifi-os dying since I installed and thought it was something flaky with the utils / OS. Now, I understand that it was this little restart fail causing the issue.
My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.
I think there are some ways to solve this:
For now, I have disabled the restart in my scripts and will restart it when I need to manually - I'd rather have the certs fail than the whole unifi-os not working after a power outage.
The on-boot-script actually executes the init script on boot on the host, not in the container. The unifi-os script is basically a wrapper around podman commands that starts up the container that runs the controller, etc.
The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.
As for #3, that happens already: provided udm-le.sh gets run every boot, it drops a file in the cron.d directory that will handle the nightly check and subsequent re-issue/restart if necessary.
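For context, the cron.d drop looks roughly like this; the filename, schedule, and install path below are all hypothetical illustrations, so check your own /etc/cron.d for the real entry:

```shell
# /etc/cron.d/udm-le (hypothetical name, schedule, and path; the real
# file is whatever udm-le.sh installs on your system)
30 3 * * * root sh /mnt/data/udm-le/udm-le.sh renew
```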
My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.
I thought the same but then realised not everyone experiences this issue, which is strange.
Indeed. I have no issues with the way this is setup and I've had this running on my UDMP since more or less the Early Access days of the udmp. That said I only use Cloudflare for DNS and none of the captive portal things that people have added over time, so it is hard to say if that is possibly causing issues.
The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.
Routing won't be affected, but protect will not be up - which is a problem.
My setup has udm-le, cloudflare-ddns, pihole, cloudflared and homebridge containers. I don't have captive portal set up.
Running unifi-os restart from the SSH console works fine.
I have disabled the captive portal and have nothing else installed besides on-boot and udm-le. And as I have recently tested it's just the restart command that fails, it has nothing to do with udm-le itself.
Does restart work from the shell?
Yes, always.
But I think you are on the right lines that when the on-boot script runs unifi-os restart, it's executed in a different context. The kernel part of the logs is different when it's run from the shell.
So peacey over at udm-utilities has a fix for this issue.
I took his second suggestion and replaced unifi-os restart with podman restart unifi-os &
The last case is now:
echo 'Attempting certificate renewal'
${PODMAN_CMD} ${LEGO_ARGS} renew --days 60 && deploy_cert && add_captive && podman restart unifi-os &
;;
Seems to have resolved the issue.
Finally found the time to test 1.0.9. It fixed the issue, thanks!
Expected: With the on-boot script, udm-le waits 5 minutes, then installs certs and restarts unifi-os.
Actual: The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart; everything is back to normal only after manually executing unifi-os restart.
UDM-PRO firmware 1.8.6, on-boot-script 1.0.4, udm-le 1.0.7