kchristensen / udm-le

Let's Encrypt support for Ubiquiti UniFi OS
MIT License
572 stars 79 forks source link

unifi-os doesn't restart #29

Closed klemens-st closed 3 years ago

klemens-st commented 3 years ago

Expected With the on-boot script, udm-le waits 5 minutes then installs certs and restarts unifi-os

Actual The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart, everything is back to normal only after manually executing unifi-os restart.

UDM-PRO firmware 1.8.6 on-boot-script 1.0.4 udm-le 1.0.7

kchristensen commented 3 years ago

I've seen this once or twice myself but never captured what is causing it to fail. I don't suppose you have any logs or anything that we might be able to use to figure out why it didn't restart?

Without anything to go on the only thing I can think of would be some kind of Until loop (provided this shell supports it) that does a restart until `unifi-os | grep 'is already running'

klemens-st commented 3 years ago

In my case unfortnately it happens every time udm-pro reboots.

I have ssh access to udm-pro, I didn't paste any logs in my original message because I'm not very familiar yet with that system and don't know which log files are relevant to this problem. Is it just /mnt/data/log/messages or some other files too? In /mnt/data/log/messages there's this at the exact moment udm-le script runs: Mar 8 14:50:43 bryza user.notice unifi-os: Stopping unifi-os but then unifi-os never starts.

kchristensen commented 3 years ago

This might do the trick, I'll see about working into the master branch:

# cat foo
until [ "$(podman inspect -f '{{.State.Running}}' "unifi-os" 2>&1)" = "true" ];
do
    unifi-os restart
    sleep 5
done

# unifi-os stop; sh -x foo
unifi-os: Stopping unifi-os
unifi-os: 200dace39f31e83cfbb1ee947d6016c76c588d2864c652f69769f261db4f777e
unifi-os: Stopping unifi-os SSH daemon... OK
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' false '=' true ]
+ unifi-os restart
unifi-os: unifi-os is not running
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os
unifi-os: Stopping unifi-os SSH daemon... OK
unifi-os: Starting unifi-os SSH daemon... OK
unifi-os: unifi-os
+ sleep 5
+ podman inspect -f '{{.State.Running}}' unifi-os
+ '[' true '=' true ]
#
klemens-st commented 3 years ago

Here's /mnt/data/log/messages from the moment udm-pro rebooted to the moment I ran unifi-os restart messages1.log

kchristensen commented 3 years ago

Ok, I pushed up a possible fix. It'll try to restart the container, and if that fails retry 5 times before it bails out. This might catch whatever edge case we're seeing.

https://github.com/kchristensen/udm-le/commit/83d7be173c2fe36b48e7a5bf827efc4535787f70

Try that on for size and if it looks good I'll cut a new release at some point soon.

klemens-st commented 3 years ago

I tried it, updated the file and rebooted udm-pro, no luck. It's exactly the same as before. unifi-os: Stopping unifi-os 5 minutes after reboot and it never starts back again.

klemens-st commented 3 years ago

Maybe I will just add another script to on-boot.d that restarts the container 6 or 7 minutes after reboot. Will give it a try tomorrow.

kchristensen commented 3 years ago

That's really odd, you can try increasing the sleep in restart_unifi_os to be longer than 5 seconds and it'll do what you're after.

kchristensen commented 3 years ago

Another thing to try, edit this line: https://github.com/kchristensen/udm-le/blob/master/on_boot.d/99-udm-le.sh#L10

Change it to something like: sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log. It should dump a verbose run of what it is trying to do to /var/log/udm-le.log which might shed some light on what is going on.

kchristensen commented 3 years ago

Another thing that stands out to me: On reboot, if you don't need to reissue a certificate because it is > 60 days old, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate). If you have a valid certificate, and you reboot it effectively does nothing but exit.

So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.

klemens-st commented 3 years ago

The output of sh -x ${UDM_LE_PATH}/udm-le.sh bootrenew > /var/log/udm-le.log was:

The certificate expires in 89 days, the number of days defined to perform the renewal is 60: no renewal. No new certificate was found, exiting without restart.

unless you're using a captive portal certificate

Yes, I was trying to set that up, will try without the captive portal and see if it works.

kchristensen commented 3 years ago

Ok that's a clue, see if it works w/o the captive portal and we can go from there. I don't personally use it, so there might be some bug there I haven't run into.

klemens-st commented 3 years ago

I tried without the captive portal. Still same problem.

So, if you know you have a valid certificate and on a reboot unifi-os isn't starting, I feel like there's something else going on we don't know about.

It's not that it isn's starting at all, when UDM-PRO reboots unifi-os starts properly but then stops after 5 minutes when udm-le.sh bootrenew is run

On reboot, if you don't need to reissue a certificate because it is > 60 days old, udm-le doesn't try to restart the unifi-os container because there's no reason to (unless you're using a captive portal certificate).

Not really. udm-le.sh renew includes this check: https://github.com/kchristensen/udm-le/blob/a3f2b922e4fa5d7862511e73c74584330dd6873c/udm-le.sh#L104-L106 but bootrenew doesn't, it runs add_captive and restart_unifi_os regardless: https://github.com/kchristensen/udm-le/blob/a3f2b922e4fa5d7862511e73c74584330dd6873c/udm-le.sh#L110 I looked it up and it seems it was discussed and agreed on in #8. So it seems it fails on the unifi-os restart bit.

kchristensen commented 3 years ago

Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os to see if we can't capture more of the output or something. It's weird that you only see the Stopping unifi-os bit and nothing further as that's part of the output of unifi-os restart indicating that it did execute it, but then is failing somewhere in the unifi-os code.

Speedliner commented 3 years ago

Actual The unifi-os restart command that the script runs doesn't work. Unifi-os is stopped but doesn't restart, everything is back to normal only after manually executing unifi-os restart.

I' facing this problem as well. My UDMP also refuses to work after five minutes and has to be revived with a "unifi-os restart".

UDM-PRO firmware 1.9.1 on-boot-script 1.0.4 udm-le 1.0.7

Is there anything i can provide beyond that to help solve this?

klemens-st commented 3 years ago

Hm, I guess the route to take to get to the bottom of it is to add some more logging to restart_unifi_os

Yes, I was going to try but haven't found the time recently. Will report here once I have tested more.

klemens-st commented 3 years ago

I am also going to test a simple on boot script containing unifi-os restart only because my suspicion is this has something to do with how on boot scripts are executed on my udm-pro

kchristensen commented 3 years ago

I'm on 1.9.1 / on-boot-script 1.0.4 and oddly I haven't run into this lately so figuring out how to get more verbose logs here is going to be key. unifi-os is just a shell script, so if you modify the restart_unifi_os command to call sh -x /usr/sbin/unifi-os restart instead of just unifi-os restart and pipe that output somewhere as we discussed in this thread previously we might get something of use out of there.

Speedliner commented 3 years ago

I changed the udm-le.sh script as following:

grafik

After rebooting the UDMP and waiting for 5 minutes, it creates the udm-le.log file under /var/log/ - but it stays empty:

grafik

What did I wrong?

kchristensen commented 3 years ago

Oh, I think unifi-os does something wonky with stdout, try sh -x /usr/sbin/unifi-os restart 2>&1 > /var/log/udm-le.log maybe?

Speedliner commented 3 years ago

Sadly, that didn't help...

grafik

grafik

klemens-st commented 3 years ago

Yeah, that was never going to work.

From /usr/sbin/unifi-os

case "$cmd" in
    stop)
        stop 2>&1 | logger -st unifi-os
        ;;
    start)
        start 2>&1 | logger -st unifi-os
        ;;
    reset)
        reset 2>&1 | logger -st unifi-os
        ;;
    restart)
        restart 2>&1 | logger -st unifi-os
        ;;
    running)
        container_is_running "${CONTAINER_NAME}"
        ;;
    shell)
        podman exec -ti "${CONTAINER_NAME}" bash
        ;;
    update)
        update "${2}" 2>&1 | logger -st unifi-os
        ;;
    *)
        echo "Usage: $0 [stop start restart shell 'update url']"
        ;;
esac
klemens-st commented 3 years ago

Go ahead and modify /usr/sbin/unifi-os directly, just make a solid backup first.

Or you can just have a look at /var/log/messages which is where logger is outputting stuff

Speedliner commented 3 years ago

Okay, after extensive testing, I came to the following conclustion:

Editing unifi-os directly works - as long as you do not reboot. After rebooting, those changes made are reverted and nothing gets logged anymore.

klemens-st commented 3 years ago

Yeah, that was never going to work.

Not really, I was wrong there. The -x switch does work but the output is written to syslog at /var/log/messages and not to the file we specified, the file remains empty.

I did some further testing. I created an on boot script that contained only

wait 600
unifi-os restart

It failed the same way udm-le fails. So it's clearly something with on boot scripts execution on my udm-pro.

I narrowed down the problem further by modifying my script to sh -x /usr/sbin/unifi-os restart and inspecting /var/log/messages.

unifi-os restart calls the stop function first which is as follows:

stop() {
    if ! container_is_running "${CONTAINER_NAME}"; then
        echo "${CONTAINER_NAME} is not running"
    else
        echo "Stopping ${CONTAINER_NAME}"
        podman stop "${CONTAINER_NAME}"
    fi
    stop_dropbear_daemon
}

According to the logs I saw, podman stop is the last command that executes, stop_dropbear_deamon or the restart function that should be called later are never executed. This is because the script receives a SIGTERM.

Mar 15 21:52:09 unifi user.notice unifi-os: + stop
Mar 15 21:52:09 unifi user.notice unifi-os: + container_is_running unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman inspect -f '{{.State.Running}}' unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + '[' true '=' true ]
Mar 15 21:52:09 unifi user.notice unifi-os: + echo 'Stopping unifi-os'
Mar 15 21:52:09 unifi user.notice unifi-os: Stopping unifi-os
Mar 15 21:52:09 unifi user.notice unifi-os: + podman stop unifi-os
Mar 15 21:52:15 unifi syslog.info kernel: [  333.980562] systemd-journald[26]: Received SIGTERM from PID 1 (system
d-shutdow).
Mar 15 21:52:15 unifi user.emerg kernel: [  333.982681] reboot: cmd=cdef0123
Mar 15 21:52:15 unifi user.emerg kernel: [  333.986192] reboot:   called by: [ 2122]/lib/systemd/systemd-shutdown
halt --timeout 90000000us --log-level 6 --log-target console --log-color
Mar 15 21:52:15 unifi user.emerg kernel: [  333.999053] reboot:   called by: [ 2108]/usr/libexec/podman/conmon --a
pi-version 1 -c 4c2e68e3a60876d796608aa35e9c4e9e2ccf9417127022a8c4ca0b96e5512619 -u 4c2e68e3a6087
Mar 15 21:52:15 unifi user.emerg kernel: [  334.014347] reboot:   called by: [    1]init
Mar 15 21:52:15 unifi user.emerg kernel: [  334.018642] reboot:   called by: [    0]

You can see that just after podman stop unifi-os the script receives a SIGTERM, it then finishes shutting down unifi-os and exits. This never happens when unifi-os restart is run from command line. I have absolutely no idea where this is comming from, I think I'm going to ask on udm-utilities issues.

emlynmac commented 3 years ago

I have been experiencing unifi-os dying since I installed and thought it was something flaky with the utils / OS. Now, I understand that it was this little restart fail causing the issue.

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I think there are some ways to solve this:

  1. Somehow figure out how to spawn off the unifi-os start command into something that out-lives the container and doesn't get killed
  2. Find an alternative way to get the OS to pick up the new certificates. Is there not a way that we can get the web services to restart without pulling down the whole container?
  3. Schedule a restart using the unifi-os to happen at least every 70 days, which should then pick up the new certs.

For now, I have disabled the restart in my scripts and will restart it when I need to manually - I'd rather have the certs fail than the whole unifi-os not working after a power outage.

kchristensen commented 3 years ago

The on-boot-script actually executes the init script on boot on the host, not in the container. The unifi-os script is basically a wrapper around podman commands that starts up the container that runs the controller, etc.

The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.

As for #3, that happens already, provided udm-le.sh gets run every boot it drops a file in the cron.d directory that will handle the nightly check and subsequent re-issue/restart if necessary.

klemens-st commented 3 years ago

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I thought the same but then realised not everyone experiences this issue, which is strange.

kchristensen commented 3 years ago

My understanding is that the udm-le script will be called from the context of the container, so when you call restart from inside, then it shuts down and gets killed and doesn't restart, as shown in the logs above.

I thought the same but then realised not everyone experiences this issue, which is strange.

Indeed. I have no issues with the way this is setup and I've had this running on my UDMP since more or less the Early Access days of the udmp. That said I only use Cloudflare for DNS and none of the captive portal things that people have added over time, so it is hard to say if that is possibly causing issues.

emlynmac commented 3 years ago

The issue is that when whatever causes the restart on boot to fail, the controller and captive portal and what have you won't be running. The actual routing stuff happens on the host and should be unaffected by the failure to restart.

Routing won't be affected, but protect will not be up - which is a problem.

My setup has udm-le, cloudflare-ddns, pihole, cloudflared and homebridge containers. I don't have captive portal set up. Running unifi-os restart from the SSH console works fine.

klemens-st commented 3 years ago

I have disabled the captive portal and have nothing else installed besides on-boot and udm-le. And as I have recently tested it's just the restart command that fails, it has nothing to do with udm-le itself.

emlynmac commented 3 years ago

Does restart work from the shell?

klemens-st commented 3 years ago

Yes, always.

klemens-st commented 3 years ago

But I think you are on the right lines that when on-boot script runs unifi-os restart its executed in a different context. The kernel part of the logs when it's run from shell is different.

emlynmac commented 3 years ago

So peacey over at the udm-utilities has a fix for this issue.

I took his second suggestion and replaced unifi-os restart with pod man restart unifi-os &

The last case is now:

 echo 'Attempting certificate renewal'                                                                
        ${PODMAN_CMD} ${LEGO_ARGS} renew --days 60 && deploy_cert && add_captive && podman restart unifi-os &
        ;;

Seems to have resolved the issue

klemens-st commented 3 years ago

Finally found the time to test 1.0.9. It fixed the issue, thanks!