rockstor / rockstor-core

Linux/BTRFS based Network Attached Storage(NAS)
http://rockstor.com/docs/contribute_section.html
GNU General Public License v3.0
553 stars 137 forks source link

Pi4 fails on reboot #2843

Closed phillxnet closed 2 months ago

phillxnet commented 4 months ago

We have, post initially successful installer boot, and initial setup - a rockstor-pre.service fail on subsequent reboots.

Observed on 15.5, 15.6, and Tumbleweed based Pi4 installers.

A subsequent start of our top service will then succeed in starting all dependencies:

Pi4:~ # systemctl status rockstor-pre.service 
× rockstor-pre.service - Tasks required prior to starting Rockstor
     Loaded: loaded (/usr/lib/systemd/system/rockstor-pre.service; disabled; preset: disabled)
     Active: failed (Result: exit-code) since Mon 2024-04-22 13:00:14 WEST; 1 month 1 day ago
    Process: 859 ExecStartPre=/usr/bin/gpg --quick-generate-key --batch --passphrase  rockstor@localhost (code=exited, status=2)
    Process: 860 ExecStartPre=/usr/bin/pass init rockstor@localhost (code=exited, status=0/SUCCESS)
    Process: 922 ExecStartPre=/usr/bin/pass rename --force python-keyring/rockstor/SECRET_KEY python-keyring/rockstor/SECRET_KEY_FALLBACK (code=exited, status=0/SUCCESS)
    Process: 963 ExecStartPre=/usr/bin/pass generate --no-symbols --force python-keyring/rockstor/SECRET_KEY 100 (code=exited, status=1/FAILURE)
        CPU: 827ms

Apr 22 13:00:13 Pi4 systemd[1]: Starting Tasks required prior to starting Rockstor...
Apr 22 13:00:14 Pi4 systemd[1]: rockstor-pre.service: Control process exited, code=exited, status=1/FAILURE
Apr 22 13:00:14 Pi4 systemd[1]: rockstor-pre.service: Failed with result 'exit-code'.
Apr 22 13:00:14 Pi4 systemd[1]: Failed to start Tasks required prior to starting Rockstor.
Pi4:~ # /usr/bin/pass generate --no-symbols --force python-keyring/rockstor/SECRET_KEY 100
The generated password for python-keyring/rockstor/SECRET_KEY is:
7P0PHXzbt7fc3CE7WWj95N4VO9yEx1HA8cfsvipZpvpcBqpt6V94YxyZAh08de4SfUMg4izrmhcgzpJlZMNPIhJvm6iN2gvzFYRe
Pi4:~ # pass
Password Store
└── python-keyring
    └── rockstor
        ├── CLIENT_SECRET
        ├── SECRET_KEY_FALLBACK
        └── SECRET_KEY
Pi4:~ # systemctl start rockstor-bootstrap.service 
Pi4:~ #

N.B. the above was observed after already removing rockstor-build.service entries associated with NetworkManager-wait-online; as this was though to be related previously.


Testing/debugging assistance is required for this issue, which pertains to current DIY installer builds, and our pending current 5.0.9-0 (RC4) downloadable installers.

Real-hardware testing is currently compromised for myself.

phillxnet commented 4 months ago

Linking to our website repo PR, now merged re pending updates to our downloadable installers:

phillxnet commented 2 months ago

To examine the idea that we have a time-out error here re password generation we have the following on real Pi4 hardware:

Pi4:~ # time /usr/bin/pass generate --no-symbols --force python-keyring/rockstor/SECRET_KEY 100
The generated password for python-keyring/rockstor/SECRET_KEY is:
XQV4EBg8TlyrRxgagVfScKGqKIqVqcq04i5GsAcr71xGkWjtgopy6pnd1BYqsc2SbXw95ufaERqjw48cKvCvYPhWxFBgOGpBw3IM

real    0m0.139s
user    0m0.072s
sys     0m0.074s

Pi4:~ # time /usr/bin/pass generate --no-symbols --force python-keyring/rockstor/SECRET_KEY 1000
The generated password for python-keyring/rockstor/SECRET_KEY is:
Ks5Gh9YOdnt3LEzqJVb5ieGkrz2GdJTtMz1oKbx2a7v0xDhqsvRfDApyhermkZ5z3n0KlKvoER7ErTiWwdC5SkFTMgF80R1os5vO6rdordIbF3XXiQtYrVwUogCIwGYQyiggFCOq9nUML3l269ad467N0zNbrqAJFmIm3aLf9xaztqNOSzSszQLu1ov7JRhxwINCdWPQ69mlqcnk1JUdNfi3MbpvZjVjMJxU668xtePBPJdMdkBw0B6k58GY69cW1hkhn0kXrdkrTOKVuPH2GahwEDJbqS8oEwLhDJjmrRrngJMQtIeH2PU11CaCjs1zsDYq9HGb1tSy8yBlDyY7uflom328jWUfooQ74hSTtJgesp5MD6z7zCHkNqWr6HkJYDi7VSGGe7FgWb4Gkj7MtKfbCBTDl36cPLK8Mf9J9MrB6vAGJjeiGMuNVa26gVQWicPOa2zTr7KcpgpNB5DMTE1HEyNTLOXFQL6onF0ivIfM4QanMZ6TF7uB244HzQ98P2Xfb4lGQULEp5OxSRffAFKdHl3gN0Gdq3xVVbBHGPUnngYpbV70JFay2DsFYY12QoBoxN00HuVyFDxLEONZc5UAja4JmKfCqgVgNZMNzklK2UpCoKbfACfHpYM0JWhoUWNNn0yNPoHWd1c3jhP7iS94cu1yxApZeqsJclTGgVOVLUa840NswYkC7LsNcVu4MuxkHUtHjciFCFbd3AvVjf7vPhNqLb6WuiYvbRxh6GxapSzFdeBr1tLcnpzysrhCEboZFvhAYZbUQ9c9XpdVlTpxbhvL0nTac9XYxFXd0ACzBd3RNPfDno81irqm5QfUhWv1GumXC0MsoJdB6vz7yie6UHJ1Rs5CyqY796GoojKEMrrIhqDPuNd2CmiSD7UBeweG1QEyxB0RGTyZzxhI7TJUASI101aVFmj2HWpDNYWfyudv7hTlmJ9rugD9aGjkui8twNM3GRNqkNGmH5kdqrMOxCf1blSSwvc7HTRZ

real    0m0.143s
user    0m0.090s
sys     0m0.064s

so negligibly different with 10 times the password length.

Also half length generation (50 chars):

Pi4:~ # time /usr/bin/pass generate --no-symbols --force python-keyring/rockstor/SECRET_KEY 50
The generated password for python-keyring/rockstor/SECRET_KEY is:
WO28pT61YFMKcdEoJnnaNnAGXZS4oJ94t4lMBAgEhDLEpbcimk

real    0m0.140s
user    0m0.076s
sys     0m0.069s

So this does not seem to be the issue: plus multiple concurrent virtual aarch64 VM instances on the Pi4 (non bare metal) did not show this boot-up failure in rockstor-pre.

Above tests used the published TW installer for Pi4.

But we still have CPU: 827ms indicated in this service!

phillxnet commented 2 months ago

Another potential cause here could be Pi4 (bare metal) has not real-time-clock! During boot-up, on a TW rockstor installer resulting instance, we see:

May 16 11:02:42 Pi4 smbd[1000]: [2024/05/16 11:02:42.237706,  0] ../../source3/smbd/server.c:1746(main)
May 16 11:02:42 Pi4 smbd[1000]:   smbd version 4.19.5-git.342.57620c4f7e1.1SUSE-oS16.9-aarch64 started.
May 16 11:02:42 Pi4 smbd[1000]:   Copyright Andrew Tridgell and the Samba Team 1992-2023
May 16 11:02:42 Pi4 systemd[1]: Started Samba SMB Daemon.
May 16 11:02:42 Pi4 systemd[1]: Reached target Multi-User System.
May 16 11:02:42 Pi4 systemd[1]: Startup finished in 2.762s (kernel) + 7.034s (initrd) + 1min 15.330s (userspace) = 1min 25.127s.
May 16 11:02:42 Pi4 agetty[969]: failed to open credentials directory
May 16 11:02:42 Pi4 agetty[972]: failed to open credentials directory
May 16 11:02:45 Pi4 chronyd[811]: Selected source 91.209.16.78 (2.opensuse.pool.ntp.org)
May 16 11:02:45 Pi4 chronyd[811]: System clock wrong by 5447576.763763 seconds
Jul 18 12:15:44 Pi4 kernel: BTRFS info (device sda3): scrub: started on devid 1
Jul 18 12:15:44 Pi4 kernel: net_ratelimit: 36 callbacks suppressed
Jul 18 12:15:44 Pi4 kernel: brcmfmac: brcmf_set_channel: set chanspec 0x100c fail, reason -52
Jul 18 12:15:44 Pi4 kernel: brcmfmac: brcmf_set_channel: set chanspec 0x100d fail, reason -52
Jul 18 12:15:44 Pi4 kernel: brcmfmac: brcmf_set_channel: set chanspec 0x100e fail, reason -52
Jul 18 12:15:44 Pi4 kernel: brcmfmac: brcmf_set_channel: set chanspec 0xd022 fail, reason -52
Jul 18 12:15:42 Pi4 chronyd[811]: System clock was stepped by 5447576.763763 seconds
Jul 18 12:15:42 Pi4 systemd[1]: Started Discard unused blocks on a mounted filesystem.

I.e. we are jumping from May 16th to today July 18th via chronyd on each reboot. Subsesquent runs of the same rockstor-pre service then run OK, and all aarch64 testing was done on VM's which also have, by host inheritance, working rtc.

chronyd service status

Pi4:~ # systemctl status chronyd
● chronyd.service - NTP client/server
     Loaded: loaded (/usr/lib/systemd/system/chronyd.service; enabled; preset: enabled)
     Active: active (running) since Thu 2024-05-16 11:02:38 WEST; 2 months 2 days ago
       Docs: man:chronyd(8)
             man:chrony.conf(5)
    Process: 790 ExecStart=/usr/sbin/chronyd $OPTIONS (code=exited, status=0/SUCCESS)
    Process: 813 ExecStartPost=/usr/libexec/chrony/helper update-daemon (code=exited, status=0/SUCCESS)
   Main PID: 811 (chronyd)
      Tasks: 1
        CPU: 687ms
     CGroup: /system.slice/chronyd.service
             └─811 /usr/sbin/chronyd

May 16 11:02:37 Pi4 systemd[1]: Starting NTP client/server...
May 16 11:02:37 Pi4 chronyd[811]: chronyd version 4.5 starting (+CMDMON +NTP +REFCLOCK +RTC +PRIVDROP +SCFILTER +SIGND +ASYNCDNS +NTS +SECHASH +IPV6 -DEBUG)
May 16 11:02:37 Pi4 chronyd[811]: Frequency 8.326 +/- 8.772 ppm read from /var/lib/chrony/drift
May 16 11:02:38 Pi4 systemd[1]: Started NTP client/server.
May 16 11:02:45 Pi4 chronyd[811]: Selected source 91.209.16.78 (2.opensuse.pool.ntp.org)
May 16 11:02:45 Pi4 chronyd[811]: System clock wrong by 5447576.763763 seconds
Jul 18 12:15:42 Pi4 chronyd[811]: System clock was stepped by 5447576.763763 seconds
Jul 18 12:38:24 Pi4 chronyd[811]: Selected source 194.117.47.42 (2.opensuse.pool.ntp.org)
phillxnet commented 2 months ago

A potential helper

Pi4:~ # systemctl status chrony-wait.service ○ chrony-wait.service - Wait for chrony to synchronize system clock Loaded: loaded (/usr/lib/systemd/system/chrony-wait.service; disabled; preset: disabled) Active: inactive (dead) Docs: man:chronyc(1)

N.B. likely an akin to NetworkManager-wait-online.service that we use in rockstor-build.service as we need internet acccess to Pypi to build our .venv.

phillxnet commented 2 months ago

TW test of default chrony-wait.service

Once booted already so chronyd.service is done with syncing rtc/ntp already.

Pi4:~ # systemctl start chrony-wait.service 
Pi4:~ # systemctl status chrony-wait.service 
● chrony-wait.service - Wait for chrony to synchronize system clock
     Loaded: loaded (/usr/lib/systemd/system/chrony-wait.service; disabled; preset: disabled)
     Active: active (exited) since Thu 2024-07-18 13:00:35 WEST; 9s ago
       Docs: man:chronyc(1)
    Process: 1619 ExecStart=/usr/bin/chronyc -h 127.0.0.1,::1 waitsync 0 0.1 0.0 1 (code=exited, status=0/SUCCESS)
   Main PID: 1619 (code=exited, status=0/SUCCESS)
        CPU: 134ms

Jul 18 13:00:34 Pi4 systemd[1]: Starting Wait for chrony to synchronize system clock...
Jul 18 13:00:35 Pi4 systemd[1]: Finished Wait for chrony to synchronize system clock.
phillxnet commented 2 months ago

Fix (edited)

Adding:

After=chrony-wait.service

&

> Requires=chrony-wait.service

Wants=chrony-wait.service

to rockstor-pre.service.

Fix tested on a Rockstor 15.6 based install from our downloadable installer. Using a bare metal Pi4: --- 10 sequential boot/reboots were accomplished without rockstor* service start issues. --- All upstream updates, as of this comment, applied. Rockstor package version 5.0.9-0, as per current installers.

How-to

nano /opt/rockstor/conf/rockstor-pre.service

[Unit] section then reads:

[Unit]
Description=Tasks required prior to starting Rockstor
After=rockstor-build.service
After=postgresql.service
After=NetworkManager.service
After=chrony-wait.service
Requires=rockstor-build.service
Requires=postgresql.service
Requires=NetworkManager.service
Wants=chrony-wait.service

Apply the change via:

systemctl start rockstor-bootstrap.service

Web-UI then available and can be used to test reboot capability.

phillxnet commented 2 months ago

The last comment indicates our chosen fix, which makes some sense: however it may pertain more to some-other interdependency. However it makes sense that we ensure NTP updates the clock, especially on non persistent RTC systems such as the Pi4, as we then have better time reporting on our rockstor services:

Pi4:~ # systemctl status rockstor-pre.service 
● rockstor-pre.service - Tasks required prior to starting Rockstor
     Loaded: loaded (/usr/lib/systemd/system/rockstor-pre.service; enabled; preset: disabled)
     Active: active (exited) since Thu 2024-07-18 16:49:15 WEST; 10min ago
    Process: 1860 ExecStart=/usr/local/bin/poetry run initrock (code=exited, status=0/SUCCESS)
   Main PID: 1860 (code=exited, status=0/SUCCESS)
      Tasks: 5 (limit: 4915)
        CPU: 1min 12.097s
     CGroup: /system.slice/rockstor-pre.service
             ├─2846 gpg-agent --homedir /root/.gnupg --use-standard-socket --daemon
             └─2850 scdaemon --multi-server

I.e. no more references back 1 to 2 months to installer build date:

Active: failed (Result: exit-code) since Mon 2024-04-22 13:00:14 WEST; 1 month 1 day ago

phillxnet commented 2 months ago

N.B. an edit of the above fix to Wanting rather than Requiring the chrony-wait.service accounted for the below note re one time-out seen on chrony-wait.service.


Further testing indicates that on occasions we see failures in the chrony-wait.service (timeout in the following case):

Pi4:~ # systemctl status chrony-wait.service
× chrony-wait.service - Wait for chrony to synchronize system clock
     Loaded: loaded (/usr/lib/systemd/system/chrony-wait.service; disabled; preset: disabled)
     Active: failed (Result: timeout) since Mon 2024-06-24 13:03:16 WEST; 3 weeks 3 days ago
       Docs: man:chronyc(1)
    Process: 855 ExecStart=/usr/bin/chronyc -h 127.0.0.1,::1 waitsync 0 0.1 0.0 1 (code=exited, status=1/FAILURE)
   Main PID: 855 (code=exited, status=1/FAILURE)
        CPU: 148ms

Jun 24 13:00:15 Pi4 systemd[1]: Starting Wait for chrony to synchronize system clock...
Jun 24 13:03:16 Pi4 systemd[1]: chrony-wait.service: start operation timed out. Terminating.
Jun 24 13:03:16 Pi4 systemd[1]: chrony-wait.service: Main process exited, code=exited, status=1/FAILURE
Jun 24 13:03:16 Pi4 systemd[1]: chrony-wait.service: Failed with result 'timeout'.
Jun 24 13:03:16 Pi4 systemd[1]: Failed to start Wait for chrony to synchronize system clock.

Which in turn fails our services. This may be more an artefact of this service's inadequate timeout, or an intermittent failure otherwise. Looking to alternative approaches.

phillxnet commented 2 months ago

time-sync.target

https://www.freedesktop.org/software/systemd/man/latest/systemd.special.html#time-sync.target In the last comments noted fail, due to chrony-wait.service's timeout this target was still indicated as having been established, even though the chrony-wait.service failed.

Services where accurate time is essential should be ordered after this unit, but not pull it in.

Wants=time-sync.target

This time-sync.target did not seem to work for us! Reverting to now edited prior fix.

phillxnet commented 2 months ago

Closing as: Fixed by #2879