Open gkeishin opened 4 years ago
The log
Failed to set up special execution directory in /var/lib: Not a directory
Failed at step STATE_DIRECTORY spawning /lib/systemd/systemd-timesyncd: Not a directory
indicates that the filesystem is corrupted, or at least that /var/lib is not there. Could you check whether there is a HW issue?
@geissonator ^^^
I saw it when I was doing upgrade/downgrade testing. Let me see if I can get more info.
HW CI is hitting this intermittently. I don't see any indication in the logs of other applications having issues, but the time application does run the earliest:
Oct 13 23:12:05 witherspoon-Y230UF71K03T systemd[1]: Starting Flush Journal to Persistent Storage...
Oct 13 23:12:06 witherspoon-Y230UF71K03T systemd-journald[80]: Time spent on flushing to /var is 875.947ms for 209 entries.
Oct 13 23:12:06 witherspoon-Y230UF71K03T systemd-journald[80]: System Journal (/var/log/journal/f5a255d7642740388dccdc9dfebd0c5a) is 2.0M, max 2.5M, 496.0K free.
Oct 13 23:12:06 witherspoon-Y230UF71K03T systemd-networkd[84]: Enumeration completed
Oct 13 23:12:06 witherspoon-Y230UF71K03T systemd[1]: Started Network Service.
Oct 13 23:12:07 witherspoon-Y230UF71K03T systemd-udevd[83]: Using default interface naming scheme 'v243'.
Oct 13 23:12:07 witherspoon-Y230UF71K03T systemd-udevd[83]: ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
Oct 13 23:12:07 witherspoon-Y230UF71K03T systemd[1]: Started Flush Journal to Persistent Storage.
Oct 13 23:12:07 witherspoon-Y230UF71K03T systemd[1]: Started udev Coldplug all Devices.
Oct 13 23:12:07 witherspoon-Y230UF71K03T systemd-networkd[84]: eth0: IPv6 successfully enabled
Oct 13 23:12:07 witherspoon-Y230UF71K03T kernel: 8021q: adding VLAN 0 to HW filter on device eth0
Oct 13 23:12:07 witherspoon-Y230UF71K03T kernel: ftgmac100 1e660000.ethernet eth0: NCSI: Handler for packet type 0x82 returned -19
Oct 13 23:12:08 witherspoon-Y230UF71K03T systemd-networkd[84]: eth0: Gained carrier
Oct 13 23:12:09 witherspoon-Y230UF71K03T systemd-networkd[84]: eth0: Gained IPv6LL
Oct 13 23:12:11 witherspoon-Y230UF71K03T systemd[1]: Found device /dev/ttyVUART0.
Oct 13 23:12:12 witherspoon-Y230UF71K03T systemd[1]: Found device /dev/aspeed-lpc-ctrl.
Oct 13 23:12:13 witherspoon-Y230UF71K03T systemd[1]: Created slice system-xyz.openbmc_project.Hwmon.slice.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Created slice system-xyz.openbmc_project.led.controller.slice.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Huge Pages File System being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in POSIX Message Queue File System being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Create list of static device nodes for the current kernel being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Rebuild Hardware Database being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Commit a transient machine-id on disk being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd-networkd[84]: eth0: Configured
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Load Kernel Modules being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in FUSE Control File System being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Create System Users being skipped.
Oct 13 23:12:14 witherspoon-Y230UF71K03T systemd[1]: Starting Create Volatile Files and Directories...
Oct 13 23:12:15 witherspoon-Y230UF71K03T systemd[1]: Started Create Volatile Files and Directories.
Oct 13 23:12:15 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Rebuild Journal Catalog being skipped.
Oct 13 23:12:15 witherspoon-Y230UF71K03T systemd[1]: Starting Network Name Resolution...
Oct 13 23:12:15 witherspoon-Y230UF71K03T systemd[127]: systemd-timesyncd.service: Failed to set up special execution directory in /var/lib: Not a directory
Oct 13 23:12:15 witherspoon-Y230UF71K03T systemd[127]: systemd-timesyncd.service: Failed at step STATE_DIRECTORY spawning /lib/systemd/systemd-timesyncd: Not a directory
Oct 13 23:12:15 witherspoon-Y230UF71K03T systemd[1]: Starting Network Time Synchronization...
Oct 13 23:12:15 witherspoon-Y230UF71K03T systemd[1]: Condition check resulted in Update is Completed being skipped.
Oct 13 23:12:16 witherspoon-Y230UF71K03T systemd[1]: systemd-timesyncd.service: Main process exited, code=exited, status=238/STATE_DIRECTORY
Oct 13 23:12:16 witherspoon-Y230UF71K03T systemd[1]: systemd-timesyncd.service: Failed with result 'exit-code'.
Oct 13 23:12:16 witherspoon-Y230UF71K03T systemd[1]: Failed to start Network Time Synchronization.
Oct 13 23:12:16 witherspoon-Y230UF71K03T systemd[1]: systemd-timesyncd.service: Service has no hold-off time (RestartSec=0), scheduling restart.
Oct 13 23:12:16 witherspoon-Y230UF71K03T systemd[1]: systemd-timesyncd.service: Scheduled restart job, restart counter is at 1.
My CI system was still in the failed state. I restarted the systemd-timesyncd service after verifying that /var/lib/ was there and still hit this issue, so something fishy is going on here.
The same issue has been reported elsewhere, and all reports indicate that it is related to a permission issue with /var/lib/systemd/timesync.
On an existing witherspoon that has the issue, the directories look like this:
# ls -al /var/lib/systemd/timesync
lrwxrwxrwx 1 root root 27 Oct 11 20:08 /var/lib/systemd/timesync -> ../private/systemd/timesync
# ls -al /var/lib/private/
drwx------ 3 root root 224 Oct 11 20:08 .
drwxr-xr-x 18 root root 1448 Oct 14 00:59 ..
drwxr-xr-x 3 root root 232 Oct 11 20:08 systemd
# ls -al /var/lib/private/systemd/timesync/
drwxr-xr-x 2 systemd- systemd- 224 Oct 11 19:45 .
drwxr-xr-x 3 root root 232 Oct 11 20:08 ..
-rw-r--r-- 1 systemd- systemd- 0 Oct 11 20:56 clock
We can see that /var/lib/systemd/timesync is a symbolic link to /var/lib/private/systemd/timesync, and that is what causes the issue.
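To check quickly whether a given system is in this state, a small shell test along the following lines may help. This is only a sketch: the function name `timesync_state_kind` and its output strings are made up for illustration, and it assumes a POSIX shell with `readlink` available (as on BusyBox). Pass the path without a trailing slash, otherwise the symlink is followed before the test.

```shell
#!/bin/sh
# Hypothetical helper: report what kind of object sits at a state-directory path.
# Name and output strings are assumptions chosen for this sketch.
timesync_state_kind() {
    path="$1"
    if [ -L "$path" ]; then
        # A symlink here matches the failing layout shown in the listing above.
        echo "symlink -> $(readlink "$path")"
    elif [ -d "$path" ]; then
        # A real directory is the layout seen once the service starts cleanly.
        echo "directory"
    elif [ -e "$path" ]; then
        echo "other"
    else
        echo "missing"
    fi
}
```

Running `timesync_state_kind /var/lib/systemd/timesync` on the affected witherspoon listed above would report the symlink and its target.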
After removing these directories, the service starts successfully, and the directory becomes:
# ls -al /var/lib/systemd/timesync/
drwxr-xr-x 2 systemd- systemd- 224 Oct 14 03:12 .
drwxr-xr-x 6 root root 504 Oct 14 03:12 ..
-rw-r--r-- 1 systemd- systemd- 0 Oct 14 03:15 clock
So in OpenBMC, the solution could be to do a factory reset.
Or, to handle it properly, we need to find out which OpenBMC releases (with older systemd) use the symbolic link for /var/lib/systemd/timesync, and add specific scripts to remove it during code update.
Looks like we'd want something like this service to do it - https://github.com/ricardosalveti/meta-lmp/commit/74513770589475f83173dde62a3ffb22cb73f8e0
How to fix if you hit this:
/bin/rm -fv /var/lib/systemd/timesync && /bin/mv /var/lib/private/systemd/timesync /var/lib/systemd/timesync
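If this needs to run automatically (for example during code update, or from a one-shot unit ordered before systemd-timesyncd.service), the manual fix above could be wrapped so that it is safe to run repeatedly. The following is a minimal sketch, not the content of the linked commit: the function name `fix_timesync_state_dir` and the optional root argument are assumptions added so the logic can be tried on a scratch tree first.

```shell
#!/bin/sh
# Hypothetical helper: replace the leftover symlink with the real state directory.
# $1 is the state root and defaults to /var/lib; the parameter is an assumption
# added so the function can be exercised against a temporary directory.
fix_timesync_state_dir() {
    root="${1:-/var/lib}"
    link="$root/systemd/timesync"
    private="$root/private/systemd/timesync"

    # Nothing to do unless the path is still a symlink.
    [ -L "$link" ] || return 0

    # Same two steps as the manual fix: drop the symlink, then move the data.
    rm -f "$link" || return 1
    if [ -d "$private" ]; then
        mv "$private" "$link" || return 1
    fi
    return 0
}
```

Calling `fix_timesync_state_dir` with no argument acts on /var/lib and needs root. A second call is a no-op because the path is no longer a symlink. If the private directory is absent, only the symlink is removed, which matches the earlier observation that the service recreates the directory on start.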
Noticed on CI run https://gerrit.openbmc-project.xyz/#/c/openbmc/meta-phosphor/+/26016/