openhab / openhabian

openHABian - empowering the smart home, for Raspberry Pi and Debian systems
https://community.openhab.org/t/13379

not syncing on ZRAM shutdown #1078

Closed mstormi closed 3 years ago

mstormi commented 3 years ago

As there are reports of mapdb persistence data (which resides on ZRAM) not being restored, I've investigated how zram behaves and found out that if you simply 'reboot', changes are sometimes lost.

I'm asking all of you for help. I just don't understand what's going on here.

[ Hint for debugging: when ZRAM starts for a directory, the lower FS contains that directory as it existed at that moment; it is bind-mounted under /opt/zram/*.bind. So I run date > /var/lib/openhab2/persistence/date before a reboot and cat /opt/zram/persistence.bind/date /var/lib/openhab2/persistence/date after startup to compare the last modification dates. ]

On boot the persistence files are what they were at the time of the last zram start. When you manually run systemctl stop zram-config before rebooting, the files are apparently synced correctly to the lower FS (on the SD card). So this seems to be a systemd config problem !? Apparently no systemctl stop zram-config is being carried out on shutdown ? Anybody proficient in debugging systemd startup/shutdown ordering ?
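Spelled out, the quick check looks roughly like this (same paths as in the hint above, openHAB 2 default locations assumed):

date > /var/lib/openhab2/persistence/date      # write a marker into the ZRAM'ed dir
reboot
# after the box is back up:
cat /opt/zram/persistence.bind/date /var/lib/openhab2/persistence/date
# if the shutdown sync worked, the marker survived and both paths show the new timestamp; if not, it is gone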

Preliminary investigations show zram-config stop is called. But it does not always work.

Another possibly relevant detail: right after boot, with oh2 running, I get a warning although I don't modify any systemd config files:

[15:21:09] root@openhab:/home/openhabian# systemctl stop openhab2
Warning: The unit file, source configuration file or drop-ins of openhab2.service changed on disk. Run 'systemctl daemon-reload' to reload units.

There's a known systemd bug that produces this warning, but I'm not sure that's what I'm hitting, because that one is said to be purely cosmetic.

mstormi commented 3 years ago

cc: @holgerfriedrich

ecdye commented 3 years ago

I'm not sure. I don't really have much time to debug today but I should be able to work on it tomorrow.

We may need to invest some time in making the ZRAM tool openHABian uses safer and more reliable. Perhaps after I finish #1055 I can look into rewriting and updating ZRAM.

ecdye commented 3 years ago

@mstormi, I think I might have found it.

In the zram-config.service file the line

RemainAfterExit=yes

Should be

RemainAfterExit=true

Try changing that and see if it works.

EDIT:

I don't think that actually works but here is a link that explains how to run a stop command before every shutdown.
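Roughly, the usual pattern (a minimal sketch, not necessarily exactly what the link shows) is a oneshot unit that stays active after boot and runs the command from ExecStop=, so it gets executed when the unit is stopped during shutdown; the default dependencies already order plain services before shutdown.target:

[Unit]
Description=Run zram-config stop at shutdown (sketch)

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
ExecStop=/usr/local/bin/zram-config stop

[Install]
WantedBy=multi-user.target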

mstormi commented 3 years ago

systemd.service(5) manpage says

   RemainAfterExit=
       Takes a boolean value that specifies whether the service shall be considered active even when all its processes exited. Defaults to no.

so "no" is boolean according to devs?

mstormi commented 3 years ago

"zram-config stop" should create a log line in /usr/local/share/zram-config/logs/zram-config.log but that does not appear so I'm not sure it's called at shutdown/reboot time at all.

I created another service that just does "ExecStart=zram-config stop" on shutdown, but it is not even listed by systemctl list-dependencies, so I'm a bit lost as to where the error might be ATM.

ecdye commented 3 years ago

Did you run sudo systemctl daemon-reload?

mstormi commented 3 years ago

Of course. Now it gets listed, but it does not execute.

ecdye commented 3 years ago

Is it because ZRAM is PartOf openhab2 and openhab2 is not explicitly stopped on reboot?

mstormi commented 3 years ago

PartOf means ZRAM is stopped when OH2 is stopped; that works when done manually.

I test with date | sudo tee /var/lib/openhab2/persistence/date;systemctl stop openhab2;zramctl;reboot

so if that worked as intended, it would be strange that it only sometimes works and most of the time does not.

ecdye commented 3 years ago

Have you tried a fresh install to see if it is an isolated issue?

mstormi commented 3 years ago

just trying ATM

mstormi commented 3 years ago

The same happened with a fresh install. After some back and forth I came up with the zram-sync.service below. That now seems to work, although it does not explain why it is needed (zram-config should already do this). And as the problem seemed to be intermittent, I'm not sure it won't happen again even with this fix.

I've added this to the openhabian-zram repo so it gets deployed on install. Hence no need to fix anything inside openHABian for the time being.

[Unit]
Description=zram sync on shutdown/reboot
PartOf=openhab2.service
Before=reboot.target shutdown.target poweroff.target halt.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/zram-config stop
ExecStop=/usr/local/bin/zram-config stop
TimeoutSec=0
RemainAfterExit=true

[Install]
WantedBy=basic.target reboot.target shutdown.target poweroff.target halt.target
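For anyone trying this: assuming the unit is saved as /etc/systemd/system/zram-sync.service, activation would be roughly

sudo systemctl daemon-reload
sudo systemctl enable zram-sync.service    # don't start it by hand - ExecStart would run "zram-config stop" right away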

mstormi commented 3 years ago

Is it because ZRAM is PartOf openhab2 and openhab2 is not explicitly stopped on reboot?

Rereading this, I think I skipped over it because I misunderstood it .... do you know if openhab2 stops on reboot ? I didn't even think of that possibility.

ecdye commented 3 years ago

No idea. Does running sudo systemctl stop zram-config.service work properly? If so, then it's a bug somewhere in the reboot process that causes it not to run.

mstormi commented 3 years ago

I now do date|sudo tee /var/lib/openhab2/persistence/date;zramctl;reboot and look at the file after boot. Sometimes it's updated on boot, sometimes it is not. I JUST DON'T UNDERSTAND IT, damn it. The problem must be in the shutdown sequence. [I'm getting additionally mad because in parallel I'm asking a non-UNIX guy to reproduce this on his system and have to spell out every simplest step for him. I'd rather have a pro to work with.]

Could you try reproducing on your system? Update to master and uninstall+reinstall ZRAM to get zram-sync and the latest changes before you test.

At times I noticed that egrep -i 'start|stop' /usr/local/share/zram-config/log/zram-config.log shows no entry ? zram-config should create one when it runs.

Just found a hint, but I'm not sure what to make of it. The lines below that contain "... to break ordering cycle ..." are shown in red. So we have a circular dependency ? That would explain the non-reproducible nature of the errors, because the cycle will be broken at some arbitrary place, sometimes resulting in a proper unmount/sync and sometimes not. But I don't fully understand yet where exactly the cycle is, or whether there are multiple.

It consists of at least A) local-fs.target, B) the srv-* mount units that loopback-mount /etc/openhab2/... to /srv for Samba sharing, and C) zram-config.

C) must run after A (= local-fs.target; the original author did it that way, and it makes sense, as we need the dir mounted before we can turn it into a zram one).
B) should be after A, too, because the dir must exist before we can create a loopback-mount copy under /srv.
B) must also be started after C, else it shares the lower FS of the ZRAM'ed dirs and not the upper/union FS containing the most up-to-date files.

But why does A depend on any of B or C, directly or indirectly ? (See also the inspection commands after the logs below.)

Log 1 with the simple cycle:

Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found ordering cycle on local-fs.target/stop
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found dependency on srv-openhab2\x2duserdata.mount/stop
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found dependency on zram-config.service/stop
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Job local-fs.target/stop deleted to break ordering cycle starting with zram-config.service/stop
Aug 09 15:11:17 devpi systemd[1]: Stopping zram-config...

Log 2 with another cycle:


-- Logs begin at Sun 2020-08-09 13:09:38 CEST, end at Sun 2020-08-09 14:29:41 CEST. --
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found ordering cycle on local-fs.target/start
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on srv-openhab2\x2daddons.mount/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on zram-config.service/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on basic.target/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on sockets.target/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on avahi-daemon.socket/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on sysinit.target/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Job local-fs.target/start deleted to break ordering cycle starting with sysinit.ta
Aug 09 14:11:03 devpi systemd[1]: Stopping zram-config...
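For anyone digging along: the declared orderings of the units involved can be inspected with something like the following (unit names as in the logs above; quote the escaped mount unit name for the shell):

systemctl show -p After -p Before -p Requires 'srv-openhab2\x2duserdata.mount'
systemctl show -p After -p Before -p Requires zram-config.service
systemctl list-dependencies --after local-fs.target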

(hell it would be easiest to remove that /srv crap, but please help me to find a more elegant solution)

mstormi commented 3 years ago

Hmm, I've apparently managed to get rid of the cycle; it's not showing errors now, but that doesn't help. It seems I still cannot get the script to execute on shutdown.

ecdye commented 3 years ago

I'll investigate some this afternoon; I am not sure what is happening. Does making openHAB and frontail stop before shutdown, in addition to zram-config, help?

mstormi commented 3 years ago

I've investigated further but still fail to find the root cause.

First, I got rid of the cycle. Ensure you have none (use journalctl --no-pager to see red warnings for cycles). Use latest zram-config service https://github.com/mstormi/openhabian-zram/blob/master/zram-config.service

I've added (and meanwhile retracted) a zram-sync service. Retracted because I believe zram-config IS being executed properly on every reboot/shutdown, i.e. zram-config stop is always called, so that cannot be the problem. You can check with journalctl -xu zram-config --no-pager.

But I fail to understand why it sometimes does not sync a file from ZRAM back to SD and sometimes it does.

Usually I test with date|sudo tee /var/lib/openhab2/persistence/date;sync;reboot and see if the file contains the updated timestamp

ecdye commented 3 years ago

Is it possible that the stop command is not getting enough time to finish, or the system is already shutting down certain areas that ZRAM needs to sync before ZRAM can get to them?

mstormi commented 3 years ago

I've encountered at least 2 cases. See /usr/local/bin/zram-config. The 1st case is that the lazy unmount (umount ... -l, around line 181) in removeZdir() fails and mergeOverlay() isn't called. In the 2nd case it is called, but the list of files to sync is empty, and I don't understand why.
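A quick probe along these lines (not part of zram-config, just a debugging aid; paths as used elsewhere in this thread, assuming zram1 backs the persistence dir) shows which of the two cases one is in:

mountpoint /var/lib/openhab2/persistence    # case 1: is the overlay still mounted on the original dir?
mountpoint /opt/zram/persistence.bind       # ... and is the bind of the lower FS still there?
ls -lR /opt/zram/zram1/upper | head         # case 2: the files mergeOverlay() would have to copy back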

ecdye commented 3 years ago

Possibly because of global variables not being changed? I see what issue you are running into, and agree that it appears to be caused by the ZRAM code.

As I said, it may be best to completely rewrite ZRAM and use it for our purposes.

mstormi commented 3 years ago

As I said, it may be best to completely rewrite ZRAM and use it for our purposes.

I can get you write access to my repo if you like ... and the original author no longer works on it and has offered to transfer the project to anyone willing to continue it. But it's not just a matter of rewriting it for our purposes; frankly, I wouldn't know what to do differently at that point :-|

mstormi commented 3 years ago

See https://github.com/StuartIanNaylor/zram-config/issues/41#issuecomment-672611672

I don't know how to resolve this for the time being.

But it occurred to me that with #1077, I can introduce a "zramsync" service to sync the zram dirs to an external disk on shutdown and restore them from there on boot.

ecdye commented 3 years ago

Possibly. It may be a better idea to look at the OverlayFS code and see if we can find a proper fix though.

mstormi commented 3 years ago

look at the OverlayFS code and see if we can find a proper fix though.

Feel free to give it a try, but given that the ZRAM author himself didn't succeed there (see his comments), I'm sceptical.

mstormi commented 3 years ago

hopefully fixed by https://github.com/mstormi/openhabian-zram/commit/589ca5984b2cd196a10a66638d2336994772792a

This rsync's the ZRAM dirs only (not the "log" dirs) to the (extra) disk on shutdown and restores them again on boot.
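Conceptually (not the actual script, just to illustrate the idea; /storage is the extra disk, persistence taken as the example dir):

# on shutdown, while the ZRAM'ed dir is still mounted: save its merged content
rsync -a --delete /var/lib/openhab2/persistence/ /storage/zram/persistence/
# on boot, before zram-config start: put it back onto the lower FS on the SD card
rsync -a --delete /storage/zram/persistence/ /var/lib/openhab2/persistence/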

Download zramsync to /usr/local/bin and update zram-config.service in /etc/systemd/system/, both from the openhabian-zram repo. Please help with testing. Use journalctl -xu zram-config.service to see the last runs (both stop and start).

mstormi commented 3 years ago

I've attempted multiple variants, rsync and tar, the joint FS and just the upper FS in /opt/zram/zramX, and I think they all worked. But it still just doesn't sync.

I'm getting mad at this. I must be overlooking something serious. Please, Ethan or Holger, take over or at least take a look. I just cannot find out why it doesn't work. Another pair of eyes will hopefully see things that mine don't.

When I use zramsync interactively, everything works. But called from the service it doesn't. Ensure the service is running, grab the output of journalctl -xu zram-config or journalctl -xe to see for yourself, and please help me find out. I see zramctl list all the dirs, but on accessing /opt/zram/zramX/upper it's gone. Or the file I created for the test just isn't there. I even created yet another service; it runs, but the modified file again isn't there. As I said more than once already, I'm getting mad at this.

At times while experimenting zramsync didn't run because of a dependency cycle, so watch out for that. But with the zram-config.service in the openhabian-zram repo it runs even with cycles present, I believe, so that's probably not the root cause.

ecdye commented 3 years ago

I have been working on a rewrite and am testing it. Also overlayfs was recently updated; it may be fixed now. I'll update you when I have more information on testing.

mstormi commented 3 years ago

Glad to hear it. But mind that I think all those versions of zramsync I tried were fine and that the problem is not in my new or existing code. In addition to that - but not as the root cause - it was not always executed on shutdown while I was tinkering with systemd options, but I think at least that is working now.

Also overlayfs was recently updated

Please give me a pointer describing those changes.

ecdye commented 3 years ago

https://github.com/kmxz/overlayfs-tools/commit/77bf3f7ad3e7b834a15e2166780167646d51cce8

mstormi commented 3 years ago

I had a couple of outages on my production system and, most interestingly, zram-config.service didn't start. A mount unit with Before=zram-config.service didn't come up either. I could not find out why, but there must be a reason inside the systemd dependency chain. When I manually issued start for those two, they started fine.

holgerfriedrich commented 3 years ago

@mstormi maybe we should add some syslog logging functionality to our services to ease tracking. Did you look into journalctl - no hint as to what caused the service to fail?

mstormi commented 3 years ago

Unfortunately not. Apparently it already wasn't up before the last-but-one outage yesterday, and the journal is on /var/log, hence on ZRAM, so no output has been recorded since yesterday :( I already added "-x" to the zram scripts so all lines appear as debug output in journalctl -xu ..., but that doesn't help in tracking why the service doesn't start. Systemd simply seems not to issue the start .... Yes, anything such as logging to help untangle this is good. Any more concrete ideas ?

holgerfriedrich commented 3 years ago

journalctl -b -p err does not help? Use logger to put additional info into the log.

mstormi commented 3 years ago

journalctl -b -p err does not help?

No. I see an NFS mount unit failing (but not why), but that's it. I've seen a message about an openhab2 timeout and have now set all timeouts to 0 for a test.

mstormi commented 3 years ago

still trying to debug

Latest finding: zram-config.service now properly executes on start and on stop at reboot. I execute a script and have added debug output to that script. On stop, the ZRAM dirs seem to already be unmounted by the time the stop script executes.

Any idea how to find out what unmounted them beforehand ?
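A probe like this at the top of the stop script shows what is still mounted at that point (paths per this thread):

findmnt -t overlay                        # any overlay mounts left?
findmnt /var/lib/openhab2/persistence     # the ZRAM'ed persistence dir specifically
zramctl                                   # the zram devices themselves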

Log of stop script (on a freshly installed box), note the mountpoint:

[16:01:43] root@openhab:/storage/zram# cat zramsync.log
zramsync creating backup @ Mon 24 Aug 16:00:41 CEST 2020
inFile=/usr/local/share/zram-config/zram-device-list
swap            /zram0          zram-config0
dir             /zram1          /var/lib/openhab2/persistence           /persistence.bind
log             /zram2          /var/log                /log.bind
NAME       ALGORITHM DISKSIZE  DATA COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram2 lzo-rle       500M 17.2M 81.9K  432K       4
/dev/zram1 lzo-rle       500M 16.4M 10.1K   96K       4
/dev/zram0 lzo-rle       600M    4K   87B   12K       4 [SWAP]
overlay2          479560    9396    434324   3% /var/log
/opt/zram/zram1/upper
zramsync end run @ Mon 24 Aug 16:00:41 CEST 2020

mstormi commented 3 years ago

Just found out about the RequiresMountsFor= option for .service units. Added that to the zram-config.service in the openhabian-zram repo. Looks good at first sight, cross your fingers ....
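For reference: RequiresMountsFor= goes into the [Unit] section and adds Requires= plus After= dependencies on the mount units needed to access the listed paths, along these lines (paths per this thread, not necessarily the exact line used):

[Unit]
...
RequiresMountsFor=/var/lib/openhab2/persistence /var/log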

ecdye commented 3 years ago

I'm hopeful, this has been a real PITA. Ideally that would allow for proper execution without the need for zramsync.

mstormi commented 3 years ago

Unfortunately not (I tried right away, of course) but at least we only need a single service.

ecdye commented 3 years ago

So it is working correctly and syncing with zramsync now, right?

mstormi commented 3 years ago

Apparently yes but I'd like others to validate.

ecdye commented 3 years ago

I will test this afternoon.

mstormi commented 3 years ago

Maybe use the latest image with clonebranch=testbuild so you can validate multiple PRs in one go (otherwise you need to fetch the latest openhabian-zram repo).

ecdye commented 3 years ago

I have to wait to test the other PRs until I get a USB SD card reader, as I don't have one available at the moment. I'll just uninstall and reinstall zram, as that will update the repo.

ecdye commented 3 years ago

I don't think it is working correctly, as RequiresMountsFor= would prevent it from starting zram on boot: those dirs would not be mounted yet, and as such the service would fail.

ecdye commented 3 years ago

See https://github.com/mstormi/openhabian-zram/pull/5 for continued info

mstormi commented 3 years ago

I presume you have seen https://community.openhab.org/t/mapdb-no-restore-at-startup-for-item-status/102835/61. But I just don't know what else to check now that zram start seems to always work on my test system. My hope is simply that it is because of storage.mount and the dependencies introduced in #1077.

ecdye commented 3 years ago

Did you have him reinstall the timesync dhcp hook to see if that fixes the issue?

mstormi commented 3 years ago

Not explicitly, but I had him reinstall, and as that's in master now it should be there.

mstormi commented 3 years ago

See my latest post at https://community.openhab.org/t/mapdb-no-restore-at-startup-for-item-status/102835/66

I'm not sure why local-fs.target depends on srv-openhab2*.mount. The /srv/... mounts are loopback mounts that require the underlying mount to exist, so After=local-fs.target in their unit definitions should be about right. Perhaps any .mount unit is implicitly part of local-fs.target ? See also https://unix.stackexchange.com/questions/519221/how-can-i-solve-this-ordering-cycle-in-a-mount-unit

I'm thinking of removing those /srv mounts and exporting the real dir(s) instead, but that's only the second-best option. If possible I'd like to keep them.
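If it turns out that every local .mount is indeed implicitly ordered Before=local-fs.target, one conceivable way out would be to take the /srv bind mounts out of that default ordering and order them after zram-config explicitly. An untested sketch, paths assumed from the unit name, only to illustrate the mechanism, not to claim it resolves this issue:

# /etc/systemd/system/srv-openhab2\x2duserdata.mount (sketch)
[Unit]
# drop the implicit Before=local-fs.target ordering of local mounts
DefaultDependencies=no
# make the bind mount pick up the overlay, not the lower FS
After=zram-config.service
# still stop cleanly at shutdown
Conflicts=umount.target
Before=umount.target

[Mount]
What=/var/lib/openhab2
Where=/srv/openhab2-userdata
Type=none
Options=bind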