Closed by mstormi 3 years ago
cc: @holgerfriedrich
I'm not sure. I don't really have much time to debug today but I should be able to work on it tomorrow.
We may need to invest some time in making the ZRAM tool openHABian uses safer and more reliable. Perhaps after I finish #1055 I can look into rewriting and updating ZRAM.
@mstormi, I think I might have found it.
In the `zram-config.service` file, the line `RemainAfterExit=yes` should be `RemainAfterExit=true`. Try changing that and see if it works.
EDIT:
I don't think that actually works but here is a link that explains how to run a stop command before every shutdown.
The systemd.service(5) manpage says:

> RemainAfterExit=
> Takes a boolean value that specifies whether the service shall be considered active even when all its processes exited. Defaults to no.

So "no" is a boolean according to the devs?
`zram-config stop` should create a log line in `/usr/local/share/zram-config/logs/zram-config.log`, but that does not appear, so I'm not sure it's called at shutdown/reboot time at all.
I created another service to just `ExecStart=zram-config stop` on shutdown, but it is not even listed by `systemctl list-dependencies`, so I'm a bit lost as to where the error might be ATM.
Did you run `sudo systemctl daemon-reload`?
Of course. Now it does get listed, but it does not execute.
Is it because ZRAM is PartOf openhab2 and openhab2 is not explicitly stopped on reboot?
`PartOf` means to stop ZRAM when OH2 is stopped; that works if entered manually.
I test with `date | sudo tee /var/lib/openhab2/persistence/date; systemctl stop openhab2; zramctl; reboot`.
So if that worked it would be strange; the odd thing is that sometimes it works, but most of the time it does not.
Have you tried a fresh install to see if it is an isolated issue?
just trying ATM
The same thing happened with a fresh install.
After some playing back and forth I came up with the zram-sync.service below. That now seems to work, although it does not explain why it's needed (zram-config alone already should do it).
And as the problem seemed to be intermittent, I'm not sure it won't happen again despite this fix.
I've added this to the openhabian-zram repo so it gets deployed on install. Hence there's no need to fix anything inside openHABian for the time being.
```ini
[Unit]
Description=zram sync on shutdown/reboot
PartOf=openhab2.service
Before=reboot.target shutdown.target poweroff.target halt.target

[Service]
Type=oneshot
ExecStart=/usr/local/bin/zram-config stop
ExecStop=/usr/local/bin/zram-config stop
TimeoutSec=0
RemainAfterExit=true

[Install]
WantedBy=basic.target reboot.target shutdown.target poweroff.target halt.target
```
> Is it because ZRAM is PartOf openhab2 and openhab2 is not explicitly stopped on reboot?
Rereading this, I think I skipped it because I misunderstood... do you know if openhab2 stops on reboot? I didn't even think of that possibility.
No idea. Does running `sudo systemctl stop zram-config.service` work properly? If so, then it's a bug somewhere in the reboot process that causes it not to run.
I now do `date | sudo tee /var/lib/openhab2/persistence/date; zramctl; reboot` and look at the file after boot. Sometimes it's updated on boot, sometimes it is not. I JUST DON'T UNDERSTAND IT, damn it.
The problem must be in the shutdown sequence.
[Getting additionally mad because asking in parallel some non-UNIX guy to reproduce on his system, having to tell him every simplest step. I rather need a pro to work with.]
Could you try reproducing on your system? Update to master and uninstall+reinstall ZRAM to get zram-sync and the latest changes working before you test.
At times I noticed that `egrep -i 'start|stop' /usr/local/share/zram-config/log/zram-config.log` shows no entries? zram-config should create them when it runs.
Just found a hint, but I'm not sure what to make of it. The lines below containing "... to break ordering cycle ..." are in red. So we have a circular dependency? That would explain the non-reproducible nature of the errors, because the cycle will be broken at some arbitrary place, sometimes resulting in a proper unmount/sync and sometimes not. But I don't fully understand yet where the cycle exactly is, or whether there are multiple.
It consists of at least:

A) local-fs.target,
B) the srv-* mount units that loopback-mount /etc/openhab2/... to /srv for Samba sharing, and
C) zram-config.

C) must run after A) (that's how the original author did it, and it makes sense, as we need to have a dir mounted before we can turn it into a zram one). B) should be after A), too, because the dir must exist before we can create a loopback-mount copy in /srv. B) must also be started after C), else it's sharing the lowerFS of the ZRAM'ed dirs and not the upper/union FS containing the most up-to-date files. But why does A) depend on any of B) or C), directly or indirectly?
Log 1 with the simple cycle:

```
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found ordering cycle on local-fs.target/stop
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found dependency on srv-openhab2\x2duserdata.mount/stop
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found dependency on zram-config.service/stop
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Job local-fs.target/stop deleted to break ordering cycle starting with zram-config.service/stop
Aug 09 15:11:17 devpi systemd[1]: Stopping zram-config...
```
Log 2 with another cycle:

```
-- Logs begin at Sun 2020-08-09 13:09:38 CEST, end at Sun 2020-08-09 14:29:41 CEST. --
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found ordering cycle on local-fs.target/start
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on srv-openhab2\x2daddons.mount/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on zram-config.service/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on basic.target/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on sockets.target/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on avahi-daemon.socket/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Found dependency on sysinit.target/stop
Aug 09 14:11:02 devpi systemd[1]: sysinit.target: Job local-fs.target/start deleted to break ordering cycle starting with sysinit.ta
Aug 09 14:11:03 devpi systemd[1]: Stopping zram-config...
```
(Hell, it would be easiest to remove that /srv crap, but please help me find a more elegant solution.)
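To spot these cycle reports quickly after a boot, one can grep a journal dump for the relevant messages. A small sketch; `boot.log` is a hypothetical dump file name (in real use, create it with `journalctl -b --no-pager > boot.log`), and a minimal fake dump is embedded here so the sketch is self-contained:

```shell
#!/bin/sh
# List systemd ordering-cycle reports from a saved journal dump.
# boot.log is a hypothetical file name; normally you'd create it with:
#   journalctl -b --no-pager > boot.log
# Here we fake a minimal dump so the sketch runs on its own.
cat > boot.log <<'EOF'
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found ordering cycle on local-fs.target/stop
Aug 09 15:11:04 devpi systemd[1]: zram-config.service: Found dependency on zram-config.service/stop
Aug 09 15:11:17 devpi systemd[1]: Stopping zram-config...
EOF

# Print only the cycle-related lines:
grep -E 'Found ordering cycle|Found dependency on|break ordering cycle' boot.log
```

If this prints nothing on a real dump, that boot had no ordering cycles.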
Hmm, I've apparently managed to get rid of the cycle. It's not showing errors now, but it doesn't help. It seems I still cannot get the script to execute on shutdown.
I'll investigate it some this afternoon; I am not sure what is happening. Does making openHAB and frontail stop before shutdown, in addition to zram-config, help?
I've made more investigations but fail to find the root cause.
First, I got rid of the cycle. Ensure you have none (use `journalctl --no-pager` to see the red warnings for cycles).
Use latest zram-config service https://github.com/mstormi/openhabian-zram/blob/master/zram-config.service
I've added (and meanwhile retracted) a zram-sync service. Retracted because I believe zram-config IS properly being executed on every reboot/shutdown, i.e. `zram-config stop` is always called, so that cannot be the problem. You can check with `journalctl -xu zram-config --no-pager`.
But I fail to understand why it sometimes does not sync a file from ZRAM back to SD and sometimes it does.
Usually I test with `date | sudo tee /var/lib/openhab2/persistence/date; sync; reboot` and see if the file contains the updated timestamp.
Is it possible that the stop command is not getting enough time to finish, or the system is already shutting down certain areas that ZRAM needs to sync before ZRAM can get to them?
I've encountered at least two cases. See /usr/local/bin/zram-config. The first case is that the lazy unmount (`umount ... -l`, around line 181) in `removeZdir()` fails and `mergeOverlay()` isn't called. In the second case it is called, but the list of files to sync is empty, and I don't understand why.
Possibly because of global variables not being changed? I see what issue you are running into, and agree that it appears to be caused by the ZRAM code.
As I said, it may be best to completely rewrite ZRAM and use it for our purposes.
> As I said, it may be best to completely rewrite ZRAM and use it for our purposes.
I can get you write access to my repo if you like... and the original author no longer works on it and has offered to transfer the project to anyone willing to continue. But it's not just a matter of rewriting it for our purposes. Frankly, I wouldn't know what to do at that point :-|
See https://github.com/StuartIanNaylor/zram-config/issues/41#issuecomment-672611672
Don't know how to resolve this for the time being.
But I came to think that with #1077, I can introduce a "zramsync" service to sync zram dirs to external disk on shutdown and to restore from there on boot.
Possibly. It may be a better idea to look at the OverlayFS code and see if we can find a proper fix though.
> look at the OverlayFS code and see if we can find a proper fix though.
Feel free to give it a try, but given that the ZRAM author himself didn't succeed there (see his comments), I'm sceptical.
Hopefully fixed by https://github.com/mstormi/openhabian-zram/commit/589ca5984b2cd196a10a66638d2336994772792a. This rsyncs the ZRAM dirs only (not the "log" dirs) to the (extra) disk on shutdown and recovers them again on boot.
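In outline, the mechanism looks roughly like this. A simplified sketch with a hypothetical `zsync` helper and throwaway demo dirs, not the actual zramsync script: the real one iterates over the zram-device-list and uses rsync, while `cp -a` stands in here to keep the sketch dependency-free:

```shell
#!/bin/sh
# zsync stop  <zramdir> <backup> : save zram dir contents to disk at shutdown
# zsync start <zramdir> <backup> : restore them at boot
# (hypothetical helper name, not the actual zramsync script)
zsync() {
  mode=$1; zdir=$2; bak=$3
  case "$mode" in
    stop)  mkdir -p "$bak"  && cp -a "$zdir/." "$bak/" ;;   # zram -> disk
    start) mkdir -p "$zdir" && cp -a "$bak/." "$zdir/" ;;   # disk -> zram
  esac
}

# Demo on throwaway dirs standing in for /var/lib/openhab2/persistence
# and the backup location on the extra disk:
mkdir -p demo/zramdir
date > demo/zramdir/date
zsync stop  demo/zramdir demo/backup    # what runs at shutdown
zsync start demo/zramdir demo/backup    # what runs at boot
ls demo/backup                          # prints "date"
```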
Download zramsync to /usr/local/bin and update zram-config.service at /etc/systemd/system/, both from the openhabian-zram repo. Please help with testing. Use `journalctl -xu zram-config.service` to see the last runs (both stop and start).
I've attempted multiple forms - rsync and tar, the joined FS and just the upper FS in /opt/zram/zramX - and I think they all worked. But it just doesn't sync.
I'm getting mad at this. I must be overlooking something serious. Please Ethan or Holger take over or at least a look. I just cannot find out why it doesn't work. Another pair of eyes sees different things than mine do I hope.
When I use zramsync interactively, everything works. But called from the service, it doesn't. Ensure the service is running, then grab the output of `journalctl -xu zram-config` or `journalctl -xe` to see for yourself and please help me find out why. I see zramctl output all dirs, but on accessing /opt/zram/zramX/upper, it's gone. Or the file I created for the test just isn't there. I even created yet another service. It runs, but the modified file again isn't there. As I said more than once already, I'm getting mad at this.
At times while experimenting, zramsync didn't run because of a dependency cycle, so watch out for that. But with the zram-config.service in the openhabian-zram repo there are no cycles, I believe, so that's probably not the root cause.
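One classic difference between an interactive run and a service run is the environment. A debugging sketch that records it at the top of a zramsync-style script; the log path here is a hypothetical stand-in:

```shell
#!/bin/sh
# Debug header for a zramsync-style script: record when and as whom we
# ran, so the log shows whether the service invocation differs from an
# interactive one. LOG is a hypothetical path for this sketch; in a real
# deployment it would live on a non-zram filesystem to survive shutdown.
LOG=${LOG:-zramsync-debug.log}
{
  echo "run at: $(date)"
  echo "user:   $(id -un)"
  echo "PATH:   $PATH"
  echo "cwd:    $(pwd)"
} >> "$LOG"
echo "debug info appended to $LOG"
```

Comparing two such records (one interactive, one from the service) can reveal PATH or working-directory differences that break the script only under systemd.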
I have been working on a rewrite and am testing it. Also, overlayfs was recently updated; it may be fixed now. I'll update you when I have more information on testing.
Glad to hear it. But mind that I think all the versions of zramsync I tried were fine, and that the problem is not in my new or existing code. In addition to that - but not as the root cause - it was not always executed on shutdown while I was tinkering with systemd options, but I think at least that's working now.
> Also overlayfs was recently updated

Get me a pointer describing those changes, please.
I had a couple of outages on my production system, and most interestingly, zram-config.service didn't start. A mount unit with `Before=zram-config.service` didn't come up either. I could not find out why, but there must be reasons inside the systemd dependency chain. When I manually issued start for those two, they started fine.
@mstormi maybe we should add some syslog logging functionality to our services to ease tracking. Did you look into journalctl - no hint as to what caused the service to fail?
Unfortunately not. Apparently it already wasn't up before the last-but-one outage yesterday, and the journal is on /var/log, hence ZRAM, so no output has been recorded since yesterday :( I already added "-x" to the zram scripts so all lines appear as debug output in `journalctl -xu ...`, but that doesn't help in tracking why the service doesn't start. systemd simply seems not to issue the start.... Yes, anything such as logging to help untangle this is good. Any more concrete ideas?
`journalctl -b -p err` does not help? Use `logger` to put additional info into the log.
> `journalctl -b -p err` does not help?
No. I see an NFS mount unit failing (but not why), but that's it. I've seen a message about an openhab2 timeout and have now set all timeouts to 0 for a test.
still trying to debug
Latest finding: I have zram-config.service properly executing start and stop on reboot. I execute a script and add debug output in that script. On stop, the ZRAM dirs seem to already be unmounted by the time the stop script executes. Any idea how to find out what unmounted them beforehand?
Log of the stop script (on a freshly installed box); note the mountpoint:
```
[16:01:43] root@openhab:/storage/zram# cat zramsync.log
zramsync creating backup @ Mon 24 Aug 16:00:41 CEST 2020
inFile=/usr/local/share/zram-config/zram-device-list
swap /zram0 zram-config0
dir /zram1 /var/lib/openhab2/persistence /persistence.bind
log /zram2 /var/log /log.bind
NAME       ALGORITHM DISKSIZE  DATA  COMPR TOTAL STREAMS MOUNTPOINT
/dev/zram2 lzo-rle       500M 17.2M  81.9K  432K       4
/dev/zram1 lzo-rle       500M 16.4M  10.1K   96K       4
/dev/zram0 lzo-rle       600M    4K    87B   12K       4 [SWAP]
overlay2 479560 9396 434324 3% /var/log
/opt/zram/zram1/upper
zramsync end run @ Mon 24 Aug 16:00:41 CEST 2020
```
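The zram-device-list shown in that log drives which dirs a sync script has to handle. Assuming the three-column format above, a parsing sketch; the here-doc stands in for the real file at /usr/local/share/zram-config/zram-device-list:

```shell
#!/bin/sh
# Print the directories backed by zram (skipping swap devices) from a
# zram-config device list. The here-doc mirrors the format in the log;
# the real file lives at /usr/local/share/zram-config/zram-device-list.
cat > zram-device-list <<'EOF'
swap /zram0 zram-config0
dir /zram1 /var/lib/openhab2/persistence /persistence.bind
log /zram2 /var/log /log.bind
EOF

# Column 1 is the entry type, column 3 the bound directory:
awk '$1 != "swap" { print $3 }' zram-device-list
# prints /var/lib/openhab2/persistence and /var/log
```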
Just found out about the `RequiresMountsFor=` option in .service units. Added that to the zram-config.service in the openhabian-zram repo.
Looks good at first sight, cross your fingers ....
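`RequiresMountsFor=` adds both `Requires=` and `After=` dependencies on the mount units needed for the listed paths. In the unit it might look like this (a sketch of the idea, not the repo's exact file):

```ini
[Unit]
Description=zram-config
# Pull in, and order ourselves after, the mounts backing the zram'ed
# paths, so they are mounted at start and still mounted when we stop:
RequiresMountsFor=/var/lib/openhab2/persistence /var/log
```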
I'm hopeful; this has been a real PITA. Ideally that would allow for proper execution without the need for zramsync.
Unfortunately not (I tried right away, of course), but at least we only need a single service.
So it is working correctly and syncing with zramsync now, right?
Apparently yes but I'd like others to validate.
I will test this afternoon.
If need be, use the latest image with `clonebranch=testbuild` so you can validate multiple PRs in one go (else you need to fetch the latest openhabian-zram repo).
I have to wait to test the other PRs until I get a USB SD card reader, as I don't have one available at the moment. I'll just uninstall and reinstall ZRAM, as that will update the repo.
I don't think it is working correctly, as `RequiresMountsFor=` would prevent it from starting zram on boot: those dirs would not be mounted yet and it would therefore fail.
See https://github.com/mstormi/openhabian-zram/pull/5 for continued info
I presume you have seen https://community.openhab.org/t/mapdb-no-restore-at-startup-for-item-status/102835/61. But I just don't know what else to check, now that zram start seems to always work on my test system. My hope is just that it's because of storage.mount and the dependencies introduced in #1077.
Did you have him reinstall the timesync dhcp hook to see if that fixes the issue?
Not explicitly, but I had him reinstall, and as that's in master now, it should be there.
See my latest post at https://community.openhab.org/t/mapdb-no-restore-at-startup-for-item-status/102835/66
Not sure why local-fs.target depends on srv-openhab2*.mount. The /srv/... mounts are loopback mounts that require the mount they loop back to to exist, so `After=local-fs.target` in their unit definitions should be about right. Possibly any .mount unit is implicitly part of local-fs.target?
See also https://unix.stackexchange.com/questions/519221/how-can-i-solve-this-ordering-cycle-in-a-mount-unit
I'm thinking of removing those /srv mounts and exporting the real dir(s) instead, but that's only the second-best option. If possible I'd like to keep them.
As there are reports of mapdb persistence data (running on ZRAM) not being restored, I've investigated how zram behaves and found out that if you simply 'reboot', changes are sometimes lost.
I'm asking all of you for help. I just don't understand what's going on here.
[Hint for debugging: the lower FS contains the directory as it existed when ZRAM started for that dir. It's mounted under /opt/zram/*.bind, so I use `date > /var/lib/openhab2/persistence/date` and `cat /opt/zram/persistence.bind/date /var/lib/openhab2/persistence/date` after startup to see the last modification date.]

On boot, the persistence files are what they used to be at the last zram start. When you manually run `systemctl stop zram-config` before rebooting, files are apparently correctly synced to the lower FS (on the SD card). So this seems to be a systemd config problem!? Apparently there's no `systemctl stop zram-config` being carried out? Anybody proficient in debugging systemd startup/shutdown order? Preliminary investigations show `zram-config stop` is called, but it does not always work.
Another possibly relevant detail: freshly after boot, with oh2 running, I get a warning although I don't modify any systemd config files. There's a known bug that outputs this, but I'm not sure if that is what I'm hitting, because it's said to be just cosmetic.