kelinger / OmniStream

Deployment and management tools for an entire streaming platform that can reside on a server (local, remote, hosted, VPS) with media files stored on cloud services like Google Drive or Dropbox.
MIT License

Omnimount restarts #28

Closed (mahogl closed this issue 1 year ago)

mahogl commented 1 year ago

omnimount_inspect.txt

Don't know if it is just me, but for some reason OmniMount restarts quite often, and then all the other containers that depend on OmniMount restart as well. It often looks like it's triggered by load, since it frequently happens when a download starts. It does fix itself and everything is running afterwards, but sometimes it causes some failed downloads and Sonarr/Radarr try to download another version.

Don't know for sure when it started, but it might have been a few months ago. At first I thought it was related to the server I was running on, but I have now changed servers and the problem is the same. By the way, the backup/restore feature is working great; I had the new server up and running in no time.

Let me know if there is any data I can provide so you can look into the problem when you have some time.

kelinger commented 1 year ago

I think there are some tweaks that still need to occur. In pre-release versions, we had an issue where OmniMount wouldn't catch failures and the results were less than ideal: Plex would be "up" but all videos would be missing, for example. The problem is that in Linux, you create an (ideally empty) directory and then mount a remote on top of it. The path is the same but the directory no longer looks empty. When the mount is dropped, the empty directory remains.

With Docker, we're essentially doing the same thing, so you have an external directory mounted to something internal to the container. But since that external directory is actually your Google Drive (for example) mounted over that empty directory, the sign of a failure is that the container now sees the empty directory, not a missing directory. Furthermore, when the mount is re-established (often in seconds or even under a second), Docker still has the empty directory mounted in the container rather than the newly overlaid remote mount. This causes things like downloads or changes to be written (at best) to the empty directory, causing it to a) no longer be empty, which could cause mount issues later on, and b) hold files that aren't where you expect them to be.
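To picture it outside of Docker, the sequence is roughly this (the remote name and path here are just placeholders, not OmniStream's actual values):

# empty directory that becomes the mount point
mkdir -p /mnt/remote

# mount an rclone remote on top of it
rclone mount myremote: /mnt/remote --daemon
ls /mnt/remote   # shows the remote's files

# if the mount drops (or is unmounted), the same path
# silently falls back to the empty local directory
fusermount -u /mnt/remote
ls /mnt/remote   # empty again, but still a perfectly valid path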

Now, we have a much better process but it's also sort of brute force. Any failures immediately trigger a container restart. If only Plex loses its mount, only Plex will restart. But if OmniMount loses it, all containers dependent on it will also restart when OmniMount restarts.

If you find it is restarting way too frequently, there are a few undocumented tweaks we have already set up for purposes like these. You can change OmniMount's caching, for example, by creating a file called vfs.conf in the root of the OmniMount configs directory. In it, put any combination of these variables with appropriate settings to see if the changes help (default values are shown here):

VFSMAX=100G
VFSAGE=48h
VFSPOLL=5m
VFSREAD=2G
VFSCACHE=yes

Don't expect a lot of error checking here so be sure to use proper values as documented by Rclone. But, if disk space is minimal, for example, you'll want to reduce the cache size from 100G to something like 20G or maybe even turn off the caching with VFSCACHE=no.
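For example (these particular numbers are only an illustration of the format, not a recommendation), a vfs.conf for a box that is tight on disk space might look like:

VFSMAX=20G
VFSAGE=24h
VFSPOLL=5m
VFSREAD=1G
VFSCACHE=yes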

There's also a variable called TURBOMAX in the configs. This isn't a menu option, but you can edit the raw OmniStream config file with the omni edit command. Look for the line that says TURBOMAX=20 and change it to something lower. This represents how many parallel operations TurboSync will attempt when syncing files up to Google. I ran into similar issues when I was using a slower, "less endowed" server for testing purposes and had to reduce it to 10. Before doing so, things appeared to work OK, but if my server decided to add an entire TV series (versus one new episode) I'd find things in a really bad state. Honestly, the change really won't be noticeable unless you're sitting around watching it download/upload.
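In other words (10 is just the value that happened to work on my test server):

# after opening the raw config with: omni edit
# change TURBOMAX=20 to something like:
TURBOMAX=10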

mahogl commented 1 year ago

Thanks for the reply. I have been testing out different cache settings without success so far; it does not look like it makes a difference. I will try out TURBOMAX=10 and see if that makes a difference, because my experience is that it happens under heavy disk load, for example if I try to manually import a TV series like you mentioned. One episode is usually not a problem; then it's just a sporadic restart now and then.

The specification of my server:
CPU: Intel Xeon E3-1220 v3 - 3.1 GHz - 4 core(s)
RAM: 32GB - DDR3
Hard Drive(s): 2x 1TB (SSD SATA) (RAID 1)
Bandwidth: Unmetered @ 1Gbps
OS: Debian 10

What do you think about testing out increasing the interval and timeout on the OmniMount health check? From what I can tell it's not related to the rclone sync, since it crashes before the sync starts. So I think the problem is related to the health check, or maybe to some problem with MergerFS?

healthcheck:
  test: ["CMD-SHELL", "/omnimountcheck"]
  interval: 10s
  timeout: 5s
  retries: 1
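Something like this is what I had in mind (the exact values are just a first guess on my part):

healthcheck:
  test: ["CMD-SHELL", "/omnimountcheck"]
  interval: 30s
  timeout: 10s
  retries: 1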

mahogl commented 1 year ago

Omnimount log during restart:

omnimount | rclone v1.60.0
omnimount | - os/version: debian 11.5 (64 bit)
omnimount | - os/kernel: 4.19.0-22-amd64 (x86_64)
omnimount | - os/type: linux
omnimount | - os/arch: amd64
omnimount | - go/version: go1.19.2
omnimount | - go/linking: static
omnimount | - go/tags: none
omnimount |
omnimount | mergerfs version: 2.33.5
omnimount |
omnimount | Starting vnstat
omnimount |
omnimount | Configuration:
omnimount | MERGEMOUNT=cloud
omnimount | RCLONESERVICE=gcrypt
omnimount | RCLONEMOUNT=gcrypt
omnimount | UNSYNCED=unsynced
omnimount | UPLOADCACHE=uploadcache
omnimount | MEDIA=media
omnimount | TURBOMAX=10
omnimount | Adding group `omniuser' (GID 1000) ...
omnimount | Done.
omnimount | Adding user `omniuser' ...
omnimount | Adding new user `omniuser' (1000) with group `omniuser' ...
omnimount | Creating home directory `/home/omniuser' ...
omnimount | Copying files from `/etc/skel' ...
omnimount | Starting services
omnimount | OmniMount Caching: enabled
omnimount | {
omnimount | "jobid": 1
omnimount | }
omnimount |
omnimount | Startup complpete
omnimount | Received request to shutdown
omnimount | fusermount: extra arguments after the mountpoint
omnimount | umount: /mnt/gcrypt: target is busy.
omnimount | rmdir: failed to remove '/mnt/gcrypt': Device or resource busy
omnimount | rmdir: failed to remove '/mnt/uploadcache': Directory not empty
omnimount | rclone v1.60.0
omnimount | - os/version: debian 11.5 (64 bit)
omnimount | - os/kernel: 4.19.0-22-amd64 (x86_64)
omnimount | - os/type: linux
omnimount | - os/arch: amd64
omnimount | - go/version: go1.19.2
omnimount | - go/linking: static
omnimount | - go/tags: none
omnimount |
omnimount | mergerfs version: 2.33.5
omnimount |
omnimount | Starting vnstat
omnimount |
omnimount | Configuration:
omnimount | MERGEMOUNT=cloud
omnimount | RCLONESERVICE=gcrypt
omnimount | RCLONEMOUNT=gcrypt
omnimount | UNSYNCED=unsynced
omnimount | UPLOADCACHE=uploadcache
omnimount | MEDIA=media
omnimount | TURBOMAX=10
omnimount | addgroup: The group `omniuser' already exists.
omnimount | adduser: The user `omniuser' already exists.
omnimount | Starting services
omnimount | OmniMount Caching: enabled
omnimount | {
omnimount | "jobid": 1
omnimount | }
omnimount |
omnimount | Startup complpete
omnimount | fusermount: extra arguments after the mountpoint
omnimount | Received request to shutdown
omnimount | umount: /mnt/gcrypt: target is busy.
omnimount | rmdir: failed to remove '/mnt/gcrypt': Device or resource busy
omnimount | rmdir: failed to remove '/mnt/uploadcache': Directory not empty
omnimount exited with code 143
omnimount exited with code 0

mahogl commented 1 year ago

Short update: changing the cache size, or turning the cache off entirely, did not make any difference, and the same goes for the TurboSync setting.

I've changed the healthcheck, so let's see how that works out.

healthcheck:
  test: ["CMD-SHELL", "/omnimountcheck"]
  interval: 10s
  timeout: 5s
  retries: 3

mahogl commented 1 year ago

Looks like the trick with the healthcheck is working; no restarts since the change to the configuration.

kelinger commented 1 year ago

@mahogl - Is the only change you made the retries value (3 instead of 1)?

mahogl commented 1 year ago

@kelinger - Yes, the only change is 3 instead of 1. My theory was that during heavy I/O operations it would fail one health check and then restart OmniMount, and then I got cascading restarts of the other containers that depended on OmniMount.

The results from the testing are still good: no restarts since 28.10.22. I have also tried to get it to crash with stress tests using heavy I/O operations like downloading and importing whole TV series.

kelinger commented 1 year ago

I'm unclear whether Docker's use of the term "retries" includes the initial attempt. In other words, my logic was that "1 retry" meant that it failed twice in a row (failed on a health check and then failed again on the retry 10 sec later). Obviously, if "retries = 1" means "one failure" then absolutely... there's no room for any delayed responses or false positives.

I've switched the project over to use 3 for this now, so if you want to revert to the "official version" (which is updatable), feel free. It's possible I may reduce it to 2, but I see no reason to do so now. The obvious drawback is that a true failure will now take 30-40 seconds to be "seen" (three consecutive failed checks at a 10-second interval), meaning that other containers could be writing to useless directories, or that Plex et al. decide to remove your library since it can't be found anymore; that would be the best reason to decrease the number of attempts.

Anyhow, nice find and it's now part of the official build!

mahogl commented 1 year ago

Sounds good. From the documentation I have read, retries specifies the number of consecutive health check failures required to declare the container status as unhealthy. The health check will only try up to the specified number of retries; if the container fails that many consecutive checks, it is then considered unhealthy.

The default setting from Docker is 3, so I think it's fine leaving it at that; I have not experienced any problems with it. Thank you for including it :)

Source: https://docs.docker.com/engine/reference/builder/#healthcheck