mutagen-io / mutagen

Fast file synchronization and network forwarding for remote development
https://mutagen.io

Reducing scan times on systems without native recursive watching #87

Closed qkdreyer closed 2 years ago

qkdreyer commented 5 years ago

Following up on #81, I've also noticed that I need to wait ~5s before changes propagate to my docker sync container, but my alpha URL is a local path on macOS and my beta URL is a docker path on the same machine. (Using Docker Desktop Community 2.0.3.0 on the edge channel.)

I'd be delighted to help you find out what is causing this delay. We could start investigating with this sample screen recording: http://recordit.co/Bk3JlWaIwN

xenoscopic commented 5 years ago

Thanks for the recording, that's super helpful!

It looks like Mutagen's taking a few seconds to perform rescans, which is a bit long. As a rule of thumb, they should take about as long as a git status command on a repository of comparable size.

Can you tell me which filesystems are being used on each endpoint? I'd assume APFS on macOS? Is the Docker container using some type of virtual or network filesystem?

Also, can you give me an idea of file counts (e.g. with find . | wc -l) and size (e.g. with du -sh .) within the synchronization root?

One good place to start with debugging would be to run the scan_bench tool found in the Mutagen repository. It can be built with Go 1.11+ using go build scripts/scan_bench.go and run as ./scan_bench <path>. Its output will give more information on scan sizes and timings.
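For reference, a consolidated version of those steps (the ~/my-project path is a placeholder for your own synchronization root):

# Build the benchmark tool from a checkout of the Mutagen repository (needs Go 1.11+).
go build scripts/scan_bench.go
# Run it against the synchronization root; replace the path with your own.
./scan_bench ~/my-project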

One quick idea would be to ignore node_modules directories (if any) from synchronization (just a guess based on your screencast). These generally aren't particularly useful to synchronize, especially if they have any native extensions built that aren't cross-platform.
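As a rough sketch (the local path and container name below are placeholders, and the flag syntax is the same one shown further down in this thread), the exclusion could be added when creating the session:

# Hypothetical session creation that excludes node_modules from synchronization;
# ~/my-project and app.sync are placeholders for your own endpoints.
mutagen create --ignore=node_modules ~/my-project docker://app.sync/var/www/app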

qkdreyer commented 5 years ago

> Can you tell me which filesystems are being used on each endpoint? I'd assume APFS on macOS? Is the Docker container using some type of virtual or network filesystem?

I'm indeed using APFS. As for the Docker container, I've been following https://github.com/totara/totara-docker-dev/pull/34 to make it work, and the setup looks like this:

CONTAINER_PREFIX=app
CONTAINER_PATH=/var/www/app

services:
  sync:
    image: alpine:latest
    container_name: ${CONTAINER_PREFIX}.sync
    command: tail -f /dev/null
    working_dir: ${CONTAINER_PATH}
    volumes:
    - ${CONTAINER_PREFIX}.sync:${CONTAINER_PATH}:nocopy
volumes:
  app.sync:

> Also, can you give me an idea of file counts (e.g. with find . | wc -l) and size (e.g. with du -sh .) within the synchronization root?

This was already shown in my screencast:

find . | wc -l
  117022
du -sh .
2,3G    .

In fact, I have 3 big folders:

markshust commented 5 years ago

Chiming in here. I'm also seeing 3-5 second delays before file changes propagate to Docker. I'm assuming this is due to the large filesystem (using Magento 2 here).

I'm looking for a way to set and forget. I'd like to ignore the vendor folder and similar, but I still need it to sync from time to time. It would be nice to control the priority of specific files & folders, so that writes in certain folders are pushed to the end of the sync queue and updates in others are pushed to the top (not sure if that's possible).

app@40508b3fdef9:~/html$ find . | wc -l
106344

app@40508b3fdef9:~/html$ du -sh .
795M    .

app@40508b3fdef9:~/html/vendor$ find . | wc -l
75775

app@40508b3fdef9:~/html/vendor$ du -sh .
523M

markshust commented 5 years ago

Yep, my suspicions were correct: if I ignore the VCS and vendor folders, reloads are instantaneous:

--ignore=vendor --ignore-vcs

xenoscopic commented 5 years ago

@markshust (and @qkdreyer) There are some changes (037e713, 858f908, db2494f) coming in v0.9.0 that will add parallel fstatat invocation on multi-core systems, which provides a fairly significant performance boost. Fortunately for future-proofing but unfortunately for performance, Go 1.11 and 1.12 progressively switched from raw syscalls to routing through libSystem on macOS, which significantly hurt the performance of Mutagen's filesystem scans (making them take about twice as long); these changes will recoup a large portion of that loss.

Even with these changes, ignoring content (especially platform-specific content like node_modules/ or virtualenv directories) is definitely going to be the way to go to get faster reloads. --ignore-vcs is highly recommended (see here for technical reasons and #76 for an additional warning) and will probably become the default in a coming release.

Toilal commented 5 years ago

I wrote a Python application, mutagen-helper, just released today, that wraps the mutagen binary and helps you manage your sessions.

Put the configuration in a YAML file inside a directory to sync, and run mutagen-helper up / mutagen-helper down to create / terminate sessions.

This could help people who have performance issues by splitting a sync session into multiple sessions and multiple configuration files, and starting/stopping them when required with a simple command.

xenoscopic commented 5 years ago

I'd like to give a quick update on performance efforts here, as well as a summary of the current Mutagen bottlenecks and the plans to alleviate them (CC @saulfautley).

First, Mutagen v0.9.0-beta2 is now available and has a number of optimizations, fixes, and features that I hope will alleviate some of the performance woes that you've experienced. The most relevant change is the option to perform accelerated scans. This feature is still experimental, so it's not enabled by default, but it's easy to turn on for a session with --scan-mode=accelerated. On systems with native recursive watching (macOS and Windows), watch data will be used to avoid rescanning the disk whenever possible. On systems without native recursive watching (e.g. Linux and the BSDs), where poll-based watching is enabled (which it is by default), the last filesystem scan generated by the watching will be returned immediately (instead of waiting for a new scan). This should make synchronization significantly more responsive.
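For example (the endpoint URLs below are placeholders), the experimental mode can be enabled per-session like this:

# Hypothetical example: create a session with experimental accelerated scanning enabled.
# ~/my-project and app.sync are placeholders for your own endpoints.
mutagen create --scan-mode=accelerated ~/my-project docker://app.sync/var/www/app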


More generally, the performance issues with multi-GB, high-file-count synchronization roots in Mutagen come down to two things:

1. Bandwidth limitations when transferring files initially.
2. Creating a full snapshot of the synchronization root on each synchronization cycle (which is unfortunately a requirement of Mutagen's safety-conscious three-way merge algorithm).

There are other areas where Mutagen's performance could be improved, but these are the big, O(n), user-perceptible performance issues.

There's not much that Mutagen can do about the first issue. It already transfers changes as efficiently as possible using the rsync algorithm, but an initial sync of GBs of files is going to take as long as it takes given bandwidth constraints. About the only optimization that Mutagen could potentially do here would be to switch to a raw TCP-based transport (cutting out the overhead of ssh or docker exec). That's something that's on the horizon, but it would be a marginal (~20%?) improvement. The best way to solve this issue is to pre-populate files on the remote and then use Mutagen to keep them synchronized. There's just no faster way to push that much data through a pipe with bandwidth on the order of MB/s.
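As a sketch of that pre-population approach with a setup like the one above (the container name and paths are placeholders), the bulk copy can happen once up front and Mutagen can take over from there:

# Hypothetical one-time pre-population of the remote volume, followed by session creation.
# app.sync and the paths are placeholders for your own container and directories.
docker cp ./. app.sync:/var/www/app
mutagen create ~/my-project docker://app.sync/var/www/app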

The second issue is the real performance pain point with Mutagen, and it's where most of the optimization focus has been. The accelerated scans mentioned above lay the foundation for fixing this problem, but they're only truly helpful on systems with native recursive watching. Systems without native recursive watching (Linux et al.) simply don't have the facilities to ensure total, race-free watching of a synchronization root, and thus a full rescan (i.e. recursive usage of readdir and fstatat) is required to ensure that Mutagen has an accurate picture of what's on disk. If we can find a mechanism for recursive watching on these systems, then we can easily integrate it with the accelerated scan infrastructure and bring scan times down by orders of magnitude.

Programs like Watchman simulate recursive watching on these systems by starting and stopping non-recursive watches based on other watches, but the process is an approximation and prone to missing events; even after years of work, it's not perfect. Moreover, systems without native recursive watching generally have low default limits on the number of watches that they can establish. On Linux, each inotify watch requires a watch descriptor, and the OS generally limits the number of active watch descriptors to a few thousand by default; this limit is also per-user, not per-process. On BSD systems, which use kqueue for watching, each watch requires a file descriptor, meaning that the maximum number of open files limits the watch count (and, if reached, causes problems for the rest of the program). It is possible to increase these limits, but doing so requires manual intervention (and in some cases superuser permissions). And then there's the problem of scalability: individual watches on tens or hundreds of thousands of files, which is the scale we're talking about here, would significantly strain system resources, probably more than Mutagen's rescans do.
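For reference, these limits can be inspected (and, with superuser permissions, raised) roughly like this; the value below is purely illustrative:

# Linux: check and raise the per-user inotify watch limit (requires root).
sysctl fs.inotify.max_user_watches
sudo sysctl fs.inotify.max_user_watches=524288
# BSD/kqueue: each watch consumes a file descriptor, so the open-file limit is what matters.
ulimit -n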

My feeling is that approximating recursive watching on Linux (and other platforms) is a non-starter due to the fact that it would be prone to missing events, wouldn't scale to the number of files that we're talking about here, and would require manual superuser intervention to scale at all.

I think that the best option is for Mutagen to attempt to use Linux's fanotify API to perform recursive watching. The reason that Mutagen doesn't do this now is that fanotify is (or at least was) extremely limited in terms of the events that it can detect. It also requires superuser permissions. However, Linux 5.1 significantly expanded the fanotify API, adding more granular events. Additionally, since much of Mutagen's target use case is running inside containers, where root access isn't outside the realm of possibility, it might be possible to access this API.
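To illustrate the container angle (the kernel check and docker flags below are only an example, not how Mutagen does or will do this), fanotify generally requires CAP_SYS_ADMIN, which a containerized agent could be granted explicitly:

# Check the kernel version of the host/VM running the containers
# (the expanded fanotify events need Linux 5.1+).
uname -r
# Hypothetical example of granting a container the capability that fanotify requires.
docker run --cap-add SYS_ADMIN alpine:latest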

If we can make fanotify work, and that's definitely my next avenue of research, then I think that these scan performance problems will largely evaporate. It won't help things on BSD systems and other platforms, but I think it will cover the 99.9% case. It will probably be Linux 5.1+ only (falling back to existing mechanisms on earlier systems), but then it's just a matter of waiting for that to roll out. It may not help RHEL/CentOS/Debian systems with ancient kernels, but it will become rapidly available for container environments with their more modern kernels.

Beyond that, there are a few other avenues of approach that I've been considering:

  1. Using cgo to perform readdir/fstatat quicker. I've prototyped this, but the results weren't worth the added complexity and fragility, and it's a non-starter on Linux (unless we statically link musl) because of symbol versioning and the various libc implementations.
  2. Performing fstatat in parallel. I've also prototyped this, and it showed some promise (reducing scan times by about 30%), but the implementation is complex and its scalability would need to be better understood. The POSIX spec also doesn't guarantee that it's safe to call fstatat concurrently with the same directory file descriptor, so it's a little risky in that sense.
  3. Multiple sessions or "subsessions" of some sort. @Toilal touched on this above. The idea would be prioritizing certain subtrees of the synchronization root based on something like frequency of access. Mutagen could hypothetically do some sort of change histogramming and attempt to identify locations that could benefit from being isolated as subsessions, but reliably automating this detection would be extremely complex, and I think that the user of the codebase would already be in a good position to inform Mutagen about how to chop things up. This could perhaps be done through flags to the create command (e.g. --subroot=assets --subroot=src ...), but I think that it would be better accomplished via scripting or a tool like the one @Toilal has created (thanks!).
  4. More aggressive acceleration. The current scan acceleration algorithm is conservative since the acceleration is experimental, but there are a few optimizations that I can add once the current implementation is better tested.

So that's the state of things. fanotify seems promising for the first time ever, the acceleration infrastructure is now there if an fanotify watching implementation can be created, and there are certainly more minor optimizations that can come once the big stuff is out of the way. In the meantime, there are user-informed heuristics that can reduce synchronization latency, e.g. creating multiple synchronization sessions and ignoring content that needn't be or shouldn't be synchronized.

At the end of the day though, there will still be limits. Synchronizing 100,000 files that add up to GBs is going to take time. But if Mutagen strives for an implementation where those cases are network-latency-limited, then everything else is going to be instant as well.

xenoscopic commented 2 years ago

Just one final update on this issue:

As far as optimizations go, I think the scanning is about as optimized as it can be on systems without recursive watching mechanisms like FSEvents, ReadDirectoryChangesW, and fanotify, so I don't think there's anything else to do on the original issue here. Performance should have improved significantly since Mutagen v0.8, due to changes in both Mutagen itself and the Go runtime.

Also, as of Mutagen v0.14, there is support for fanotify-based watching inside containers, which drastically reduces scan times (and resource usage) on Linux. It's currently only activated automatically by Mutagen Compose, but it can be used by other containerized setups as well. This support will probably expand over time.