High rescan time with watching disabled on endpoint

dbentley commented 7 months ago

We're using mutagen to hide underlying-FS latency issues for our cloud development servers. But it seems like mutagen has a weird 10+-second latency to do a sync?

More context:

users work against local SSD, which is fast but ephemeral.
the backing store is Lustre, which is slow for metadata operations, which especially slows down IDEs specifically and generally devtools like git or file scans. But it's persistent and durable against pod restarts.
We set up a mutagen one-way replica where alpha is local SSD and beta is lustre. We use --watch-mode-beta=no-watch so we don't have to scan lustre. I wrote a test that writes a file against alpha (SSD) and waits until it sees the same file in beta (lustre). The latency seems to be just over 10 or 20s, which makes me think there's some code path that waits for a timer or something? I used MUTAGEN_LOG_LEVEL=trace mutagen daemon run to measure timing info (along with log statements of the python process that writes and reads the files.) Any idea what's happening in the case below? Or how to reduce the latency? The line at 10:38:46.856329 is interesting; what caused that to happen then? Was mutagen waiting for the scan on beta? Thanks!

10:38:27,772 __main__:945 7 #0 wrote
2024-03-06 10:38:27.783688 [D] [sync.sync_cFcZqs97.alpha.polling] Received event-driven polling signal
2024-03-06 10:38:27.783755 [D] [sync.sync_cFcZqs97.alpha.polling] Performing filesystem scan
2024-03-06 10:38:27.979440 [D] [sync.sync_cFcZqs97.alpha.polling] Accelerated scanning now available
2024-03-06 10:38:27.999524 [T] [sync.sync_cFcZqs97.alpha.polling] Observed change at ".devstore_latency_test"
2024-03-06 10:38:27.999583 [D] [sync.sync_cFcZqs97.alpha.polling] Modifications detected
2024-03-06 10:38:28.019913 [D] [sync.sync_cFcZqs97] Triggered by alpha endpoint
2024-03-06 10:38:28.019969 [D] [sync.sync_cFcZqs97] Scanning endpoints
2024-03-06 10:38:28.019993 [D] [sync.sync_cFcZqs97.beta] Performing full scan
2024-03-06 10:38:28.020008 [D] [sync.sync_cFcZqs97.alpha] Performing accelerated scan with existing snapshot
2024-03-06 10:38:29.240392 [D] [sync.sync_cFcZqs97.alpha.polling] Received timer-based polling signal
2024-03-06 10:38:29.240449 [D] [sync.sync_cFcZqs97.alpha.polling] Performing filesystem scan
2024-03-06 10:38:29.424862 [D] [sync.sync_cFcZqs97.alpha.polling] Accelerated scanning now available
2024-03-06 10:38:29.445546 [D] [sync.sync_cFcZqs97.alpha.polling] No unignored modifications detected
2024-03-06 10:38:39.247096 [D] [sync.sync_cFcZqs97.alpha.polling] Received timer-based polling signal
2024-03-06 10:38:39.247135 [D] [sync.sync_cFcZqs97.alpha.polling] Performing filesystem scan
2024-03-06 10:38:39.442994 [D] [sync.sync_cFcZqs97.alpha.polling] Accelerated scanning now available
2024-03-06 10:38:39.464419 [D] [sync.sync_cFcZqs97.alpha.polling] No unignored modifications detected
2024-03-06 10:38:46.856329 [T] [sync.sync_cFcZqs97] Ancestor contains 35026 entries, alpha contains 35026 entries, beta contains 35026 entries
2024-03-06 10:38:46.869611 [D] [sync.sync_cFcZqs97] Performing reconciliation
2024-03-06 10:38:46.895228 [T] [sync.sync_cFcZqs97] Beta transition at ".devstore_latency_test" from File (Digest 6cbed10efd553cb01d9da65e81dd660388941b92) to File (Digest 7e64c47827b7b2ba9b3d5b236d488cdeb6f86c62)
2024-03-06 10:38:46.895255 [D] [sync.sync_cFcZqs97] Staging 1 file(s) on beta
2024-03-06 10:38:46.908938 [D] [sync.sync_cFcZqs97] Transitioning beta
2024-03-06 10:38:46.931516 [D] [sync.sync_cFcZqs97] Saving ancestor
#0 received
10:38:47,816 __main__:952 7 #0 received
#0 stats strategy=mutagen mutagen_sync_to_backing_fs_latency=20.04403499502223

xenoscopic commented 7 months ago

Hi @dbentley, thanks for the logs and information, sorry for the slow reply.

I suspect that the issue is due to the full filesystem rescan required on the Lustre side when specifying --watch-mode-beta=no-watch. If watching is enabled for an endpoint (even the poll-based watching that would be used on the beta endpoint in this case), then Mutagen will be able to return the result of a rescan much more quickly. For recursive watching (what's used on alpha), this rescan will be done by checking only dirty paths (the so-called "accelerated" scanning), which is fast. For non-recursive watching (what would normally be used on beta), this rescan will just be skipped and the last snapshot generated by scanning will be returned (with JIT checks to make sure it's not stale when applying changes).

There are really only two options in this case:

Re-enable watching on the beta endpoint. This will incur additional CPU and filesystem overhead, and may still see the occasional blocking when the background filesystem polling blocks a scan request, but the rescans should be much quicker.
Try to enable fanotify (recursive watching) support. This is quite complex and will require patching Mutagen because it's highly dependent on the environment in which it's being used, but the code is available to do this in the official releases.

xenoscopic commented 7 months ago

One thing we could think about doing is still caching the scan results from the last scan, even if watching is disabled, so long as --scan-mode=full isn't specified. Let me look in to what would be involved with that, it might be a quick win.

dbentley commented 6 months ago

"if watching is enabled for an endpoint"

But I said watch-mode=no-watch ? Is there a way to disable watching?

(Thanks so much)

xenoscopic commented 6 months ago

But I said watch-mode=no-watch ? Is there a way to disable watching?

You've got it right, --watch-mode=no-watch (or the equivalent YAML specification) is the correct way to disable watching. One of the consequences of disabling watching on an endpoint is that it will return scans more slowly. In Mutagen, the purpose of the watching is two-fold: first, it triggers synchronization cycles, and second, it allows for faster rescans of the filesystem. So in this case, disabling it is saving background CPU/disk overhead, but it's also forcing (nearly) full rescans of the filesystem on each synchronization cycle.

(Apologies if I'm not grokking what you're asking, but please don't hesitate to follow-up if I'm totally missing the question)

dbentley-ant commented 6 months ago

In a one-way sync, why do I need to scan beta at all? Like, I'd be fine with just copying the delta straight over.

and as I type that, I realize that's probably what you meant by caching scan results. then yes, that would be great. I'd even be fine with another scan-mode-beta="initial_only" or something if the explicit new option made it easier. Thanks so much.

xenoscopic commented 6 months ago

The reason for the scan (even in one-way) is so that Mutagen knows what changes it needs to propagate (i.e. to compute the delta), since it doesn't know if anything has changed while it wasn't watching.

dbentley commented 6 months ago

Yeah, thanks. I would happily bring in a huge stack of bibles and swear on them that it will not change (or if it does, that I will hold mutagen blameless).

If there's a change you think is easy to make, I'm happy to take a crack at it if you could outline the change? (also happy to pair or videochat to get a quick intro and then try to make the change)

xenoscopic commented 6 months ago

Hey, sorry for the slow reply (I've been traveling for work the last two weeks). I'd have to dig in a little bit and see how easy this change would be to make. It should be possible to avoid the rescans pretty easily, but there may be some logic that needs to be added to the endpoint constructor to at least perform one initial scan when watching is disabled. I've got it on my list to have a look at next week.

dbentley-ant commented 6 months ago

No worries; lmk if a pairing session would help.

mutagen-io / mutagen

High rescan time with watching disabled on endpoint #492