Open dbentley opened 7 months ago
Hi @dbentley, thanks for the logs and information, sorry for the slow reply.
I suspect that the issue is due to the full filesystem rescan required on the Lustre side when specifying --watch-mode-beta=no-watch. If watching is enabled for an endpoint (even the poll-based watching that would be used on the beta endpoint in this case), then Mutagen will be able to return the result of a rescan much more quickly. For recursive watching (what's used on alpha), this rescan will be done by checking only dirty paths (the so-called "accelerated" scanning), which is fast. For non-recursive watching (what would normally be used on beta), this rescan will just be skipped and the last snapshot generated by scanning will be returned (with JIT checks to make sure it's not stale when applying changes).
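To make the difference between these rescan paths concrete, here is a minimal Python sketch. It is not Mutagen's implementation (Mutagen is written in Go): the names `Snapshot`, `full_scan`, and `accelerated_scan` are hypothetical, and a snapshot is simplified to a mapping from relative path to modification time.

```python
import os
from typing import Dict, Iterable

# Hypothetical simplification: a snapshot maps relative path -> mtime (ns).
Snapshot = Dict[str, int]


def full_scan(root: str) -> Snapshot:
    """Walk the entire tree. This is the cost paid on every cycle with no watching."""
    snapshot: Snapshot = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            snapshot[os.path.relpath(path, root)] = os.stat(path).st_mtime_ns
    return snapshot


def accelerated_scan(root: str, previous: Snapshot, dirty: Iterable[str]) -> Snapshot:
    """Re-check only the paths a recursive watcher reported as dirty."""
    snapshot = dict(previous)
    for rel in dirty:
        path = os.path.join(root, rel)
        if os.path.isfile(path):
            snapshot[rel] = os.stat(path).st_mtime_ns
        else:
            snapshot.pop(rel, None)  # path was deleted since the last scan
    return snapshot
```

In this simplified model, the non-recursive case corresponds to returning `previous` unchanged and relying on checks at apply time. The point is only that `accelerated_scan` touches a handful of paths, while `full_scan` touches every file, which is expensive on a high-latency filesystem such as Lustre.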
There are really only two options in this case:
1. Enable watching on the beta endpoint. This will incur additional CPU and filesystem overhead, and you may still see occasional blocking when the background filesystem polling blocks a scan request, but the rescans should be much quicker.
2. Use fanotify (recursive watching) support. This is quite complex and will require patching Mutagen because it's highly dependent on the environment in which it's being used, but the code to do this is available in the official releases.
One thing we could think about doing is still caching the scan results from the last scan, even if watching is disabled, so long as --scan-mode=full isn't specified. Let me look into what would be involved with that; it might be a quick win.
"if watching is enabled for an endpoint"
But I said watch-mode=no-watch? Is there a way to disable watching?
(Thanks so much)
"But I said watch-mode=no-watch? Is there a way to disable watching?"
You've got it right, --watch-mode=no-watch (or the equivalent YAML specification) is the correct way to disable watching. One of the consequences of disabling watching on an endpoint is that it will return scans more slowly. In Mutagen, the purpose of the watching is two-fold: first, it triggers synchronization cycles, and second, it allows for faster rescans of the filesystem. So in this case, disabling it is saving background CPU/disk overhead, but it's also forcing (nearly) full rescans of the filesystem on each synchronization cycle.
(Apologies if I'm not grokking what you're asking, but please don't hesitate to follow up if I'm totally missing the question.)
In a one-way sync, why do I need to scan beta at all? Like, I'd be fine with just copying the delta straight over.
And as I type that, I realize that's probably what you meant by caching scan results. Then yes, that would be great. I'd even be fine with another scan-mode-beta="initial_only" or something if the explicit new option made it easier. Thanks so much.
The reason for the scan (even in one-way) is so that Mutagen knows what changes it needs to propagate (i.e. to compute the delta), since it doesn't know if anything has changed while it wasn't watching.
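As a rough illustration of why the beta scan matters even in one-way mode, the sketch below computes what to propagate by comparing an alpha snapshot against a beta snapshot. The function name `one_way_delta` and the path-to-digest snapshot shape are assumptions made for illustration; this is not Mutagen's actual reconciliation algorithm.

```python
from typing import Dict, List, Tuple

# Hypothetical simplification: a snapshot maps path -> content digest.
Snapshot = Dict[str, str]


def one_way_delta(alpha: Snapshot, beta: Snapshot) -> Tuple[List[str], List[str]]:
    """Return (paths to copy from alpha to beta, paths present only on beta)."""
    to_copy = sorted(p for p, digest in alpha.items() if beta.get(p) != digest)
    extra_on_beta = sorted(p for p in beta if p not in alpha)
    return to_copy, extra_on_beta


alpha = {"a.py": "v2", "b.py": "v1"}
stale_beta = {"a.py": "v1", "b.py": "v1"}   # what the last scan recorded
actual_beta = {"a.py": "v1"}                # b.py was removed on beta since then

print(one_way_delta(alpha, stale_beta))
print(one_way_delta(alpha, actual_beta))
```

With the stale snapshot, only `a.py` is scheduled for copying and the missing `b.py` is never restored. With an up-to-date snapshot, both files are. If beta truly never changes outside of synchronization, the stale snapshot stays correct, which is what makes caching it attractive.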
Yeah, thanks. I would happily bring in a huge stack of bibles and swear on them that it will not change (or if it does, that I will hold mutagen blameless).
If there's a change you think is easy to make, I'm happy to take a crack at it if you could outline the change? (also happy to pair or videochat to get a quick intro and then try to make the change)
Hey, sorry for the slow reply (I've been traveling for work the last two weeks). I'd have to dig in a little bit and see how easy this change would be to make. It should be possible to avoid the rescans pretty easily, but there may be some logic that needs to be added to the endpoint constructor to at least perform one initial scan when watching is disabled. I've got it on my list to have a look at next week.
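For discussion purposes, the idea of one initial scan followed by cached results might look roughly like the Python sketch below. Everything here is hypothetical (the class `NoWatchEndpoint`, its methods, and the `force_full` flag standing in for a full scan mode); it models the behavior being proposed, not Mutagen's endpoint code.

```python
from typing import Callable, Dict, Optional

# Hypothetical simplification: a snapshot maps path -> mtime.
Snapshot = Dict[str, int]


class NoWatchEndpoint:
    """Endpoint with watching disabled that scans once and then serves a cache."""

    def __init__(self, scan: Callable[[], Snapshot], force_full: bool = False):
        self._scan = scan
        self._force_full = force_full
        # No watcher will ever populate the cache, so scan once up front.
        self._cached: Snapshot = scan()

    def snapshot(self) -> Snapshot:
        """Return the cached snapshot unless a full rescan is explicitly required."""
        if self._force_full:
            self._cached = self._scan()
        return self._cached

    def apply(self, changes: Dict[str, Optional[int]]) -> None:
        """Record changes written by synchronization; None marks a deletion."""
        for path, mtime in changes.items():
            if mtime is None:
                self._cached.pop(path, None)
            else:
                self._cached[path] = mtime
```

The cache only stays accurate if nothing but the synchronizer modifies the endpoint, so a real implementation would still need apply-time checks to detect stale entries.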
No worries; lmk if a pairing session would help.
We're using mutagen to hide underlying-FS latency issues for our cloud development servers. But it seems like mutagen has a weird 10+-second latency to do a sync?
More context:
I ran MUTAGEN_LOG_LEVEL=trace mutagen daemon run to measure timing info (along with log statements from the Python process that writes and reads the files). Any idea what's happening in the case below? Or how to reduce the latency? The line at 10:38:46.856329 is interesting; what caused that to happen then? Was mutagen waiting for the scan on beta? Thanks!
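When reading trace output like this, a small helper can point out where the time goes. The Python sketch below assumes only that each relevant line contains a timestamp in the HH:MM:SS.ffffff form quoted above; the rest of the log format is not assumed, and `find_gaps` is a hypothetical name.

```python
import re
from datetime import datetime
from typing import Iterable, List, Tuple

# Matches timestamps such as 10:38:46.856329 anywhere in a line.
TIMESTAMP = re.compile(r"\d{2}:\d{2}:\d{2}\.\d{6}")


def find_gaps(lines: Iterable[str], threshold_seconds: float = 1.0) -> List[Tuple[float, str]]:
    """Return (gap in seconds, line) for each line that follows a long pause.

    Lines without a timestamp are skipped. Logs that cross midnight are not handled.
    """
    gaps: List[Tuple[float, str]] = []
    previous = None
    for line in lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue
        current = datetime.strptime(match.group(), "%H:%M:%S.%f")
        if previous is not None:
            delta = (current - previous).total_seconds()
            if delta >= threshold_seconds:
                gaps.append((delta, line.rstrip()))
        previous = current
    return gaps
```

Feeding it the combined daemon and application log (for example `find_gaps(open("daemon.log"), 5.0)`, where the file name is a placeholder) lists the lines that arrived after a pause of at least five seconds.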