mutagen-io / mutagen

Fast file synchronization and network forwarding for remote development
https://mutagen.io

Resolving conflicts based on timestamps #149

Open zyuzka opened 5 years ago

zyuzka commented 5 years ago

Hello!

Question: Is it possible to do two-way sync with priority based on timestamps?

In my case, I have a Docker container running PHP. All generation operations are performed in this container and must be synchronized to the host (Beta wins over Alpha). But all code changes are made on the host (Alpha wins over Beta).

two-way-safe doesn't suit us: all conflicts should be resolved automatically. two-way-resolved doesn't suit us either: Alpha shouldn't always win.

lukasluecke commented 5 years ago

You can just create two syncs with different folders (or different ignore patterns, if these files are in the same folders), or is that not working for your setup?
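
To make the split concrete, here is a minimal sketch of how paths could be divided between two sessions so that no path is handled by both. The session names and the patterns are hypothetical, and the matching uses Python's `fnmatch` rather than Mutagen's own ignore syntax, so treat it as an illustration of the idea and not as Mutagen configuration.

```python
from fnmatch import fnmatch

# Hypothetical patterns for files generated inside the container.
GENERATED_PATTERNS = ["generated/*", "var/cache/*"]


def owning_session(path, generated_patterns=GENERATED_PATTERNS):
    """Return the name of the (hypothetical) session responsible for a path.

    Generated files belong to the session where the container side wins;
    everything else belongs to the session where the host side wins.
    """
    if any(fnmatch(path, pattern) for pattern in generated_patterns):
        return "container-to-host"
    return "host-to-container"


for example in ["src/Controller.php", "generated/Proxy.php"]:
    print(example, "->", owning_session(example))
```

Because each path has exactly one owner, the two sessions never compete over the same file, and each one can use a mode with a fixed winner.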

xenoscopic commented 5 years ago

Hey @zyuzka! There's not currently a mode to resolve based on timestamp. I suppose such a thing could be implemented, though it would take quite a bit of work, and I'm not sure it would help in most cases due to clock skew between systems (not to mention the complexity of dealing with timezones and differing timestamp behaviors).

Can you tell me a bit more about the nature of the conflicts that are arising in your case with two-way-safe? For example, is there some operation you perform (like a git checkout) that's causing a lot of conflicts to occur? Mutagen's content-based change tracking should generally work fine for doing what you want (i.e. changing code on your host and having generated files propagated back to the host), but there may be scenarios where @lukasluecke's suggestion is the right way to go (e.g. if changes from the container are isolated to a specific directory).

xenoscopic commented 4 years ago

I've been thinking about this a lot and I think I'm going to avoid implementing this for the foreseeable future. There are a number of issues that make it both difficult to implement and less useful in practice.

The first issue is clock drift between systems. Admittedly, with NTP, the deltas between systems should be pretty small, but for files that are updated in an automated fashion on both endpoints (which tend to be the cause of conflicts in bidirectional synchronization), there's a very realistic possibility that conflicts could arise with a timestamp delta that's less than the time delta between the two systems. There may be heuristics that could be used to account for this (e.g. limiting conflict resolution to cases where the timestamp delta is larger than the combined time uncertainty on the systems), but I think this will be both complicated to implement and so complex to use that it won't be much help in practice.
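
The heuristic mentioned above (only resolving when the timestamp delta exceeds the combined time uncertainty) could look roughly like the sketch below. This is not something Mutagen implements; the `Endpoint` type, its fields, and the idea that an uncertainty bound is available for each clock are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Endpoint:
    name: str
    mtime: float        # modification time in seconds (assumed comparable across systems)
    uncertainty: float  # assumed upper bound on this system's clock error, in seconds


def pick_winner(alpha: Endpoint, beta: Endpoint) -> Optional[str]:
    """Return the name of the newer endpoint, or None if the clocks can't tell."""
    margin = alpha.uncertainty + beta.uncertainty
    delta = alpha.mtime - beta.mtime
    if abs(delta) <= margin:
        # The difference is within the combined clock error, so the ordering
        # is unknowable and this must remain a conflict.
        return None
    return alpha.name if delta > 0 else beta.name


print(pick_winner(Endpoint("alpha", 100.0, 0.5), Endpoint("beta", 100.3, 0.5)))
print(pick_winner(Endpoint("alpha", 100.0, 0.05), Endpoint("beta", 102.0, 0.05)))
```

The first call leaves the conflict unresolved, which is exactly the case described above: automated writers on both sides can easily produce changes closer together than the clocks can distinguish.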

The second issue is that adding timestamps to filesystem scans will significantly increase their size, since it would almost certainly require 8 bytes per entry. This would probably yield a 30-50% increase in sizes, which would be an issue for the initial scan performed in each synchronization cycle (since it isn't deltafied like later scans) and would bloat memory usage significantly.
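
Reading that estimate backwards gives a feel for the numbers: if 8 extra bytes per entry amounts to a 30-50% increase, the current per-entry size would be roughly 16 to 27 bytes. That range is inferred from the figures in the comment, not a documented property of Mutagen's scan format, and the entry count below is a made-up example.

```python
TIMESTAMP_BYTES = 8  # per-entry cost quoted in the comment above


def implied_entry_size(relative_increase: float) -> float:
    """Per-entry size (bytes) for which 8 extra bytes is the given relative increase."""
    return TIMESTAMP_BYTES / relative_increase


def extra_bytes(entries: int) -> int:
    """Additional bytes needed to carry one timestamp per entry."""
    return entries * TIMESTAMP_BYTES


for increase in (0.3, 0.5):
    print(f"{increase:.0%} increase implies ~{implied_entry_size(increase):.1f} bytes per entry")

# Hypothetical tree with one million entries.
print(extra_bytes(1_000_000), "extra bytes in the initial scan")
```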

The third issue is purely one of implementation. This is really complex to implement in a way that's robust, primarily because of time zone handling and how it differs between systems (or even different configurations of the same system).

In the end, I think the cost of implementing this would far outweigh the value. It makes more sense in one-way tools like rsync, because rsync is actually setting the timestamps on the remote by just copying them over (meaning that clock drift and associated issues aren't a problem). In the case of Mutagen, I think the conflict resolution should really be determined by workflow rather than timestamp, so I'd still be interested to hear more about the workflow in this case.

xenoscopic commented 4 years ago

Just for posterity, I'd like to document another issue with this resolution that came up in a separate discussion:

One problem with this is that timestamp-based resolution isn't so well-defined that it can solve all conflicts (or even most). For example, assume that two systems have hyper-accurate and perfectly synchronized clocks and that the filesystem timestamps on those systems reflect those clocks. Now imagine that there's an identical directory tree on both sides with timestamp X for the root directory, and imagine that one side replaces it with a file (with timestamp X+1) and the other side edits a file deep underneath that directory tree which ends up with timestamp X+2. It's unclear in this case which should win. The file on one side is newer than the directory it replaces, but older than the directory's content on the other side. In this case, the timestamp-based resolution would need to raise a conflict, arbitrarily choose a winner, or use some sort of other well-defined behavior to resolve the situation. Raising a conflict and arbitrarily choosing a winner would put this on roughly the same footing as two-way-safe and two-way-resolved, respectively, although without the safety of their content-based tracking. Fully enumerating all other possible cases is probably doable, but doing so in a way that the heuristics match a majority of users' intuitions is almost certainly not possible (and we avoid trying to create such a hypothetical heuristic for content-based tracking for exactly the same reason).
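
The scenario above can be modeled in a few lines to show the ambiguity. The two rules compared here ("compare the root entries" and "compare the newest entry anywhere in the tree") are hypothetical candidates, not behaviors of Mutagen, and the `Node` type and the value of `X` are invented for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    mtime: int
    children: List["Node"] = field(default_factory=list)


def newest_mtime(node: Node) -> int:
    """Newest timestamp found anywhere in the tree rooted at node."""
    return max([node.mtime] + [newest_mtime(child) for child in node.children])


def winner_by_root(alpha: Node, beta: Node) -> str:
    # Ties go to beta arbitrarily in this sketch.
    return "alpha" if alpha.mtime > beta.mtime else "beta"


def winner_by_newest_content(alpha: Node, beta: Node) -> str:
    # Ties go to beta arbitrarily in this sketch.
    return "alpha" if newest_mtime(alpha) > newest_mtime(beta) else "beta"


X = 1000  # arbitrary baseline timestamp

# Alpha replaced the directory with a file at X+1.
alpha = Node("root", X + 1)
# Beta kept the directory (root still at X) and edited a deep file at X+2.
beta = Node("root", X, [Node("sub", X, [Node("deep.txt", X + 2)])])

print("by root entry:     ", winner_by_root(alpha, beta))
print("by newest content: ", winner_by_newest_content(alpha, beta))
```

Both rules are defensible, yet they pick opposite winners even with perfect clocks, which is the point made above about timestamps not defining a single intuitive resolution.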

xenoscopic commented 4 years ago

I'm going to re-open this for the time being for additional discussion. I still have my reservations about the idea (and I think the concerns listed above are still valid), but there may be useful behavior here for cases where Mutagen is taken offline for prolonged periods of time and its content-based merges need additional help in determining which endpoint should win a conflict. For more discussion about these cases, please see this thread.