jtapolczai / FileSync

Syncrhonising file trees
Apache License 2.0
1 stars 0 forks source link

Avoid copying if a file can be moved. #25

Open jtapolczai opened 8 years ago

jtapolczai commented 8 years ago

We should store the hashes (or at least the names) of each file.

If a file should exist at place X in tree R but already exists in place Y (X != Y), then we should avoid copying it from another tree and just move it.

In practice, a fingerprint F(X) should be able to be computed cheaply and F(X) = F(Y) should strongly imply (or guarantee) that X = Y.

We'd have to store the hashes in a map F(X1) -> {X1,...,Xn} s.t. F(X1) = ... = F(Xn). Then, when we encounter a copy action for some Fi, we instead move a Fj (i != j).

jtapolczai commented 8 years ago

Rabin's fingerprinting algorithm looks good: https://en.wikipedia.org/wiki/Rabin_fingerprint

Since speed is the key consideration, we don't have to guarantee that (F(X) = F(Y)) <=> X = Y.

The two classes or errors:

  1. F(X) != F(Y) but X = Y. In this case, we just take the performance hit of copying X from another location.
  2. F(X) = F(Y) but X != Y. This is more problematic. We'd have to perform a second (also cheap) check to exclude the possibility of X!=Y.