Open oniony opened 9 years ago
I've implemented the algorithm for generating the fingerprints without any problem. However, using the fingerprints in a repair is a bit more problematic: identifying where a directory has moved to would require fingerprinting every directory in the specified search paths, which could be very expensive.
You might think this would be a preexisting problem when repairing files, but TMSU is able to build a shortlist of candidates by only considering files of identical size, since a file cannot be identical to another if its size is different. The size check (stat) is a relatively cheap operation compared to calculating a fingerprint.
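To illustrate the size shortcut, here is a rough sketch in Python (not TMSU's actual code). The function names and the choice of SHA-256 are assumptions for the example only; the point is that the hash is only computed for files that pass the cheap size check.

```python
import hashlib
import os


def sha256_of(path):
    """Hash a file's contents in chunks (the expensive step)."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_moved_file(expected_size, expected_hash, search_root):
    """Return paths under search_root whose size and hash both match.

    Hypothetical helper: only files that pass the stat-based size check
    are ever read and hashed.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(search_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_size != expected_size:
                    continue  # cheap rejection: the file is never read
                if sha256_of(path) == expected_hash:
                    matches.append(path)
            except OSError:
                continue  # unreadable or vanished file: skip it
    return matches
```

A file of a different size is rejected after a single stat call, so the cost of a repair is dominated by the handful of same-size candidates.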
An analogue of the file size shortcut might be to consider the number of items within the directory. This might be cheap if the directory is small but could potentially be more expensive than the fingerprint calculation if the directory has millions of items. I might have to cap the number of items, just as the directory fingerprint algorithm stops if the directory has too many items.
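A capped count could look something like the sketch below. The function name and the default cap of 10000 are made up for illustration; the enumeration stops as soon as the cap is exceeded, so a huge directory costs no more than cap + 1 entries.

```python
import itertools
import os


def capped_entry_count(directory, cap=10000):
    """Count the entries directly inside directory, up to a cap.

    Returns the count, or None if the directory holds more than cap
    entries (hypothetical convention meaning "too large to compare").
    """
    with os.scandir(directory) as entries:
        # islice stops the enumeration after cap + 1 entries at most.
        seen = sum(1 for _ in itertools.islice(entries, cap + 1))
    return None if seen > cap else seen
```

Directories that return None would simply be excluded from the shortlist, in the same way that oversized directories are excluded from fingerprinting.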
If the idea is just to notice when a directory moves somewhere, perhaps what you could do is to add a file .tmsuid in that directory containing a unique id + device and inode number of that file. This will be the directory identifier.
When the directory is moved somewhere else, the file stays with its inode and device number untouched (if on same filesystem). This can be detected.
When the directory is copied, it is also possible to detect it by noticing that the .tmsuid file has changed device and inode number, and is therefore a copy.
If you don't care to detect copies vs renames, you don't need to keep track of device and inode number.
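A rough sketch of the marker idea, in Python for brevity. The JSON layout and the function names are assumptions; only the .tmsuid file name comes from the suggestion above. The marker records its own device and inode number, so a later check can tell a moved directory (same numbers) from a copy (different numbers).

```python
import json
import os
import uuid

MARKER = ".tmsuid"  # file name from the suggestion; the format is assumed


def write_marker(directory):
    """Create the marker and record its own device and inode number."""
    path = os.path.join(directory, MARKER)
    with open(path, "w") as handle:
        info = os.fstat(handle.fileno())
        record = {"id": uuid.uuid4().hex, "dev": info.st_dev, "ino": info.st_ino}
        json.dump(record, handle)
    return record


def classify(directory):
    """Return (id, "original") or (id, "copy"), or None without a marker."""
    path = os.path.join(directory, MARKER)
    try:
        with open(path) as handle:
            record = json.load(handle)
            info = os.fstat(handle.fileno())
    except (OSError, ValueError):
        return None  # no marker, or its contents are not valid JSON
    same = (record["dev"], record["ino"]) == (info.st_dev, info.st_ino)
    return (record["id"], "original" if same else "copy")
```

A rename within the same filesystem leaves the marker's inode untouched, so classify reports "original"; a copy gets a fresh inode and is reported as "copy". A move across filesystems would also look like a copy.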
@mildred Yes, that is one possible solution; however, it is not very user-friendly: one would have to remember to add this metadata to each directory ahead of time. I would prefer to come up with a solution that would transparently detect directory moves/renames.
I think the best solution (as in most transparent and requiring no up-front participation from the user) would be to shortlist candidate directories based upon the number of directory entries or their aggregate size. This should be relatively cheap to calculate as it would only require a (perhaps recursive) directory enumeration.
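As a sketch of that shortlist (Python, hypothetical function names, not TMSU code): each directory is summarised by its recursive entry count and aggregate file size using only enumeration and stat calls, and only directories whose summary matches the recorded one remain candidates.

```python
import os


def directory_summary(directory):
    """Return (entry_count, total_bytes) for everything below directory."""
    count = 0
    total = 0
    for dirpath, dirnames, filenames in os.walk(directory):
        count += len(dirnames) + len(filenames)
        for name in filenames:
            try:
                total += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # entry vanished or is unreadable: leave it out
    return count, total


def shortlist(expected, search_root):
    """List directories under search_root whose summary equals expected."""
    candidates = []
    for dirpath, dirnames, _filenames in os.walk(search_root):
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if directory_summary(path) == expected:
                candidates.append(path)
    return candidates
```

This naive version re-walks nested directories once per ancestor; a real implementation would presumably compute the summaries bottom-up in a single pass.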
Well, I was suggesting that tmsu would create this file. Perhaps this might be a little bit invasive, but it would record directory identity better than the list of its files (which can change).
Or perhaps, just record directory inode number as a hint.
I wouldn't want to use inodes as not every type of filesystem uses inodes.
With respect to the list of a directory's files changing: I would consider this no different than the contents of a file changing after the fingerprint has been calculated, i.e. it could be repaired in the same way using the repair subcommand.
TMSU used to store directory fingerprints by recursively enumerating the directory, calculating fingerprints for every file within it, building a report of these fingerprints and then creating a further fingerprint of this report. For large directories this was particularly slow, and it was not very intuitive to the user when a tagging operation was suddenly and inexplicably slow.
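For readers unfamiliar with that scheme, it can be approximated as below. This is a reconstruction from the description, not the original code: the report format (hash plus relative path per line), the sorting, and the use of SHA-256 are all assumptions.

```python
import hashlib
import os


def file_fingerprint(path):
    """Hash one file's contents in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def directory_fingerprint(directory):
    """Hash a report made of the fingerprints of every file below directory."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(directory):
        dirnames.sort()  # fix the traversal order so the report is stable
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            relative = os.path.relpath(path, directory)
            lines.append(f"{file_fingerprint(path)}  {relative}")
    report = "\n".join(lines)
    data = report.encode("utf-8", errors="surrogateescape")
    return hashlib.sha256(data).hexdigest()
```

Every byte of every file has to be read, which is why tagging a large directory this way could stall for a long time.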
However, I've since realised it might be sufficient to build a directory fingerprint based only on the sizes of the files within it. This should be pretty fast and provide a largely useful fingerprint for identifying duplicate or moved directories.
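A minimal sketch of a size-only fingerprint, assuming the sizes are sorted and hashed with SHA-256 (both choices, and the function name, are mine for illustration). No file contents are read, only stat information.

```python
import hashlib
import os


def size_fingerprint(directory):
    """Fingerprint a directory from the sizes of the files below it."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            try:
                sizes.append(os.lstat(os.path.join(dirpath, name)).st_size)
            except OSError:
                continue  # entry vanished or is unreadable: skip it
    sizes.sort()  # make the result independent of enumeration order
    report = ",".join(str(size) for size in sizes)
    return hashlib.sha256(report.encode("ascii")).hexdigest()
```

Because names are ignored in this variant, the fingerprint survives renames inside the directory; the trade-off is that two directories whose files happen to have the same sizes but different contents will collide.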