pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

`fclones dedupe` results in updates to mtimes on directories #93

Closed ivan closed 2 years ago

ivan commented 2 years ago

Running `fclones dedupe` on some directory tree results in mtimes being updated on directories containing files that were deduplicated.

I don't know whether this should be addressed: while the mtimes may be desirable to preserve, the directories really were updated through file creation. Still, this effect makes me less likely to use `fclones dedupe` on old directory trees whose mtimes may carry useful information.

th1000s commented 2 years ago

Given that the dedupe action is almost transparent to userspace, I agree that directory mtimes should not be affected.

Where is the right place to collect the timestamps beforehand? If placed inside reflink() it would stat the same directories multiple times, and threading the `io::Result<..>` through would probably be annoying. Though maybe a concurrent hashmap could be passed along to skip the repeated stats.

Where could it be done earlier? The iterator passed to run_script() is lazy so the stat'ing has to be woven somewhere into it - and it must reduce the result to unique directories. Not sure if this "dual flow" of files and directories is really possible.

Then the space-efficient fclones::Path has to be converted to a PathBuf to do the stat - should the conversion be kept around, or repeated later when restoring the timestamps? Also, none of the other operations need this save-restore step at all, so skipping it entirely should stay cheap.

pkolaczk commented 2 years ago

> If placed inside reflink() it would stat directories multiple times

I implemented the simplest solution for now - I placed the restoring inside reflink. I believe performance is not a big problem in this case: stat is far faster than the actual deduplication, and repeated stat calls on the same file are served from the OS cache.