pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

`fclones` re-scans hard links #177

Closed - aseering closed this 1 year ago

aseering commented 1 year ago

I have a backup drive that stores backups created using rsync, where each backup is a full copy of the directory tree but with each unmodified file hard-linked to the previous backup. This means that most files in the filesystem are hard links. (The system is Linux/XFS.)

After grouping by paths and size, fclones seems to think that it has over 200 TB of data to read. This takes a very long time and eventually runs out of memory (with 64 GB of RAM in the system).

The actual disk storage of the backups is only roughly 10 TB. I assume what's happening is that fclones doesn't realize that hard links point to the same file data, in which case it's trying to scan the contents of every link to each file, rather than scanning each file once and assuming (correctly) that every link to that file must have the same contents.

Does this assessment sound plausible? If so, is there a reason that fclones works this way, or would it be feasible to adopt this sort of optimization?
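
For reference, the optimization described above is usually a matter of keying files on their (device, inode) pair before any content is read, since hard links to the same inode share their data. Below is a minimal Rust sketch of that idea; it is not fclones' actual code, and the paths are placeholders.

```rust
// Minimal sketch of collapsing hard links before content hashing:
// paths that share the same (device, inode) pair point to the same
// data, so only one representative per inode needs to be read.
use std::collections::HashMap;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::PathBuf;

/// Group candidate paths by (device, inode). Each group is one physical file;
/// hashing a single representative from each group covers all of its hard links.
fn group_by_inode(paths: &[PathBuf]) -> io::Result<HashMap<(u64, u64), Vec<PathBuf>>> {
    let mut groups: HashMap<(u64, u64), Vec<PathBuf>> = HashMap::new();
    for path in paths {
        let meta = fs::symlink_metadata(path)?; // don't follow symlinks
        groups
            .entry((meta.dev(), meta.ino()))
            .or_default()
            .push(path.clone());
    }
    Ok(groups)
}

fn main() -> io::Result<()> {
    // Placeholder paths: in an rsync-style backup these would typically
    // be hard links to the same inode across backup snapshots.
    let paths = vec![PathBuf::from("backup-1/file"), PathBuf::from("backup-2/file")];
    for ((dev, ino), links) in group_by_inode(&paths)? {
        println!("dev={dev} ino={ino}: {} path(s), hash content once", links.len());
    }
    Ok(())
}
```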

pkolaczk commented 1 year ago

That doesn't seem right if you're using a recent version of fclones. Scanning hard links multiple times has already been fixed, so if it still happens, that would be a bug.

See #142

As for running out of memory - how many files are you processing? The paths and checksums are kept in memory, so if there are millions of files, 64 GB won't be enough.
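
As a rough illustration of why millions of entries add up, here is a back-of-envelope estimate; the per-entry sizes are assumptions for illustration only, not fclones' actual data layout.

```rust
// Back-of-envelope memory estimate for keeping paths + checksums in RAM.
// All per-entry sizes below are assumptions, not fclones internals.
fn main() {
    let files: u64 = 50_000_000;    // e.g. tens of millions of hard-linked paths
    let avg_path_bytes: u64 = 120;  // assumed average path length
    let checksum_bytes: u64 = 16;   // assumed 128-bit hash per entry
    let overhead_bytes: u64 = 64;   // assumed per-entry bookkeeping
    let total = files * (avg_path_bytes + checksum_bytes + overhead_bytes);
    println!("~{} GB", total / 1_000_000_000); // ~10 GB for 50M entries
}
```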