pkolaczk / fclones

Efficient Duplicate File Finder

Why does running a group after a dedupe hash everything again? #193

Closed: patrickwolf closed this issue 1 year ago

patrickwolf commented 1 year ago
Running group on 130 TB takes ~2 days:
fclones group /ex2/ --cache -s 1M -o /ex2/_Data/fclones.json -f json --exclude '/ex2/#snapshot/**' --exclude '/ex2/#recycle/**' 

Running it again takes 15 minutes:
fclones group /ex2/ --cache -s 1M -o /ex2/_Data/fclones.json -f json --exclude '/ex2/#snapshot/**' --exclude '/ex2/#recycle/**' 

Running dedupe takes 10 minutes:
fclones dedupe --path '/ex2/Reviews/**' -o /ex2/_Data/fclones_dd.txt --priority least-recently-modified < /ex2/_Data/fclones.json
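
(Assuming the dedupe command supports a --dry-run flag, the same plan can be previewed first without modifying any files:)
fclones dedupe --dry-run --path '/ex2/Reviews/**' --priority least-recently-modified < /ex2/_Data/fclones.json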

**Running group again takes 1+ day:**
fclones group /ex2/ --cache -s 1M -o /ex2/_Data/fclones.json -f json --exclude '/ex2/#snapshot/**' --exclude '/ex2/#recycle/**' 

Why do files need to be re-hashed after a dedupe? Running dedupe should have reduced the number of files that differ, not increased it, right?

Environment: Synology, BTRFS, 130 TB RAID 5.

th1000s commented 1 year ago

On Linux, fclones did not restore the timestamps of deduped files. Since the cache looks at the mtime (among other metadata), the cache entries for those files were invalidated. This should be fixed by #194.
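
Roughly, the effect can be reproduced by hand (an illustrative sketch with made-up paths, using cp as a stand-in for the actual reflink/dedupe step rather than fclones' own code):

old_mtime=$(stat -c '%Y' /ex2/some-duplicate.bin)              # remember the mtime (epoch seconds) before the replace
cp --reflink=always /ex2/original.bin /ex2/some-duplicate.bin  # replace the duplicate with a reflinked copy; its mtime becomes "now"
touch -d "@$old_mtime" /ex2/some-duplicate.bin                 # put the old mtime back so the cache entry, which depends on the mtime, still matches

Without that last step, the next group run sees a newer mtime on every deduped file and hashes it again.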

patrickwolf commented 1 year ago

Great, thank you @th1000s! I imagine your internal code change is the same as using the -P option on cp, i.e. "cp --reflink=always -P"?

th1000s commented 1 year ago

It is more like --preserve=timestamps or -p, or, more practically, -a (archive mode).
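
For comparison (an illustrative aside; src and dst are placeholder names): -P only controls how symlinks are copied and does not keep timestamps.

cp --reflink=always -P src dst                       # -P = --no-dereference: copy symlinks as symlinks; timestamps are not preserved
cp --reflink=always --preserve=timestamps src dst    # keeps the modification time, which is what the cache looks at
cp --reflink=always -a src dst                       # archive mode, equivalent to -dR --preserve=all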

pkolaczk commented 1 year ago

This should be fixed now in 0.31.0.