pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

--unique shows non-unique files #148

Closed danlamanna closed 1 year ago

danlamanna commented 1 year ago

Hi - I'm new to using fclones and have found it to be a huge improvement over the other duplicate file finders out there. However, I'm having a hard time making sense of the --unique flag. For example,

mkdir foo
echo a > foo/a
echo b > foo/b
echo c > foo/c
echo c > foo/copy_of_c

Output of fclones group --unique foo:

[2022-08-25 16:03:47.382] fclones:  info: Started grouping
[2022-08-25 16:03:47.384] fclones:  info: Scanned 5 file entries
[2022-08-25 16:03:47.384] fclones:  info: Found 4 (8 B) files matching selection criteria
[2022-08-25 16:03:47.384] fclones:  info: Found 0 (0 B) candidates after grouping by size
[2022-08-25 16:03:47.384] fclones:  info: Found 0 (0 B) candidates after grouping by paths
[2022-08-25 16:03:47.386] fclones:  info: Found 2 (4 B) candidates after grouping by prefix
[2022-08-25 16:03:47.386] fclones:  info: Found 2 (4 B) candidates after grouping by suffix
[2022-08-25 16:03:47.386] fclones:  info: Found 2 (4 B) unique files
# Report by fclones 0.27.0
# Timestamp: 2022-08-25 16:03:47.386 -0400
# Command: fclones group --unique foo
# Base dir: /home/dan
# Total: 8 B (8 B) in 4 files in 3 groups
# Redundant: 0 B (0 B) in 0 files
# Missing: 4 B (4 B) in 2 files
6f973377854c3f70db84707e1de8d1a0, 2 B (2 B) * 1:
    /home/dan/foo/a
57f77e37a6de146f34541732cef23436, 2 B (2 B) * 2:
    /home/dan/foo/c
    /home/dan/foo/copy_of_c
13385bf32d48b5c03331333a6a16c7bd, 2 B (2 B) * 1:
    /home/dan/foo/b  

I'm surprised to be seeing c and copy_of_c at all. The CSV format makes the difference easiest to see because of the file count column:

[2022-08-25 16:04:51.621] fclones:  info: Started grouping
[2022-08-25 16:04:51.622] fclones:  info: Scanned 5 file entries
[2022-08-25 16:04:51.622] fclones:  info: Found 4 (8 B) files matching selection criteria
[2022-08-25 16:04:51.623] fclones:  info: Found 0 (0 B) candidates after grouping by size
[2022-08-25 16:04:51.623] fclones:  info: Found 0 (0 B) candidates after grouping by paths
[2022-08-25 16:04:51.628] fclones:  info: Found 2 (4 B) candidates after grouping by prefix
[2022-08-25 16:04:51.628] fclones:  info: Found 2 (4 B) candidates after grouping by suffix
[2022-08-25 16:04:51.628] fclones:  info: Found 2 (4 B) unique files
size,hash,count,files
2,6f973377854c3f70db84707e1de8d1a0,1,/home/dan/foo/a
2,57f77e37a6de146f34541732cef23436,2,/home/dan/foo/c,/home/dan/foo/copy_of_c
2,13385bf32d48b5c03331333a6a16c7bd,1,/home/dan/foo/b  

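Even so, I still have to filter the output myself to get only the unique files. This is complicated by the CSV output not quoting or escaping commas: the files column spills into a variable number of fields (the row for c has five fields against a four-column header), so typical CLI tools consider it invalid CSV (would you accept a PR quoting the files column?).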
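For reference, here is a minimal sketch of the filter I mean, written in Python. It assumes the column layout shown above (size, hash, count, then one or more paths) and is only a workaround, not anything fclones provides; the function name unique_paths and the SAMPLE text are made up for illustration. Splitting on the first three commas only sidesteps the unquoted files column: when count is 1, whatever remains is a single path, even if that path itself contains commas.

```python
def unique_paths(csv_text):
    """Return the paths of groups whose count column is 1.

    Assumes the header and column order seen in the report above:
    size,hash,count,files
    """
    paths = []
    for line in csv_text.splitlines()[1:]:  # skip the header row
        if not line:
            continue
        # Split on the first three commas only; the remainder is the
        # unquoted files column, which may contain further commas.
        _size, _file_hash, count, files = line.split(",", 3)
        if int(count) == 1:
            # With exactly one file in the group, the remainder is that path.
            paths.append(files)
    return paths


# Sample rows copied from the report above (hypothetical variable name).
SAMPLE = """size,hash,count,files
2,6f973377854c3f70db84707e1de8d1a0,1,/home/dan/foo/a
2,57f77e37a6de146f34541732cef23436,2,/home/dan/foo/c,/home/dan/foo/copy_of_c
2,13385bf32d48b5c03331333a6a16c7bd,1,/home/dan/foo/b
"""

if __name__ == "__main__":
    for path in unique_paths(SAMPLE):
        print(path)
```

This only works for pulling out the single-file groups. Rows with count greater than 1 remain ambiguous whenever a path contains a comma, which is why quoting the files column would still help.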

Is it expected for group --unique to display non-unique files in its output? I had expected it to produce only groups containing exactly one file, the inverse of the normal behavior.

Thanks!

pkolaczk commented 1 year ago

Thank you for reporting! Good catch!