pkolaczk / fclones

Efficient Duplicate File Finder
MIT License

Already hardlinked files are read more than once #139

Closed · hans-helmut closed 2 years ago

hans-helmut commented 2 years ago

Hello,

files that are already hardlinked do not need to be read twice or more. The expected result in this test is that 20 GB are read, not 40 GB. Patterns like this are produced by backups made with cp and rsync, where hundreds of hardlinks are not uncommon.

me@pc:~/tmp$ mkdir a ; mkdir b ; dd if=/dev/urandom of=a/1 bs=1M count=10k ; cp a/1 a/2 ; ln a/1 b/1 ; ln a/2 b/2 
10240+0 records in
10240+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 33,2262 s, 323 MB/s
me@pc:~/tmp$ fclones group a b
[2022-06-11 16:31:02.355] fclones:  info: Started grouping
[2022-06-11 16:31:02.360] fclones:  info: Scanned 6 file entries
[2022-06-11 16:31:02.360] fclones:  info: Found 4 (42.9 GB) files matching selection criteria
[2022-06-11 16:31:02.361] fclones:  info: Found 3 (32.2 GB) candidates after grouping by size
[2022-06-11 16:31:02.361] fclones:  info: Found 3 (32.2 GB) candidates after grouping by paths and file identifiers
[2022-06-11 16:31:02.363] fclones:  info: Found 3 (32.2 GB) candidates after grouping by prefix
[2022-06-11 16:31:02.364] fclones:  info: Found 3 (32.2 GB) candidates after grouping by suffix
Grouping by contents        [==========================================>       ]    34.28GB/40.00GB
^C
me@pc:~/tmp$ ls -li a/? b/?
8126615 -rw-r--r-- 2 me me 10737418240 11. Jun 16:29 a/1
8126616 -rw-r--r-- 2 me me 10737418240 11. Jun 16:30 a/2
8126615 -rw-r--r-- 2 me me 10737418240 11. Jun 16:29 b/1
8126616 -rw-r--r-- 2 me me 10737418240 11. Jun 16:30 b/2
me@pc:~/tmp$ 
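The inode numbers in the ls -li output above already show which paths are hardlinks: a/1 and b/1 share inode 8126615, and a/2 and b/2 share inode 8126616, so only two content reads should be necessary. A minimal Rust sketch of that expectation, grouping paths by (device, inode) before any hashing; this is an illustration of the idea, not fclones's actual code:

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // Unix-only: exposes dev() and ino()

/// Group paths by (device, inode); paths in the same group are hardlinks
/// to the same data, so one content read per group suffices.
fn group_by_inode(paths: &[&str]) -> io::Result<HashMap<(u64, u64), Vec<String>>> {
    let mut groups: HashMap<(u64, u64), Vec<String>> = HashMap::new();
    for path in paths {
        let meta = fs::metadata(path)?;
        groups
            .entry((meta.dev(), meta.ino()))
            .or_default()
            .push(path.to_string());
    }
    Ok(groups)
}

fn main() -> io::Result<()> {
    // With the files created above, a/1 + b/1 and a/2 + b/2 land in the
    // same groups, so 20 GB instead of 40 GB would be read.
    for ((dev, ino), paths) in group_by_inode(&["a/1", "a/2", "b/1", "b/2"])? {
        println!("dev {dev} ino {ino}: {paths:?}");
    }
    Ok(())
}
```
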
pkolaczk commented 2 years ago

I think you can use --cache to avoid it.

pkolaczk commented 2 years ago

Anyway, this is a known issue. It started happening after I removed the code that pruned hardlinks early in the process, so that all hardlinks could be reported in the output. However, I need to add an in-memory caching layer to avoid hashing files we have already hashed once.
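
A minimal sketch of such a caching layer, assuming hashes can be memoized on the (device, inode) key so that the second and later hardlinks to an already-hashed inode are answered from memory; hash_contents here is a hypothetical stand-in, not fclones's real hash function:

```rust
use std::collections::HashMap;
use std::fs::File;
use std::io::{self, Read};
use std::os::unix::fs::MetadataExt;

/// Hypothetical stand-in for the real content hash (a simple byte fold).
fn hash_contents(path: &str) -> io::Result<u64> {
    let mut file = File::open(path)?;
    let mut buf = [0u8; 64 * 1024];
    let mut hash: u64 = 0;
    loop {
        let n = file.read(&mut buf)?;
        if n == 0 { break; }
        for &b in &buf[..n] {
            hash = hash.wrapping_mul(31).wrapping_add(b as u64);
        }
    }
    Ok(hash)
}

/// Memoize hashes by (device, inode): hardlinks to an inode that was
/// already hashed are served from memory instead of re-reading the file.
fn hash_cached(path: &str, cache: &mut HashMap<(u64, u64), u64>) -> io::Result<u64> {
    let meta = std::fs::metadata(path)?;
    let key = (meta.dev(), meta.ino());
    if let Some(&h) = cache.get(&key) {
        return Ok(h); // second hardlink: no extra read
    }
    let h = hash_contents(path)?;
    cache.insert(key, h);
    Ok(h)
}

fn main() -> io::Result<()> {
    let mut cache = HashMap::new();
    // b/1 and b/2 hit the cache, so only a/1 and a/2 are actually read.
    for path in ["a/1", "a/2", "b/1", "b/2"] {
        println!("{path}: {:016x}", hash_cached(path, &mut cache)?);
    }
    Ok(())
}
```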

hans-helmut commented 2 years ago

> I think you can use --cache to avoid it.

Well, not on the first run:

me@pc:~/tmp$ rm -rf ~/.cache/fclones/
me@pc:~/tmp$ time fclones group  a b
[2022-06-13 17:32:42.201] fclones:  info: Started grouping
[2022-06-13 17:32:42.205] fclones:  info: Scanned 6 file entries
[2022-06-13 17:32:42.205] fclones:  info: Found 4 (42.9 GB) files matching selection criteria
[2022-06-13 17:32:42.205] fclones:  info: Found 3 (32.2 GB) candidates after grouping by size
[2022-06-13 17:32:42.205] fclones:  info: Found 3 (32.2 GB) candidates after grouping by paths and file identifiers
[2022-06-13 17:32:42.209] fclones:  info: Found 3 (32.2 GB) candidates after grouping by prefix
[2022-06-13 17:32:42.216] fclones:  info: Found 3 (32.2 GB) candidates after grouping by suffix
[2022-06-13 17:32:54.937] fclones:  info: Found 3 (32.2 GB) redundant files
# Report by fclones 0.25.0
# Timestamp: 2022-06-13 17:32:54.939 +0200
# Command: fclones group a b
# Base dir: /home/me/tmp
# Total: 42949672960 B (42.9 GB) in 4 files in 1 groups
# Redundant: 32212254720 B (32.2 GB) in 3 files
# Missing: 0 B (0 B) in 0 files
32d0c7c0740d7b71703c5df2f89dce3d, 10737418240 B (10.7 GB) * 4:
/home/me/tmp/a/1
/home/me/tmp/a/2
/home/me/tmp/b/1
/home/me/tmp/b/2

real    0m12,759s
user    0m3,318s
sys     0m12,767s
me@pc:~/tmp$ time fclones group --cache a b
[2022-06-13 17:33:03.823] fclones:  info: Started grouping
[2022-06-13 17:33:03.846] fclones:  info: Scanned 6 file entries
[2022-06-13 17:33:03.846] fclones:  info: Found 4 (42.9 GB) files matching selection criteria
[2022-06-13 17:33:03.846] fclones:  info: Found 3 (32.2 GB) candidates after grouping by size
[2022-06-13 17:33:03.847] fclones:  info: Found 3 (32.2 GB) candidates after grouping by paths and file identifiers
[2022-06-13 17:33:03.850] fclones:  info: Found 3 (32.2 GB) candidates after grouping by prefix
[2022-06-13 17:33:03.850] fclones:  info: Found 3 (32.2 GB) candidates after grouping by suffix
[2022-06-13 17:33:18.567] fclones:  info: Found 3 (32.2 GB) redundant files
# Report by fclones 0.25.0
# Timestamp: 2022-06-13 17:33:18.575 +0200
# Command: fclones group --cache a b
# Base dir: /home/me/tmp
# Total: 42949672960 B (42.9 GB) in 4 files in 1 groups
# Redundant: 32212254720 B (32.2 GB) in 3 files
# Missing: 0 B (0 B) in 0 files
32d0c7c0740d7b71703c5df2f89dce3d, 10737418240 B (10.7 GB) * 4:
/home/me/tmp/a/1
/home/me/tmp/a/2
/home/me/tmp/b/1
/home/me/tmp/b/2

real    0m14,768s
user    0m3,632s
sys     0m14,287s
me@pc:~/tmp$ time fclones group --cache a b
[2022-06-13 17:33:26.262] fclones:  info: Started grouping
[2022-06-13 17:33:26.285] fclones:  info: Scanned 6 file entries
[2022-06-13 17:33:26.285] fclones:  info: Found 4 (42.9 GB) files matching selection criteria
[2022-06-13 17:33:26.286] fclones:  info: Found 3 (32.2 GB) candidates after grouping by size
[2022-06-13 17:33:26.286] fclones:  info: Found 3 (32.2 GB) candidates after grouping by paths and file identifiers
[2022-06-13 17:33:26.289] fclones:  info: Found 3 (32.2 GB) candidates after grouping by prefix
[2022-06-13 17:33:26.290] fclones:  info: Found 3 (32.2 GB) candidates after grouping by suffix
[2022-06-13 17:33:26.292] fclones:  info: Found 3 (32.2 GB) redundant files
# Report by fclones 0.25.0
# Timestamp: 2022-06-13 17:33:26.295 +0200
# Command: fclones group --cache a b
# Base dir: /home/me/tmp
# Total: 42949672960 B (42.9 GB) in 4 files in 1 groups
# Redundant: 32212254720 B (32.2 GB) in 3 files
# Missing: 0 B (0 B) in 0 files
32d0c7c0740d7b71703c5df2f89dce3d, 10737418240 B (10.7 GB) * 4:
/home/me/tmp/a/1
/home/me/tmp/a/2
/home/me/tmp/b/1
/home/me/tmp/b/2

real    0m0,047s
user    0m0,006s
sys     0m0,063s
me@pc:~/tmp$
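
The near-instant second --cache run above is consistent with a persistent cache that serves stored hashes for files whose metadata has not changed. A sketch of one plausible cache key, assuming (device, inode, length, mtime) identifies an unchanged file; this illustrates the idea only and is not fclones's actual cache format:

```rust
use std::collections::HashMap;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::time::SystemTime;

/// Assumed cache key: if none of these fields changed, the stored hash
/// can be reused without reading the file (illustrative, not fclones's format).
#[derive(Hash, PartialEq, Eq)]
struct CacheKey {
    dev: u64,
    ino: u64,
    len: u64,
    mtime: SystemTime,
}

fn cache_key(path: &str) -> io::Result<CacheKey> {
    let meta = std::fs::metadata(path)?;
    Ok(CacheKey {
        dev: meta.dev(),
        ino: meta.ino(),
        len: meta.len(),
        mtime: meta.modified()?,
    })
}

/// Look up a stored hash; None means the file must be hashed again.
fn lookup(cache: &HashMap<CacheKey, u64>, path: &str) -> io::Result<Option<u64>> {
    Ok(cache.get(&cache_key(path)?).copied())
}

fn main() -> io::Result<()> {
    // In a real tool the map would be loaded from disk between runs.
    let cache: HashMap<CacheKey, u64> = HashMap::new();
    println!("{:?}", lookup(&cache, "a/1")?);
    Ok(())
}
```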