Closed slesru closed 3 years ago
But do they have the same logical structure, i.e. the exact same extents? I suspect not. Can you provide the output of `filefrag -v` for both files, as well as the output of `md5sum` on both files? Also consult the man page about `--dedupe-options=partial`.
I got these files by using `cp eniki.img eniki1.img`, so these files are identical:
```
dm@dm:~/test$ filefrag -v eniki1.img
Filesystem type is: 58465342
File size of eniki1.img is 34359738368 (8388608 blocks of 4096 bytes)
 ext:    logical_offset:    physical_offset:  length:  expected: flags:
   0:        0..  131055:   6291822..  6422877:  131056:
   1:   131056.. 1048559:  11796510.. 12714013:  917504:   6422878:
   2:  1048560.. 2097135:  18362909.. 19411484: 1048576:  12714014:
   3:  2097136.. 3145711:  19660813.. 20709388: 1048576:  19411485:
   4:  3145712.. 5242862:  20709389.. 22806539: 2097151:
   5:  5242863.. 7340013:  22806540.. 24903690: 2097151:
   6:  7340014.. 8388607:  24903691.. 25952284: 1048594: last,eof
eniki1.img: 4 extents found
dm@dm:~/test$ filefrag -v eniki1.img
Filesystem type is: 58465342
File size of eniki1.img is 34359738368 (8388608 blocks of 4096 bytes)
 ext:    logical_offset:    physical_offset:  length:  expected: flags:
   0:        0..  131055:   6291822..  6422877:  131056:
   1:   131056.. 1048559:  11796510.. 12714013:  917504:   6422878:
   2:  1048560.. 2097135:  18362909.. 19411484: 1048576:  12714014:
   3:  2097136.. 3145711:  19660813.. 20709388: 1048576:  19411485:
   4:  3145712.. 5242862:  20709389.. 22806539: 2097151:
   5:  5242863.. 7340013:  22806540.. 24903690: 2097151:
   6:  7340014.. 8388607:  24903691.. 25952284: 1048594: last,eof
eniki1.img: 4 extents found
dm@dm:~/test$
```
```
md5sum /home/dm/test/eniki.img
b409c344fcd2f49e74c9cac9d7b719fe  /home/dm/test/eniki.img
md5sum /home/dm/test/eniki1.img
b409c344fcd2f49e74c9cac9d7b719fe  /home/dm/test/eniki1.img
```
By the way, duperemove v0.11.1 works, but it takes a very long time, so I stopped it.
You have provided the filefrag for eniki1.img twice. Also from the FAQ:
> I got two identical files, why are they not deduped?
>
> Duperemove by default works on extent granularity. What this means is if there
> are two files which are logically identical (have the same content) but are
> laid out on disk with different extent structure they won't be deduped. For
> example if 2 files are 128k each and their content are identical but one of
> them consists of a single 128k extent and the other of 2 x 64k extents then
> they won't be deduped. This behavior is dependent on the current implementation
> and is subject to change as duperemove is being improved.
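The FAQ's 128k example can be illustrated with a toy model (this is a hypothetical sketch, not duperemove's actual code): if the unit of comparison is the extent, two files holding the same bytes but split into different extents share no hashes at all.

```python
import hashlib

def extent_hashes(data, extent_lengths):
    """Hash each extent separately, as extent-granularity dedupe does."""
    hashes, pos = [], 0
    for length in extent_lengths:
        hashes.append(hashlib.sha256(data[pos:pos + length]).hexdigest())
        pos += length
    return hashes

# 128 KiB of identical content in both "files".
data = bytes(range(256)) * 512

one_extent = extent_hashes(data, [128 * 1024])       # single 128k extent
two_extents = extent_hashes(data, [64 * 1024] * 2)   # 2 x 64k extents

# Same bytes, yet no extent hash matches -> nothing to dedupe.
print(set(one_extent) & set(two_extents))  # set()
```

The hash of a whole region is unrelated to the hashes of its halves, which is exactly why identical content with different extent layouts produces zero matches at extent granularity.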
Sorry for the mistake.
```
dm@dm:~/test$ filefrag -v eniki.img
Filesystem type is: 58465342
File size of eniki.img is 34359738368 (8388608 blocks of 4096 bytes)
 ext:    logical_offset:    physical_offset:  length:  expected: flags:
   0:        0.. 2097135:        24..  2097159: 2097136:
   1:  2097136.. 3145711:   2097160..  3145735: 1048576:
   2:  3145712.. 5242862:   3145736..  5242886: 2097151:
   3:  5242863.. 7339998:   6553613..  8650748: 2097136:   5242887:
   4:  7339999.. 8388607:   8650749..  9699357: 1048609: last,eof
eniki.img: 2 extents found
```
Well, as I said, I got this by using cp; I can't understand how I can get a different layout. But anyway, here are two different files:
```
time ./duperemove -hdr eniki.img beniki.img
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/dm/test/eniki.img
[2/2] (100.00%) csum: /home/dm/test/beniki.img
Total files:  2
Total extent hashes: 2
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.

real    0m23,003s
user    0m3,479s
sys     0m6,702s
```
It stops almost immediately, with no comparison at all, as you can guess from the execution time...
Yeah, so looking at the layout of the files, eniki.img has only 5 extents whereas eniki1.img has 7 extents. So you are naturally hitting a deficiency in the current implementation, where dedup is performed only on exact extents. Alternatively, you can run dedup on fixed block sizes, which can be smaller than an extent; in that case duperemove should be able to find more commonality between the two files and possibly dedupe them. Read the FAQ I pasted above to see why this is happening.
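The block-based alternative can be sketched with the same kind of toy model (again hypothetical, not duperemove's implementation; the real tool exposes a block size via its `-b` option): hashing fixed-size blocks ignores the on-disk extent layout entirely, so identical content always yields identical hashes.

```python
import hashlib

BLOCK = 64 * 1024  # fixed hashing block size for this sketch

def block_hashes(data, block=BLOCK):
    # Split at fixed offsets and hash each block; extent boundaries
    # never enter the picture, only the byte content does.
    return {hashlib.sha256(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)}

# 128 KiB of non-repeating content; on disk one copy might be a single
# extent and the other 2 x 64k extents, but block hashing cannot tell.
content = b"".join(i.to_bytes(4, "big") for i in range(32 * 1024))

shared = block_hashes(content) & block_hashes(content)
print(len(shared))  # 2 -- both 64k blocks match, so there is work to dedupe
```

This is why a block-granularity pass can find commonality that an extent-granularity pass misses: the unit of comparison is the same for both files by construction.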
Thank you very much!
Closing as this is expected to not work at the moment.
Hello!
Ubuntu 20.04, compiled current master:
```
./duperemove --version
duperemove v0.12.dev
./duperemove -hdr eniki.img eniki1.img
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/dm/test/eniki.img
[2/2] (100.00%) csum: /home/dm/test/eniki1.img
Total files:  2
Total extent hashes: 3
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.
```
These 2 files are the same...
Thank you!