markfasheh / duperemove

Tools for deduping file systems
GNU General Public License v2.0

no dupes #239

Closed slesru closed 3 years ago

slesru commented 4 years ago

Hello!

Ubuntu 20.04, compiled current master:

./duperemove --version
duperemove v0.12.dev

./duperemove -hdr eniki.img eniki1.img
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/dm/test/eniki.img
[2/2] (100.00%) csum: /home/dm/test/eniki1.img
Total files: 2
Total extent hashes: 3
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.

These 2 files are the same...

Thank you!

lorddoskias commented 4 years ago

But do they have the same logical structure, i.e. the exact same extents? I suspect not. Can you provide the output of filefrag -v for both files, as well as the output of md5sum on both files? Also consult the man page about --dedupe-options=partial

slesru commented 4 years ago

I got these files by using cp eniki.img eniki1.img. So these files are identical:

dm@dm:~/test$ filefrag -v eniki1.img
Filesystem type is: 58465342
File size of eniki1.img is 34359738368 (8388608 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset:  length:   expected: flags:
   0:        0..  131055:    6291822..   6422877:  131056:
   1:   131056.. 1048559:   11796510..  12714013:  917504:    6422878:
   2:  1048560.. 2097135:   18362909..  19411484: 1048576:   12714014:
   3:  2097136.. 3145711:   19660813..  20709388: 1048576:   19411485:
   4:  3145712.. 5242862:   20709389..  22806539: 2097151:
   5:  5242863.. 7340013:   22806540..  24903690: 2097151:
   6:  7340014.. 8388607:   24903691..  25952284: 1048594:             last,eof
eniki1.img: 4 extents found

dm@dm:~/test$ filefrag -v eniki1.img
Filesystem type is: 58465342
File size of eniki1.img is 34359738368 (8388608 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset:  length:   expected: flags:
   0:        0..  131055:    6291822..   6422877:  131056:
   1:   131056.. 1048559:   11796510..  12714013:  917504:    6422878:
   2:  1048560.. 2097135:   18362909..  19411484: 1048576:   12714014:
   3:  2097136.. 3145711:   19660813..  20709388: 1048576:   19411485:
   4:  3145712.. 5242862:   20709389..  22806539: 2097151:
   5:  5242863.. 7340013:   22806540..  24903690: 2097151:
   6:  7340014.. 8388607:   24903691..  25952284: 1048594:             last,eof
eniki1.img: 4 extents found
dm@dm:~/test$

md5sum /home/dm/test/eniki.img
b409c344fcd2f49e74c9cac9d7b719fe /home/dm/test/eniki.img

md5sum /home/dm/test/eniki1.img
b409c344fcd2f49e74c9cac9d7b719fe /home/dm/test/eniki1.img

By the way, duperemove v0.11.1 works, but it takes a very long time, so I stopped it.

lorddoskias commented 4 years ago

You have provided the filefrag for eniki1.img twice. Also from the FAQ:

I got two identical files, why are they not deduped?

Duperemove by default works on extent granularity. What this means is if there
are two files which are logically identical (have the same content) but are
laid out on disk with different extent structure they won't be deduped. For
example if 2 files are 128k each and their content are identical but one of
them consists of a single 128k extent and the other of 2 x 64k extents then
they won't be deduped. This behavior is dependent on the current implementation
and is subject to change as duperemove is being improved.
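
The FAQ's 128k example can be sketched in a few lines. This is only a toy model, not duperemove's actual code: the helper `extent_hashes`, the use of SHA-256, and the in-memory buffer are all assumptions made for illustration. It hashes the same content once as a single 128 KiB extent and once as two 64 KiB extents, and shows that the per-extent hashes share nothing.

```python
import hashlib

KIB = 1024


def extent_hashes(data: bytes, extent_lengths: list[int]) -> set[str]:
    """Hash each extent of `data` as one unit (hypothetical helper).

    `extent_lengths` lists the extent sizes in bytes, in logical order.
    """
    assert sum(extent_lengths) == len(data)
    hashes = set()
    offset = 0
    for length in extent_lengths:
        hashes.add(hashlib.sha256(data[offset:offset + length]).hexdigest())
        offset += length
    return hashes


# 128 KiB of non-repeating content, standing in for a file's data.
content = b"".join(i.to_bytes(4, "little") for i in range(32 * KIB))

file_a = extent_hashes(content, [128 * KIB])          # one 128 KiB extent
file_b = extent_hashes(content, [64 * KIB, 64 * KIB])  # two 64 KiB extents

# Identical content, but no per-extent hash is shared between the layouts.
common = file_a & file_b
print(f"shared extent hashes: {len(common)}")
```

Because each hash covers a whole extent, a different split of the same bytes produces entirely different hashes, which is why a tool matching on extent hashes alone would report nothing to dedupe.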

slesru commented 4 years ago

Sorry for the mistake:

filefrag -v eniki.img
Filesystem type is: 58465342
File size of eniki.img is 34359738368 (8388608 blocks of 4096 bytes)
 ext:     logical_offset:        physical_offset:  length:   expected: flags:
   0:        0.. 2097135:         24..   2097159: 2097136:
   1:  2097136.. 3145711:    2097160..   3145735: 1048576:
   2:  3145712.. 5242862:    3145736..   5242886: 2097151:
   3:  5242863.. 7339998:    6553613..   8650748: 2097136:    5242887:
   4:  7339999.. 8388607:    8650749..   9699357: 1048609:             last,eof
eniki.img: 2 extents found

Well, as I said, I got this by using cp, so I can't understand how I can get a different layout. But anyway, here are 2 different files:

time ./duperemove -hdr eniki.img beniki.img
Gathering file list...
Using 8 threads for file hashing phase
[1/2] (50.00%) csum: /home/dm/test/eniki.img
[2/2] (100.00%) csum: /home/dm/test/beniki.img
Total files: 2
Total extent hashes: 2
Loading only duplicated hashes from hashfile.
Found 0 identical extents.
Simple read and compare of file data found 0 instances of extents that might benefit from deduplication.
Nothing to dedupe.

real 0m23,003s
user 0m3,479s
sys 0m6,702s

It stops immediately, with no comparison at all, as you can guess from the execution time...

lorddoskias commented 4 years ago

Yeah, so looking at the layout of the files, eniki.img has only 5 extents whereas eniki1.img has 7 extents. So you are naturally hitting a deficiency in the current implementation, where dedupe is performed only on exact extents. Alternatively, you can run a dedupe that works on fixed block sizes, which can be smaller than an extent; in this case duperemove should be able to find more commonality between the two files and possibly dedupe them. Read the FAQ I pasted above for why this is happening.
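
To illustrate why fixed block sizes help, here is a rough sketch. It is not how duperemove is implemented: the helper `block_hashes`, the 4096-byte block size, SHA-256, and the two scaled-down layouts are all assumptions for the example. Blocks are hashed at fixed logical offsets, so the extent layout has no influence on the result.

```python
import hashlib

BLOCK = 4096  # assumed block size for this sketch


def block_hashes(data: bytes, extent_lengths: list[int],
                 block_size: int = BLOCK) -> list[str]:
    """Hash fixed-size blocks at fixed logical offsets (hypothetical helper).

    `extent_lengths` is accepted only to make the point that the extent
    layout plays no part in the computation.
    """
    assert sum(extent_lengths) == len(data)
    return [
        hashlib.sha256(data[off:off + block_size]).hexdigest()
        for off in range(0, len(data), block_size)
    ]


# 48 KiB (12 blocks) of non-repeating content shared by both "files".
content = b"".join(i.to_bytes(4, "little") for i in range(12 * 1024))

layout_a = [5 * BLOCK, 7 * BLOCK]             # 2 extents
layout_b = [2 * BLOCK, 4 * BLOCK, 6 * BLOCK]  # 3 extents

hashes_a = block_hashes(content, layout_a)
hashes_b = block_hashes(content, layout_b)

# Count blocks whose hashes agree at the same logical offset.
matching_blocks = sum(a == b for a, b in zip(hashes_a, hashes_b))
print(f"{matching_blocks} of {len(hashes_a)} blocks match")
```

The trade-off is that block-level hashing produces many more hashes than extent-level hashing, so it costs more time and space on large files.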

slesru commented 4 years ago

Thank you very much!

lorddoskias commented 3 years ago

Closing as this is expected to not work at the moment.