pixelb opened this issue 9 years ago.
Comment #1 originally posted by pixelb on 2011-06-11T20:51:07.000Z:
Thanks for the lucid suggestion.
For confirmation on the algorithm used see: http://fslint.googlecode.com/svn/trunk/doc/FAQ
You've no doubt looked at the findup shell script, so the first thing you can do is comment out the optional block of 7 lines around sha1sum.
As for sampling two files for comparison: your random-block-in-the-middle suggestion is good. Well, for video clips it would be good, but what if one had daily backups of a database that changed only a little every day? One could easily get a false positive there, so such an operation would need to be enabled explicitly with an "approx match" option.
I'm surprised you have many duplicate videos. How does that keep happening, if you don't mind me asking?
Comment #2 originally posted by pixelb on 2011-08-03T07:47:42.000Z:
I also vote for some kind of speed-up. I regularly search my drives for duplicates that somehow get there despite my best efforts. I have 10 TB+, and between the shuffling and moving of data, things happen.
For starters, I feel we should get rid of the MD5 hash and go straight to the stronger SHA1, or even higher. The disk is the slowest part, and reading the same file over and over doesn't fare well with a data set my size. Reading 100 GB of data is very slow, and the fact that there is no progress bar doesn't help. See http://bitcache.org/faq/hash-collision-probabilities
I would also like to vote for some kind of database or store of file hashes. If a file has the same name, date, size and location, assume it still has that hash (maybe with a small chunk SHA'ed as well). This might not help much, since once you delete duplicates another duplicate shouldn't just reappear, but maybe have a pre-emptive hasher that could run nightly for a while, building up the hash database to help speed things up in the future.
I'm not sure how useful this would be, but I do end up running fslint over and over again; perhaps something like Recoll or the Windows indexing service could help here in the future.
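To make the suggestion concrete, here is a minimal sketch of such a hash store, assuming an SQLite file keyed on path, size and mtime. The schema, the `cache_db_path` argument and the 1 MiB read size are illustrative choices of mine, not anything fslint currently does.

```python
import hashlib
import os
import sqlite3

def open_cache(cache_db_path):
    """Create (if needed) and return a connection to the hash cache."""
    conn = sqlite3.connect(cache_db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hashes "
        "(path TEXT PRIMARY KEY, size INTEGER, mtime INTEGER, sha1 TEXT)"
    )
    return conn

def cached_sha1(conn, path):
    """Return the file's SHA1, reusing the cached value when path, size
    and mtime all still match; otherwise rehash and update the cache."""
    st = os.stat(path)
    row = conn.execute(
        "SELECT sha1 FROM hashes WHERE path=? AND size=? AND mtime=?",
        (path, st.st_size, int(st.st_mtime)),
    ).fetchone()
    if row:
        return row[0]
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    digest = h.hexdigest()
    conn.execute(
        "INSERT OR REPLACE INTO hashes VALUES (?, ?, ?, ?)",
        (path, st.st_size, int(st.st_mtime), digest),
    )
    conn.commit()
    return digest
```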
Comment #3 originally posted by pixelb on 2011-08-08T22:53:05.000Z:
Caching would need to be done at the file-system level, I think. I've suggested this as a feature for BTRFS, i.e. have extended attributes (where we could store a hash) that are automatically deleted when the data is modified. There wasn't much interest in this feature, unfortunately. It would be easy enough to implement, I think, but I'm probably not going to get time to do it.
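In the absence of that kernel support, a user-space approximation is possible: store the hash in an extended attribute together with the file's mtime and treat it as stale when the mtime no longer matches. A rough, Linux-only sketch; the attribute name `user.fslint.sha1` is made up here:

```python
import hashlib
import os

XATTR = b"user.fslint.sha1"

def sha1_with_xattr_cache(path):
    """Return the file's SHA1, reusing a hash cached in an xattr when the
    recorded mtime still matches the file's current mtime."""
    st = os.stat(path)
    try:
        mtime, digest = os.getxattr(path, XATTR).decode().split(":", 1)
        if int(mtime) == int(st.st_mtime):
            return digest
    except OSError:
        pass  # no attribute yet, or xattrs unsupported on this filesystem
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    digest = h.hexdigest()
    try:
        os.setxattr(path, XATTR, f"{int(st.st_mtime)}:{digest}".encode())
    except OSError:
        pass  # read-only filesystem or xattrs not permitted
    return digest
```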
I'm considering different ways to speed up hashing. The main tasks at present are to treat hard-linked files efficiently and to handle very large files efficiently.
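For the hard-link part, the key observation is that paths sharing a (device, inode) pair are the same data on disk and only need to be read once. A small illustrative sketch, not findup's actual code:

```python
import os
from collections import defaultdict

def group_hardlinks(paths):
    """Map each (device, inode) pair to the paths that point at it,
    so each underlying file is read and hashed only once."""
    groups = defaultdict(list)
    for p in paths:
        st = os.stat(p)
        groups[(st.st_dev, st.st_ino)].append(p)
    return groups
```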
Bump! Any recent development? I am trying to use fslint to clean up duplicates produced during a data recovery process. My data set is almost 1 TB, and fslint appears to be absolutely useless for me.
@lodotek What does your data set look like? Lots of big files? Lots of same-sized files? Lots of existing hardlinks?
I am currently having the same problem with almost 4 TB of video files. Is a checksum always created, or only if the files have the same size?
Only if the files are the same size. Also, we only checksum as much of each same-sized file as is needed to tell them apart.
Alright, thanks for clarification.
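To illustrate the "only as much as needed" behaviour described above, here is a rough sketch of one way to do it, assuming a group of same-sized candidate files: hash them a chunk at a time and split the group whenever the chunks diverge, so non-duplicates stop being read as soon as they differ. This illustrates the idea only, not findup's implementation (and it naively keeps every file open at once).

```python
import hashlib

CHUNK = 1 << 20  # read and compare 1 MiB at a time

def split_same_size_group(paths):
    """Split a group of same-sized files into sub-groups with identical
    contents, reading each file only until it differs from the others."""
    handles = {p: open(p, "rb") for p in paths}  # naive: all open at once
    groups = [list(paths)]
    try:
        while True:
            next_groups = []
            more_data = False
            for group in groups:
                if len(group) < 2:
                    next_groups.append(group)  # already unique, stop reading it
                    continue
                buckets = {}
                for p in group:
                    block = handles[p].read(CHUNK)
                    if block:
                        more_data = True
                    digest = hashlib.md5(block).hexdigest()
                    buckets.setdefault(digest, []).append(p)
                next_groups.extend(buckets.values())
            groups = next_groups
            if not more_data:
                break
    finally:
        for f in handles.values():
            f.close()
    return [g for g in groups if len(g) > 1]
```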
Original issue 66 created by pixelb on 2011-06-11T20:31:54.000Z:
As I read the code now, the algorithm to find duplicates is exceptionally thorough, walking through:

- grouping files by size, so only same-sized files are ever compared;
- an MD5 hash of the start of each remaining candidate;
- an MD5 hash of the entire file;
- a SHA1 hash of the entire file for anything that still matches.
The good news is that this makes it about 99.9999999% likely that you will never mark two different files as duplicates. The bad news is that for each file, it requires 3 separate file accesses, where two of them must read the entire file. One of the things I use fslint for is to identify duplicate video clips, many of which are multiple GB in size.
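For reference, a condensed sketch of that thorough pipeline, assuming `same_size_groups` is a list of path lists already grouped by size; it is only meant to illustrate the amount of reading involved, not to mirror findup line for line.

```python
import hashlib
from collections import defaultdict

def _digest(path, algo, limit=None):
    """Hash the first `limit` bytes of the file (all of it if limit is None)."""
    h = hashlib.new(algo)
    remaining = limit
    with open(path, "rb") as f:
        while True:
            size = 1 << 20 if remaining is None else min(1 << 20, remaining)
            block = f.read(size)
            if not block:
                break
            h.update(block)
            if remaining is not None:
                remaining -= len(block)
                if remaining == 0:
                    break
    return h.hexdigest()

def refine(groups, key):
    """Split each candidate group by `key`, keeping only groups of 2+ files."""
    out = []
    for group in groups:
        buckets = defaultdict(list)
        for p in group:
            buckets[key(p)].append(p)
        out.extend(g for g in buckets.values() if len(g) > 1)
    return out

def find_duplicates(same_size_groups):
    """Partial MD5, then full MD5, then full SHA1 over each same-size group."""
    groups = refine(same_size_groups, lambda p: _digest(p, "md5", 4096))
    groups = refine(groups, lambda p: _digest(p, "md5"))
    groups = refine(groups, lambda p: _digest(p, "sha1"))
    return groups
```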
I would like to propose an option where the user can change the algorithm to something like this:
If the file size is greater than some configurable value, then instead of running full MD5 and SHA hashes on it, you choose a block (maybe 4MB?) from the front of the file, a block from the middle, and a block from the end, and run your hashes on only those blocks in an attempt to find duplicates.
If we look at the example of a 1GB file, this means that we'd run something like:
md5 file1( 0 thru 4MB )   vs  md5 file2( 0 thru 4MB )
md5 file1( X thru X+4MB ) vs  md5 file2( X thru X+4MB )   (where X is random between 0 and the file size)
sha file1( Y-4MB thru Y ) vs  sha file2( Y-4MB thru Y )   (where Y is the file size)
The main concept here is that the odds of having two files where:
- the first 4MB block has the same MD5 hash, and
- a random 4MB block in the middle has the same MD5 hash, and
- the last 4MB block has the same SHA1 hash

have got to be astronomically low. I'll grant you that the odds are not as low as with the existing algorithm, where you read every byte of the file, but I think the runtime improvement would be well worth it (I know that it would be worth it for me!)...
Let's say that I have 100 files totaling 20 GB that are actually duplicates. The script today is going to have to read and generate hashes for that 20 GB twice, plus 100 partial MD5s. With the proposed algorithm, the maximum amount read for any file is now 12 MB (three 4 MB blocks), so this is going to run a couple of orders of magnitude faster.
I hacked up something similar to this myself using 'dd' and 'md5sum', and it worked, but I think this would be a relatively easy mod to make. If there's any way I can help, please let me know...
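For anyone curious, here is a rough Python equivalent of that dd + md5sum hack: fingerprint a large file from a head, middle and tail block instead of reading every byte. The 4 MB block size, the single SHA1 over all three blocks (rather than separate MD5/SHA1 runs per block) and the optional shared middle offset are simplifications of my own, not fslint behaviour.

```python
import hashlib
import os

BLOCK = 4 * 1024 * 1024  # 4 MB sample size (illustrative)

def sampled_fingerprint(path, middle_offset=None):
    """Digest of the file size plus its first, middle and last BLOCK bytes.

    Files being compared must be the same size and must use the same
    middle_offset, otherwise their fingerprints are not comparable.
    """
    size = os.path.getsize(path)
    if size <= 3 * BLOCK:
        with open(path, "rb") as f:  # small file: just hash everything
            return hashlib.sha1(f.read()).hexdigest()
    if middle_offset is None:
        middle_offset = (size - BLOCK) // 2
    h = hashlib.sha1(str(size).encode())
    with open(path, "rb") as f:
        for offset in (0, middle_offset, size - BLOCK):
            f.seek(offset)
            h.update(f.read(BLOCK))
    return h.hexdigest()
```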