jessek / hashdeep

Hash of a whole drive #375

Open larytet opened 7 years ago

larytet commented 7 years ago

My goal is to hash all files on an HDD in the 0.5-1 TB range. What is my bottleneck going to be: CPU or I/O? Does it make sense to read and hash the physical sectors on the hard disk, and then map the hashes to files at the end of the process using a tool like debugfs?

If the drive is a high-end SSD, does that change the equation?

Thanks

larytet commented 7 years ago

Found this https://crypto.stackexchange.com/questions/46469/is-hashing-large-files-cpu-or-i-o-bound
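A rough way to check the CPU side of this on a particular machine is to time the hash functions on in-memory data and compare the result with the drive's sequential read speed. The sketch below is only an illustration: the function name `hash_throughput_mb_s` and the buffer sizes are assumptions, and it uses Python's standard `hashlib` rather than hashdeep itself.

```python
import hashlib
import time


def hash_throughput_mb_s(algorithm: str, total_mb: int = 64, chunk_mb: int = 1) -> float:
    """Return how many MB/s this CPU hashes with `algorithm`, using in-memory data only."""
    chunk = bytes(chunk_mb * 1024 * 1024)  # zero-filled buffer, so no disk I/O is involved
    hasher = hashlib.new(algorithm)
    start = time.perf_counter()
    for _ in range(total_mb // chunk_mb):
        hasher.update(chunk)
    hasher.digest()  # finalize so the full hash cost is included in the timing
    elapsed = time.perf_counter() - start
    return total_mb / elapsed


if __name__ == "__main__":
    # Hypothetical selection of algorithms; adjust to the ones you care about.
    for name in ("md5", "sha1", "sha256"):
        print(f"{name}: {hash_throughput_mb_s(name):.0f} MB/s")
```

If the reported rate is well above what the drive can deliver, the job is likely I/O bound; if it is below, a single hashing thread is likely the limit.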

larytet commented 7 years ago

On Linux there is debugfs: https://linux.die.net/man/8/debugfs. I can read the drive sector by sector, map sectors to files, feed the data to the SHA state machines and, eventually, get a SHA hash for every file on the disk without doing open-read-close. Or so it appears. What am I missing?

What about Windows?

jessek commented 7 years ago

My guess is that you will be I/O bound. This is a WAG, however, and not based on any information specific to your system. I also believe, though I have no evidence to support it, that you will spend more time writing and debugging a system that reads sector by sector and then reconstructs files than you would spend just reading the files the regular way.
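For comparison, "the regular way" can be sketched in a few lines: walk the directory tree and stream each file through a hash in fixed-size chunks. This is a minimal illustration, not hashdeep's implementation; the names `hash_file` and `hash_tree`, the 1 MiB chunk size, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import os
from typing import Dict


def hash_file(path: str, algorithm: str = "sha256", chunk_size: int = 1024 * 1024) -> str:
    """Hash one file with open-read-close, reading chunk_size bytes at a time."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            hasher.update(chunk)
    return hasher.hexdigest()


def hash_tree(root: str, algorithm: str = "sha256") -> Dict[str, str]:
    """Return {path: hex digest} for every regular file under root."""
    digests: Dict[str, str] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and special files such as FIFOs or device nodes.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                digests[path] = hash_file(path, algorithm)
            except OSError:
                continue  # unreadable file; a real tool would report this
    return digests
```

Timing `hash_tree` on a representative directory gives a baseline against which any sector-level approach would have to be measured.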

larytet commented 7 years ago

The goal is to run this on hundreds of thousands of machines and VMs. In my case performance is critical; development effort is not.

jessek commented 7 years ago

If you have the time, you're welcome to go for it. Please let me know how it goes!

keybreak commented 5 years ago

@jessek I also have to hash whole drives a lot on Linux, like three 3.7 TB drives full of mixed types of data... which takes a really long time with MD5.

How about implementing a super-fast algorithm like xxHash, for the purpose of purely checking data integrity?

HaleTom commented 5 years ago

If you are CPU bound, you may want to look at xxHash.

xxHash is probably among the fastest non-cryptographic hash algorithms available today. Combined with the file size, a 64-bit hash is more than enough for non-cryptographic purposes such as detecting accidental corruption.
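The "more than enough" claim can be checked with the birthday bound. The sketch below estimates the chance that any two of `num_files` distinct files share a hash value, assuming the hash output is uniformly distributed; the function name `collision_probability` is hypothetical, and the estimate covers accidental collisions only, not deliberately crafted ones.

```python
import math


def collision_probability(num_files: int, hash_bits: int = 64) -> float:
    """Birthday-bound estimate: P(at least one collision) ~= 1 - exp(-pairs / 2**bits)."""
    pairs = num_files * (num_files - 1) / 2  # number of distinct file pairs
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2 ** hash_bits)


if __name__ == "__main__":
    for n in (10**6, 10**8, 10**9):
        print(f"{n} files: {collision_probability(n):.3g}")
```

For a million files the estimate is roughly 3 in 100 million. Comparing file sizes as well can only lower it, since two files would then need both the same size and the same hash.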