jessek / hashdeep


Flaw in hashdeep audit: zero size files are considered moved #334

Open chrisly42 opened 9 years ago

chrisly42 commented 9 years ago

Every empty file (size == 0) has the same (but meaningless) hash code. Therefore, during an audit, a file is considered moved from one location to another if there is another file with empty content.

This is a very weak assumption and should probably be ignored / filtered (optionally?).
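To illustrate the premise (a minimal Python sketch, not part of hashdeep): every zero-byte input produces the same well-known digest for a given algorithm, so the hash alone cannot distinguish two empty files.

```python
import hashlib

# Any zero-byte input yields the same fixed digest per algorithm.
empty_md5 = hashlib.md5(b"").hexdigest()
empty_sha256 = hashlib.sha256(b"").hexdigest()

print(empty_md5)     # d41d8cd98f00b204e9800998ecf8427e
print(empty_sha256)  # e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

# Two distinct empty files are therefore indistinguishable by hash alone.
assert hashlib.md5(b"").hexdigest() == empty_md5
```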

simsong commented 9 years ago

Please call them "hash values," not "hash codes", since "hash code" implies the software which performs the hashing.

Zero-length files do not have meaningless hashes --- they have the hashes of zero length files. Are you suggesting that the one zero length file was deleted and another was created, rather than the file was moved?

Does the "move" code consider timestamps?

On Jun 15, 2015, at 1:43 PM, chrisly42 notifications@github.com wrote:

chrisly42 commented 9 years ago

AFAIK the timestamp is not considered when marking a file as "moved", but I didn't check the code. Taking the timestamp into account would definitely strengthen the otherwise weak assumption about whether a zero-byte file has been moved.

I know that the hash eigenwuschel(TM) over an empty input string is a defined value (and often, implementations have bugs for this edge case). It is just that the hash value is not very meaningful: its entire information content is already precisely given by the single bit of information size == 0. In this case, that one bit fully describes the file content.

Hashdeep uses heuristics to infer the identity of two files, and they can be wrong (e.g. hash eigenwuschel(TM) collisions). The heuristic is very weak for zero-byte files. It could be improved for this edge case using: 1) the timestamp, as you already mentioned; 2) the Levenshtein distance (or something similar) between the two file names; 3) the node distance in the filesystem tree.
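Heuristic (2) could be sketched like this (a hedged Python sketch, not hashdeep code; the standard library has no Levenshtein function, so a small dynamic-programming version is shown):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Identical base names at different paths score low, suggesting a plausible move:
print(levenshtein("old/dir/empty.txt", "new/dir/empty.txt"))  # 3
```

A low distance between the old and new paths would make the "moved" verdict more credible than a bare hash match.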

But still, you need to define an (experimental) threshold for the decision. Or, more simply, offer an option to always treat zero-byte files as delete/create operations rather than as moves.
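One way the proposed option could look (a hypothetical sketch; the function and flag names are invented here and are not hashdeep's actual API): when the flag is set, a vanished zero-byte entry is always reported as deleted, never matched as moved.

```python
def classify_missing_file(size: int, hash_matches_elsewhere: bool,
                          zero_as_delete_create: bool = True) -> str:
    """Decide how an audit reports a file that vanished from its recorded
    location. Hypothetical helper, not hashdeep's actual implementation."""
    if size == 0 and zero_as_delete_create:
        # The hash of an empty file carries no identifying information,
        # so never infer a move from it.
        return "deleted"
    return "moved" if hash_matches_elsewhere else "deleted"

print(classify_missing_file(0, True))      # deleted
print(classify_missing_file(1024, True))   # moved
```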