idrassi / DirHash

Windows command line utility to compute hash of directories and files
BSD 3-Clause "New" or "Revised" License
111 stars 11 forks source link

Many open file handles when verifying hashsums with parameter "-threads" #35

Closed groone closed 1 month ago

groone commented 1 month ago

When verifying blake3 hashsums of a large deeply nested directory with -threads parameter DirHash keeps open a very large number of file handles. In my use case I saw it keeping open over 9000 file handles. That caused out of memory error in the virtual drive of a popular cloud sync friendly encryption software "Cryptomator".

And second point: Documentation should be updated to note that -threads parameter can significantly slow down the verification if files are on a spinning disk. After trying it out I learned that my use case takes 6 hours of disk thrashing with -threads and 1.5 hours without it. Spinning disks should be accessed in sequential order instead.

Third point: When creating hashsum file of large deeply nested directory with -threads parameter, the file names are not in lexicographical order and diffs between hashsum files are not usable

idrassi commented 1 month ago

Thank you for reporting these issues.

For the first point, the issue is not caused by a handle leak in the implementation. The core issue lies in the fact that, although each file handle is eventually closed, the rate of handle creation by the main thread is significantly faster than the rate of handle closure by the worker threads. This can lead to a temporary buildup of open handles, especially when processing a large number of files. To resolve this, I will implement a change to close the handle in the main thread as soon as the hash job is queued in the thread pool, and then have the worker thread reopen the file when its processing begins.

For the second point, I agree that documenting this side effect of using -thread on mechanical disks is warranted. I'm also considering adding a detection mechanism in DirHash to identify such drives, so that a warning can be displayed when -threads is used.

For the third point, it is definitely a good idea to sort entries in the output SUM file. Currently, we write the result of hashing to the SUM file as they arrive from the thread pool, so the order is not guaranteed. To add ordering, there are two approaches:

I think I will go with the post-processing approach since it has a lower memory footprint and is simpler to implement.

I will let you know when I implement these changes.

idrassi commented 1 month ago

@groone I have implemented changes to address the point your raised. For point 2, I just updated the readme with the warning. I didn't implement detection of SSD drives to print a warning.

Release 1.26 has been published with these changes. Let me know if it works for you.

groone commented 1 month ago

Hi @idrassi,

I verified that my main issue with file handle count causing crash in "Cryptomator" virtual drive is resolved in the latest version!

However I noticed a weird behavior with the sorting solution: Using -sumRelativePath with -threads writes relative paths during hashing but after sorting there are full paths in the file.

Also noticed that sort collation difference produces a different (but reproducible) order when using -threads. For example "Folder/file", "Folder-1/file", "Folder_1/file" lines are in different order if using -threads

Thanks for publishing awesome and very useful tools!

idrassi commented 1 month ago

Thank you, @groone, for the tests and the feedback.

Indeed, during sorting, I mistakenly reused a piece of code that automatically converts paths to absolute values. I will fix this.

Regarding the sorting order difference, it is due to the fact that without -threads, the sorting takes into account the directory structure (putting directories first and then files), while the sorting with -threads treats all path values equally, without considering the directory structure. For example, without -threads, we would have:

dev\crypto\file.txt
dev\crypto.7z

But with -threads, we have:

dev\crypto.7z
dev\crypto\file.txt

since '.' comes before '\' in lexicographical order.

The same applies to your case: without -threads, the folder names are sorted first, then their content. With -threads, everything is sorted together, which places "Folder-1\file" before "Folder\file.txt" since '-' comes before '\' in lexicographical order.

To fix this, I will need to implement logic to reconstruct the folder structure and then emulate the same flow as when traversing the folder structure. I will let you know when I have something working.

idrassi commented 1 month ago

@groone I have implemented the changes to fix the above issues : https://github.com/idrassi/DirHash/commit/b0d6709bbbd05217da48670189cf4d839efd0848

I published version 1.26.1 that includes the fix. Let me know if you are encountering any issues.