namazso / OpenHashTab

📝 File hashing and checking shell extension
GNU General Public License v3.0

Issue hashing big network folders #116

Open oveand opened 2 years ago

oveand commented 2 years ago

Trying to hash rather large folders, for example 175 GB and 10,000 files located on a Hitachi NAS network share, reveals some challenges.

First of all, the File Explorer window freezes for a rather long time before OpenHashTab shows up. This is probably due to some file indexing going on before the GUI is actually shown.

Second, and more importantly, hashing actually fails with the error "Not enough server storage is available to process this command". Some files are hashed correctly while others are not. This is probably not a direct OpenHashTab issue but perhaps related to how OpenHashTab accesses network shares.

[screenshot of the error dialog]

Do you know any way of overcoming this?

namazso commented 2 years ago

Hm... my first guess is that something doesn't let us open/lock that many files (as OpenHashTab tries to open all files before attempting to read anything from any of them). This doesn't cause immediate resource problems on local files because Windows allows up to 16 million handles per process, but I imagine SMB has some lower limits on this. The best solution would really be just hashing things on the server. The other alternative is raising open-file limits, but with that many files you'll probably hit Samba's limit first, then Linux's ulimits, then maybe your local client's.
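
For illustration only (this is not OpenHashTab's actual code), here is a minimal sketch of the eager-open pattern described above and how the server-side failure surfaces; the share path and file list are made up:

```cpp
// Sketch: open every file up front, and note that on an SMB share the server
// can run out of per-session handles long before the local ~16-million-handle
// limit is reached.
#include <windows.h>
#include <cwchar>
#include <string>
#include <vector>

int wmain()
{
    // Hypothetical list of files enumerated from the share.
    std::vector<std::wstring> files = {
        L"\\\\nas\\share\\folder\\a.bin",
        L"\\\\nas\\share\\folder\\b.bin",
        // ... thousands more
    };

    std::vector<HANDLE> handles;
    for (const auto& path : files)
    {
        HANDLE h = CreateFileW(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                               nullptr, OPEN_EXISTING,
                               FILE_FLAG_SEQUENTIAL_SCAN, nullptr);
        if (h == INVALID_HANDLE_VALUE)
        {
            const DWORD err = GetLastError();
            // 1130 == ERROR_NOT_ENOUGH_SERVER_MEMORY:
            // "Not enough server storage is available to process this command."
            wprintf(L"open failed for %ls (error %lu)\n", path.c_str(), err);
            continue;
        }
        handles.push_back(h); // held until hashing finishes
    }

    // ... hashing would start only after all of the opens above ...

    for (HANDLE h : handles)
        CloseHandle(h);
    return 0;
}
```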

oveand commented 2 years ago

I agree this is most likely a problem with the number of open file handles. Unfortunately the files are stored on a NAS device and are only accessible through SMB. I've been trying to hash in smaller batches, but it quickly gets complicated, as some folders have many small files and the folder nesting is rather deep :(

But I see this is a design decision that all files are opened before actually working on them. This is probably also the reason why OpenHashTab can take up to 5 minutes before the GUI shows when working with SMB shares (the link is 1 Gbit, but latency is probably significant).

namazso commented 2 years ago

I have some plans for rewriting / refactoring a major part of the code for 4.0; the API changes to AlgorithmsDll were actually a first step toward it (making the API interop-friendly for use with other languages like Rust or C#, in case I decide to rewrite in one of those).

But I see this is a design decision that all files are opened before actually working on them.

That is correct; the handles are used for figuring out canonical paths and similar. I plan to rewrite most of the path handling with the undocumented NT API, as it's much easier from a security perspective (there's already quite a bit of conversion mess going on with DOS short paths, DOS paths, and NT paths), and I'll see if I can get rid of opening files early. However, it would still need at least 512 files open concurrently, as that's how many we queue up for Windows to read.
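
As a minimal illustration of why a handle is involved in canonicalization at all, here is a sketch using the documented Win32 call rather than the NT API mentioned above (not the project's actual code): the canonical ("final") path can only be queried from a handle that is already open.

```cpp
// Sketch: GetFinalPathNameByHandleW resolves the normalized path of an
// already-opened handle, which is one reason files end up opened early.
#include <windows.h>
#include <cwchar>
#include <string>

// Returns the canonicalized path for an open handle, or empty on failure.
// The result typically carries a "\\?\" prefix.
std::wstring canonical_path(HANDLE file)
{
    std::wstring buf(MAX_PATH, L'\0');
    DWORD len = GetFinalPathNameByHandleW(
        file, buf.data(), static_cast<DWORD>(buf.size()),
        FILE_NAME_NORMALIZED | VOLUME_NAME_DOS);
    if (len >= buf.size())
    {
        // Buffer too small: len is the required size including the terminator.
        buf.resize(len);
        len = GetFinalPathNameByHandleW(
            file, buf.data(), static_cast<DWORD>(buf.size()),
            FILE_NAME_NORMALIZED | VOLUME_NAME_DOS);
    }
    if (len == 0)
        return {};
    buf.resize(len);
    return buf;
}

int wmain(int argc, wchar_t** argv)
{
    if (argc < 2)
        return 1;
    // A handle must exist before the canonical path can be queried.
    HANDLE h = CreateFileW(argv[1], 0, FILE_SHARE_READ | FILE_SHARE_WRITE,
                           nullptr, OPEN_EXISTING,
                           FILE_FLAG_BACKUP_SEMANTICS, nullptr);
    if (h == INVALID_HANDLE_VALUE)
        return 1;
    wprintf(L"%ls\n", canonical_path(h).c_str());
    CloseHandle(h);
    return 0;
}
```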

This is probably also the reason why OpenHashTab can take up to 5 minutes before the GUI shows when working with SMB shares (the link is 1 Gbit, but latency is probably significant).

Partially. Folders are traversed too, which takes some time, and all of this code is synchronous / blocking. I might try rewriting it in an async / multithreaded way in the future, but that needs lots of coordination. Since in a future version I want to allow adding files after the initial batch (maybe even mid-process), this part of the code will need quite an overhaul anyway.
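
Purely as a sketch of that async direction (not the actual plan or code), assuming std::filesystem and a hypothetical producer/consumer split: the traversal runs on a worker thread and hands paths back as it finds them, so the thread that shows the UI never waits on SMB round trips.

```cpp
// Sketch: enumerate a tree on a worker thread and hand paths back through a
// queue, instead of blocking on the full enumeration before doing anything.
#include <condition_variable>
#include <deque>
#include <filesystem>
#include <mutex>
#include <optional>
#include <thread>

namespace fs = std::filesystem;

struct PathQueue
{
    std::mutex m;
    std::condition_variable cv;
    std::deque<fs::path> items;
    bool done = false;

    void push(fs::path p)
    {
        { std::lock_guard lk(m); items.push_back(std::move(p)); }
        cv.notify_one();
    }
    void finish()
    {
        { std::lock_guard lk(m); done = true; }
        cv.notify_all();
    }
    std::optional<fs::path> pop()
    {
        std::unique_lock lk(m);
        cv.wait(lk, [&] { return !items.empty() || done; });
        if (items.empty()) return std::nullopt;
        fs::path p = std::move(items.front());
        items.pop_front();
        return p;
    }
};

int main()
{
    PathQueue queue;

    // Producer: recursive traversal. Every directory entry can be a network
    // round trip, so this must not run on the UI / shell thread.
    std::thread producer([&] {
        std::error_code ec;
        for (fs::recursive_directory_iterator it{ "\\\\nas\\share", ec }, end;
             it != end; it.increment(ec))
        {
            if (!ec && it->is_regular_file(ec))
                queue.push(it->path());
        }
        queue.finish();
    });

    // Consumer (stand-in for the UI / hashing side): receives paths as they
    // are discovered instead of waiting for the whole tree.
    while (auto p = queue.pop())
    {
        // hand *p to the hashing pipeline...
    }

    producer.join();
    return 0;
}
```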

oveand commented 2 years ago

Experiments show that approximately 16,000 files can be opened on the Hitachi NAS storing our files, and 512 concurrently open files should not be an issue for most relatively modern systems. The 16,000 seems to be a global limit, though, and running multiple OpenHashTab instances in parallel reduces the number of files that can be handled.

I understand that handling this would be a radical change to the current file handling design, and I'm grateful that you're taking this challenge into account for v4.0.

mmortal03 commented 2 months ago

Hm... my first guess is that something doesn't let us open/lock that many files (as OpenHashTab tries to open all files before attempting to read anything from any of them). This doesn't cause immediate resource problems on local files because Windows allows up to 16 million handles per process

Something that might be relevant to this: I just tried right-clicking and selecting Hashes on a previously generated corz hash file of an entire drive's contents (~177,000 files, 3.15 TB total size of the files on the drive).

OpenHashTab did not open up. Instead, the Explorer window went to "Not Responding". So I went into Resource Monitor's Disk tab and noticed a large number of MsMpEng.exe (Windows Security) entries, one for each file, seemingly being read through at lower than maximum disk speed.

It seems that, essentially, Windows Security was bottlenecking the whole process, probably doing a scan of every single file before OpenHashTab would even open? So, I went into Windows Security and temporarily disabled Real-time Protection and checked Resource Monitor again, which showed a flurry of higher read speeds, and then OpenHashTab finally popped up and automatically started scanning all the files itself.

Maybe OpenHashTab should at least pop up first, before creating all these file handles, and ask the user what they want to do, so as not to lock up the Explorer process? I don't know the programmatic solution for avoiding Windows Security scanning all the files, but this must at least be something that other hash-scanning programs have figured out how to deal with.
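
A minimal sketch of that idea, assuming a plain Win32 window and a worker thread (hypothetical names, and not how the shell extension is actually structured), just to show the "show the UI first, do the slow work later" shape:

```cpp
// Sketch: show a window immediately, then do the expensive enumeration and
// handle-opening on a worker thread and notify the window when it finishes,
// so the host process never appears to hang.
#include <windows.h>
#include <thread>

constexpr UINT WM_APP_FILES_READY = WM_APP + 1; // hypothetical message

LRESULT CALLBACK WndProc(HWND hwnd, UINT msg, WPARAM wp, LPARAM lp)
{
    switch (msg)
    {
    case WM_APP_FILES_READY:
        // Enumeration / pre-opening finished; hashing could start here.
        SetWindowTextW(hwnd, L"Files ready, hashing...");
        return 0;
    case WM_DESTROY:
        PostQuitMessage(0);
        return 0;
    }
    return DefWindowProcW(hwnd, msg, wp, lp);
}

int WINAPI wWinMain(HINSTANCE inst, HINSTANCE, PWSTR, int show)
{
    WNDCLASSW wc{};
    wc.lpfnWndProc = WndProc;
    wc.hInstance = inst;
    wc.lpszClassName = L"HashWindowSketch";
    RegisterClassW(&wc);

    // 1) The window appears right away, before any file is touched.
    HWND hwnd = CreateWindowW(L"HashWindowSketch", L"Preparing file list...",
                              WS_OVERLAPPEDWINDOW, CW_USEDEFAULT, CW_USEDEFAULT,
                              400, 150, nullptr, nullptr, inst, nullptr);
    ShowWindow(hwnd, show);

    // 2) The slow part (traversal, opening handles, waiting out the AV scan)
    //    runs on a worker thread and only signals the UI when it is done.
    std::thread worker([hwnd] {
        // ... enumerate folders and open files here ...
        PostMessageW(hwnd, WM_APP_FILES_READY, 0, 0);
    });

    MSG msg;
    while (GetMessageW(&msg, nullptr, 0, 0) > 0)
    {
        TranslateMessage(&msg);
        DispatchMessageW(&msg);
    }
    worker.join();
    return 0;
}
```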