I am proposing the addition of a very simple helper to return the hash of a file.
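A rough sketch of the proposed shape, reconstructed from the feedback below (the filename parameter, sha256 default, 4096-byte chunks, and hex flag are assumptions about the original patch, not a final API):

```python
import hashlib

# Reconstructed sketch of the proposed helper; not the final API.
def file_digest(filename, algorithm="sha256", chunk_size=4096, hex=False):
    h = hashlib.new(algorithm)
    with open(filename, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest() if hex else h.digest()
```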
Hey Tarek, long time no see!
- The _sha256 module is optional; it can be disabled and is not available in some distributions.
- I also don't like sha256 as the default. It's slow, even slower than sha512. Any default also makes it harder to upgrade to a better, more secure default in the future.
- Like hmac.new(), file_digest() should accept PEP 452-compatible arguments and a hash name as the digestmod argument, not just a callable.
- A filename argument prevents users from passing in file-like objects like BytesIO.
- A 4096-byte chunk size is very conservative. The call overhead for read() and update() may dominate the performance of the function.
- The hex argument feels weird.
In a perfect world, the hash and hmac objects should get an "update_file" method. The OpenSSL-based hashes could even release the GIL and utilize OpenSSL's BIO layer to avoid any Python overhead.
Hey Christian, I hope things are going well for you! Thanks for all the valuable feedback; I'll rework the patch accordingly.
Tarek,
Are you still working on this? Would you like me to take over?
Aur
@Aur, go for it. I started to implement it and got lost in the details of each backend...
OK, I'll give it a go.
The PR contains a draft implementation; I would appreciate some review before I implement the same interface on all the builtin hashes as well as the OpenSSL hashes.
The rationale behind from_raw_file() and the special treatment of non-buffered IO is that there is no read_buffer() API or other clean way to say "I want to read just what's currently in the buffer, so that from now on I could read directly from the file descriptor without harm".

If you want to read from a buffered file object, sure, just call from_file(). If you want to ensure you'll get full performance benefits, call from_raw_file(). If you pass an eligible file object to from_file(), you'll get the benefits anyway, because why not.
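To make the buffered vs. non-buffered distinction concrete, here is a small sketch (from_file()/from_raw_file() are the draft's names; their exact semantics and "data.bin" are assumptions for illustration):

```python
import io

buffered = open("data.bin", "rb")           # io.BufferedReader
raw = open("data.bin", "rb", buffering=0)   # io.FileIO, an io.RawIOBase
assert not isinstance(buffered, io.RawIOBase)
assert isinstance(raw, io.RawIOBase)
# from_file() accepts either object; from_raw_file() would accept only
# `raw`, guaranteeing the fast path (draft semantics; illustration only).
buffered.close()
raw.close()
```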
Forgot an important warning: this is the first time I've written C code against the Python API, and I didn't thoroughly read the guide (or at all, to be honest). I think I did a good job, but please suspect my code of noob errors.
I'm especially not confident that it's OK to not do any special handling of signals. Can read() return 0 if it was interrupted by a signal? This will stop the hash calculation midway and behave as if it succeeded. Sounds suspiciously like something we don't want. Also, I probably should support signals because such a long operation is something the user definitely might want to interrupt?
May I have some guidance, please? Would it be enough to copy the code from _Py_Read() in fileutils.c and add an outer loop, so we can do many reads with the GIL released and still call PyErr_CheckSignals() when needed with the GIL held?
Added an attempt to handle signals. I don't think it's working, because when I press Ctrl+C while hashing a long file, it only raises KeyboardInterrupt after waiting the amount of time it usually takes the C code to return, but maybe that's not a good test?
Before we continue hacking on an implementation, let's discuss some API design.
- Multiple functions or multiple dispatch (name, fileobj) are confusing to users. Let's keep it simple and only implement one function that operates on file objects.
- The function should work with most binary file-like objects, including open(..., "rb"), BytesIO, and SocketIO. mmap.mmap() is not file-like enough: an anonymous mapping doesn't provide a fileno, and the mmap object has no readinto().
- The function should accept either a digest name, a digest constructor, or a callable that returns a digest object. The latter makes it possible to reuse file_digest() with MAC constructs like HMAC (see the sketch after this list).
- Don't do any I/O in C unless you are prepared to enter a world of pain and suffering. It's hard to get it right across platforms. For example, your C code does not work for SocketIO on Windows.
- If we decide to implement an accelerator in C, then we don't have to bother with our fallback copies like _sha256module.c. They are slow and only used when OpenSSL is not available.
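This is the design that eventually landed as hashlib.file_digest() in 3.11 (GH-31930, linked below); assuming the documented signature file_digest(fileobj, digest), usage looks like:

```python
import hashlib
import hmac

# Digest by name:
with open("data.bin", "rb") as f:
    d = hashlib.file_digest(f, "sha256")
print(d.hexdigest())

# Reuse with a MAC construct by passing a callable:
with open("data.bin", "rb") as f:
    mac = hashlib.file_digest(f, lambda: hmac.new(b"key", digestmod="sha256"))
print(mac.hexdigest())
```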
I don't think HMAC of a file is a common enough use case to support, but I have absolutely no problem conceding this point; the cost of supporting it is very low.
I/O in C is a world of pain in general. In the specific case of io.RawIOBase objects (non-buffered binary files), to my understanding it's not that terrible (am I right? Does my I/O code work as-is?). To my understanding, providing a fast path just for this case, one that calculates the hash without taking the GIL for every chunk, would be very nice to have for many use cases.

Now, we could just be happy with file_digest() having an if for isinstance(io.RawIOBase) that chooses a fast code path silently. But since non-buffered binary files are so hard to tell apart from other types of file-like objects, as a user of this code I would like to have a way to say "I want the fast path; please raise if I accidentally passed the wrong thing and got the regular path". We could have file_digest('sha256', open(path, 'rb', buffering=0), ensure_fast_io=True), but I think for this use case raw_file_digest('sha256', open(path, 'rb', buffering=0)) is cleaner.

In all other cases you just call file_digest(), probably get the Python I/O and not the C I/O, and are still happy to have that loop written for you by someone who knows what they're doing.
For the same reason, I think the fast path should only support hash names and not constructors/functions/etc., which would complicate it because new-object-can-be-accessed-without-GIL wouldn't necessarily apply.
Does this make sense?
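For reference, a minimal pure-Python sketch of the kind of loop being discussed (signature and names are illustrative, not the final API):

```python
import hashlib

def file_digest_py(algorithm, fileobj, chunk_size=2**18):
    # Plain-Python path; a real implementation could first check for
    # io.RawIOBase and hand off to a C fast path that releases the GIL.
    h = hashlib.new(algorithm)
    buf = bytearray(chunk_size)
    view = memoryview(buf)
    while n := fileobj.readinto(buf):
        h.update(view[:n])
    return h
```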
New changeset 4f97d64c831c94660ceb01f34d51fa236ad968b0 by Christian Heimes in branch 'main': bpo-45150: Add hashlib.file_digest() for efficient file hashing (GH-31930) https://github.com/python/cpython/commit/4f97d64c831c94660ceb01f34d51fa236ad968b0
New changeset e03db6d5be7cf2e6b7b55284985c404de98a9420 by Christian Heimes in branch 'main': bpo-45150: Fix testing under FIPS mode (GH-32046) https://github.com/python/cpython/commit/e03db6d5be7cf2e6b7b55284985c404de98a9420
@tiran Can you make a PR adding file_digest() to the 3.11 What's New, please?
Hey folks.
I know this is closed and perhaps I should simply file a new request... but would you consider extending the interface of that function to (efficiently) calculate a file's hash for multiple algorithms (i.e. without reading it once for every algorithm)?
One could perhaps do so by accepting some array for digest and returning a dict where the algorithm name is the key and the hash value is the value.
Or perhaps something smarter ^^
Cheers, Chris.
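For reference, the request can be approximated in user code today with a single pass over the file (multi_digest is an illustrative name, not a hashlib API):

```python
import hashlib

def multi_digest(fileobj, names, chunk_size=2**18):
    # Read the file once and feed every hash object from the same chunk.
    hashes = {name: hashlib.new(name) for name in names}
    while chunk := fileobj.read(chunk_size):
        for h in hashes.values():
            h.update(chunk)
    return {name: h.hexdigest() for name, h in hashes.items()}

# with open("data.bin", "rb") as f:
#     print(multi_digest(f, ["sha256", "sha512"]))
```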
Please file a new feature request issue here for that.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = 'https://github.com/tiran'
closed_at = None
created_at =
labels = ['type-feature', 'library', '3.11']
title = 'Add a file_digest() function in hashlib'
updated_at =
user = 'https://github.com/tarekziade'
```
bugs.python.org fields:
```python
activity =
actor = 'christian.heimes'
assignee = 'christian.heimes'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'tarek'
dependencies = []
files = []
hgrepos = []
issue_num = 45150
keywords = ['patch']
message_count = 14.0
messages = ['401457', '401461', '401462', '415138', '415139', '415141', '415320', '415321', '415324', '415326', '415336', '415356', '415754', '415793']
nosy_count = 6.0
nosy_names = ['gregory.p.smith', 'christian.heimes', 'tarek', 'python-dev', 'Aur.Saraf', 'miss-islington']
pr_nums = ['28252', '31928', '31930', '32046']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue45150'
versions = ['Python 3.11']
```