python / cpython

Add a file_digest() function in hashlib #89313

Closed: tarekziade closed this issue 2 years ago

tarekziade commented 3 years ago
BPO 45150
Nosy @gpshead, @tiran, @tarekziade, @SonOfLilit, @miss-islington
PRs
  • python/cpython#28252
  • python/cpython#31928
  • python/cpython#31930
  • python/cpython#32046
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    ```python
    assignee = 'https://github.com/tiran'
    closed_at = None
    created_at =
    labels = ['type-feature', 'library', '3.11']
    title = 'Add a file_digest() function in hashlib'
    updated_at =
    user = 'https://github.com/tarekziade'
    ```

    bugs.python.org fields:

    ```python
    activity =
    actor = 'christian.heimes'
    assignee = 'christian.heimes'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'tarek'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 45150
    keywords = ['patch']
    message_count = 14.0
    messages = ['401457', '401461', '401462', '415138', '415139', '415141', '415320', '415321', '415324', '415326', '415336', '415356', '415754', '415793']
    nosy_count = 6.0
    nosy_names = ['gregory.p.smith', 'christian.heimes', 'tarek', 'python-dev', 'Aur.Saraf', 'miss-islington']
    pr_nums = ['28252', '31928', '31930', '32046']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue45150'
    versions = ['Python 3.11']
    ```

    tarekziade commented 3 years ago

    I am proposing the addition of a very simple helper to return the hash of a file.
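    For illustration, a minimal pure-Python sketch of the kind of helper being proposed (the name file_hash, the default algorithm, and the chunk size are placeholders, not the proposed API):

    ```python
    import hashlib

    def file_hash(path, algorithm="sha256", chunk_size=2**16):
        """Return the hex digest of the file at *path* (illustrative sketch)."""
        h = hashlib.new(algorithm)
        with open(path, "rb") as f:
            # Read in fixed-size chunks so arbitrarily large files never
            # have to fit in memory at once.
            while chunk := f.read(chunk_size):
                h.update(chunk)
        return h.hexdigest()
    ```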

    tiran commented 3 years ago

    Hey Tarek, long time no see!

    In a perfect world, the hash and hmac objects should get an "update_file" method. The OpenSSL-based hashes could even release the GIL and utilize OpenSSL's BIO layer to avoid any Python overhead.
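    For context, here is the explicit loop such an update_file() method would replace, shown for an HMAC (a sketch; the key and filename are placeholders, and no update_file() method actually exists):

    ```python
    import hashlib
    import hmac

    mac = hmac.new(b"secret-key", digestmod=hashlib.sha256)
    with open("data.bin", "rb") as f:
        # What a hypothetical mac.update_file(f) would do internally,
        # spelled out with the existing update() API.
        while chunk := f.read(2**16):
            mac.update(chunk)
    print(mac.hexdigest())
    ```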

    tarekziade commented 3 years ago

    Hey Christian, I hope things are going well for you! Thanks for all the valuable feedback; I'll rework the patch accordingly.

    SonOfLilit (Aur Saraf) commented 2 years ago

    Tarek,

    Are you still working on this? Would you like me to take over?

    Aur

    tarekziade commented 2 years ago

    @Aur, go for it. I started to implement it and got lost in the details of each backend.

    SonOfLilit commented 2 years ago

    OK, I'll give it a go.

    SonOfLilit commented 2 years ago

    The PR contains a draft implementation; I would appreciate some review before I implement the same interface for all the built-in hashes as well as the OpenSSL ones.

    SonOfLilit commented 2 years ago

    The rationale behind from_raw_file() and the special treatment of non-buffered I/O is that there is no read_buffer() API or other clean way to say "I want to read just what's currently in the buffer, so that from then on I can read directly from the file descriptor without harm".

    If you want to read from a buffered file object, sure, just call from_file(). If you want to be certain you get the full performance benefit, call from_raw_file(). If you pass an eligible file object to from_file(), you get the benefit anyway, because why not.
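    To make the buffered/raw distinction concrete (data.bin is a placeholder; from_file() and from_raw_file() are the draft PR's names, not a shipped API):

    ```python
    import io

    with open("data.bin", "rb") as buffered, \
         open("data.bin", "rb", buffering=0) as raw:
        # The draft's from_file() would accept either object, taking the
        # fast path only when it safely can; from_raw_file() would accept
        # only the raw one and raise otherwise.
        print(isinstance(buffered, io.RawIOBase))  # False (BufferedReader)
        print(isinstance(raw, io.RawIOBase))       # True  (FileIO)
    ```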

    SonOfLilit commented 2 years ago

    Forgot an important warning: this is the first time I've written C code against the Python API, and I didn't thoroughly read the guide (or at all, to be honest). I think I did a good job, but please suspect my code of noob errors.

    I'm especially not confident that it's OK to do no special handling of signals. Can read() return 0 if it was interrupted by a signal? That would stop the hash calculation midway and behave as if it had succeeded, which sounds suspiciously like something we don't want. Also, I probably should support signals, since such a long operation is something the user might well want to interrupt.

    May I have some guidance, please? Would it be enough to copy the code from _Py_Read() in fileutils.c and add an outer loop, so we can do many reads with the GIL released and still call PyErr_CheckSignals() when needed with the GIL held?

    SonOfLilit commented 2 years ago

    Added an attempt to handle signals. I don't think it's working: when I press Ctrl+C while hashing a long file, KeyboardInterrupt is only raised after roughly the time it usually takes the C code to return. But maybe that's not a good test?

    tiran commented 2 years ago

    Before we continue hacking on an implementation, let's discuss some API design.

    SonOfLilit commented 2 years ago

    I don't think HMAC of a file is a common enough use case to support, but I have absolutely no problem conceding this point; the cost of supporting it is very low.

    I/O in C is a world of pain in general. In the specific case of io.RawIOBase objects (non-buffered binary files), to my understanding it's not that terrible (am I right? Does my I/O code work as-is?). Providing a fast path just for this case, one that calculates the hash without taking the GIL for every chunk, would be very nice to have for many use cases.

    Now, we could just be happy with file_digest() having an isinstance(io.RawIOBase) check that chooses a fast code path silently. But since non-buffered binary files are so hard to tell apart from other kinds of file-like objects, as a user of this code I would like a way to say "I want the fast path; please raise if I accidentally passed the wrong thing and got the regular path". We could have file_digest('sha256', open(path, 'rb', buffering=0), ensure_fast_io=True), but I think raw_file_digest('sha256', open(path, 'rb', buffering=0)) is cleaner for this use case.

    In all other cases you just call file_digest(), probably get the Python I/O rather than the C I/O, and are still happy to have that loop written for you by someone who knows what they're doing.

    For the same reason, I think the fast path should only support hash names and not constructors, functions, etc., which would complicate it because the guarantee that a newly created object can be accessed without the GIL wouldn't necessarily hold.

    Does this make sense?
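    A pure-Python sketch of the ensure_fast_io variant weighed above (hypothetical; this is not the merged API, and the real fast path would live in C with the GIL released):

    ```python
    import hashlib
    import io

    def file_digest(algorithm, fileobj, *, ensure_fast_io=False):
        """Sketch of the dispatch discussed above (hypothetical API)."""
        if ensure_fast_io and not isinstance(fileobj, io.RawIOBase):
            raise TypeError("fast path requires an unbuffered raw binary file")
        h = hashlib.new(algorithm)
        # Both kinds of file object take the same loop here; in C, the
        # raw case could release the GIL around each read/update pair.
        while chunk := fileobj.read(2**18):
            h.update(chunk)
        return h
    ```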

    miss-islington commented 2 years ago

    New changeset 4f97d64c831c94660ceb01f34d51fa236ad968b0 by Christian Heimes in branch 'main': bpo-45150: Add hashlib.file_digest() for efficient file hashing (GH-31930) https://github.com/python/cpython/commit/4f97d64c831c94660ceb01f34d51fa236ad968b0
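    For reference, the merged helper is used like this on Python 3.11 and later (data.bin is a placeholder; the file must be opened in binary mode):

    ```python
    import hashlib

    with open("data.bin", "rb") as f:
        digest = hashlib.file_digest(f, "sha256")
    print(digest.hexdigest())
    ```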

    tiran commented 2 years ago

    New changeset e03db6d5be7cf2e6b7b55284985c404de98a9420 by Christian Heimes in branch 'main': bpo-45150: Fix testing under FIPS mode (GH-32046) https://github.com/python/cpython/commit/e03db6d5be7cf2e6b7b55284985c404de98a9420

    pablogsal commented 2 years ago

    @tiran Can you make a PR adding file_digest() to the 3.11 What's New, please?

    calestyo commented 1 year ago

    Hey folks.

    I know this is closed, and perhaps I should simply file a new request... but would you consider extending the interface of that function to calculate a file's hash for multiple algorithms efficiently, i.e. without reading the file once for every algorithm?

    One could perhaps do so by accepting an array of digests and returning a dict where the algorithm name is the key and the hash value the value.
    Or perhaps something smarter ^^

    Cheers, Chris.
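    A one-pass version of what Chris is asking for can already be sketched with plain hashlib (multi_digest is a hypothetical helper, not a hashlib API; data.bin is a placeholder):

    ```python
    import hashlib

    def multi_digest(fileobj, algorithms, chunk_size=2**18):
        """Hash a binary file once while feeding several algorithms."""
        hashes = {name: hashlib.new(name) for name in algorithms}
        while chunk := fileobj.read(chunk_size):
            for h in hashes.values():
                h.update(chunk)
        return {name: h.hexdigest() for name, h in hashes.items()}

    with open("data.bin", "rb") as f:
        print(multi_digest(f, ["sha256", "md5"]))
    ```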

    gpshead commented 1 year ago

    Please file a new feature request issue here for that.