huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0
133.75k stars 26.73k forks source link

Hashlib usage is underspecified #27034

Closed DueViktor closed 11 months ago

DueViktor commented 12 months ago

Feature request

From python 3.9 hashlib introduced the usedforsecurity argument:

Changed in version 3.9: All hashlib constructors take a keyword-only argument usedforsecurity with default value True. A false value allows the use of insecure and blocked hashing algorithms in restricted environments. False indicates that the hashing algorithm is not used in a security context, e.g. as a non-cryptographic one-way compression function.

transformers use hashing in many cases where the purpose is indeed not for security purposes. This should be specifed in the code.

Motivation

Transformers use MD5 from hashlib, which is not a secure algorithm, but are not specifying that it is for other purposes than security. This is causing issues for organisations following certain security standard. FIPS compliance could be an example.

Your contribution

I will attach a PR specifying the usage of hashlib algorithms. Since usedforsecurity is only specified from 3.9+ and transformers support 3.6+, I'll add a functionality to detect python version and change kwargs based on that.

ArthurZucker commented 12 months ago

Hey! Thanks for reporting I'll see if this relevant for us 🤗

DueViktor commented 12 months ago

Great @ArthurZucker. The pull request have passed all tests already and are ready to merge. No behaviour is changed.

My guess is that pretty much all federal systems in the world would have this issue.

Federal Information Processing Standards (FIPS) 140-2 is a mandatory standard for the protection of sensitive or valuable data within Federal systems. - https://www.wolfssl.com/license/fips/

Wauplin commented 11 months ago

Hey @DueViktor! Coming back to you about this request. We've finally specified hashlib usage in huggingface_hub, transformers, datasets and diffusers. Everything's merged now so I'll close this issue. Thanks again for the heads up!

DueViktor commented 11 months ago

Hi @Wauplin! Thanks so much for the update and for addressing the hashlib usage across all those libraries. Appreciate your team's prompt action on this matter. Keep up the fantastic work!