lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.77k stars 404 forks source link

Replace md5 hashing with sha256. #1251

Closed wilrodriguez closed 1 year ago

wilrodriguez commented 1 year ago

Fixes #1250

This change is intended to replace the md5 hashing currently utilized to validate grammar caches with sha256 hashing, which will help mitigate situations where md5 is being blocklisted, such as by FIPS-mode

erezsh commented 1 year ago

I have no problem with this change, but we have to make sure that in attempting to fix this one edge case, we're not creating new breaking changes for other edge cases. (i.e. other python versions or operating systems)

wilrodriguez commented 1 year ago

I was mildly concerned about that as well. That said, sha256 is supported as part of the core library in all the versions of python that lark currently supports and this hashing library is only used within lark. The only downside I'd see is that it would need to recache files after the upgrade to the new version, but that's hardly much of a concern. This method should only be used internally. It's certainly not like it provides any novel functionality that a user wouldn't be able to more easily get from just calling hashlib directly. So, I'm not sure it's really anything to worry about.

MegaIng commented 1 year ago

The recaching already happens since the lark version is part of what get's hashed.

wilrodriguez commented 1 year ago

Even better.

erezsh commented 1 year ago

@wilrodriguez Thanks for contributing!

Erotemic commented 1 year ago

Small comment on this. I'm +1 for anything modern over md5, and sha256 on modern machines is faster than md5. I'm not sure how much speed is a factor here, but there are hashers that are faster than sha256 - especially if we don't need them to be cryptographic. However some of these faster hashers are not FIPS validated (although blake3 probably should be):

image

Here is a small benchmark:

import ubelt as ub
import timerit

ti = timerit.Timerit(100_000, bestof=100, verbose=2)

hashers = ['md5', 'sha256', 'blake3', 'xxh64']

data = 'foobar' * 10_000

for hasher in hashers:
    for timer in ti.reset(f'{hasher=}'):
        with timer:
            ub.hash_data(data, hasher=hasher)

mean_rankings = ub.udict(ti.rankings['mean']).sorted_values()
print(ub.urepr(mean_rankings, nl=2, align=':'))

And the larger benchmark that produced the image is here: https://github.com/Erotemic/timerit/blob/dev/1.0.2/examples/benchmark_hashers.py

Likely not a big deal, but I thought I'd chime in.