Closed wilrodriguez closed 1 year ago
I have no problem with this change, but we have to make sure that in attempting to fix this one edge case, we're not creating new breaking changes for other edge cases. (i.e. other python versions or operating systems)
I was mildly concerned about that as well. That said, sha256 is supported as part of the core library in all the versions of python that lark currently supports and this hashing library is only used within lark. The only downside I'd see is that it would need to recache files after the upgrade to the new version, but that's hardly much of a concern. This method should only be used internally. It's certainly not like it provides any novel functionality that a user wouldn't be able to more easily get from just calling hashlib directly. So, I'm not sure it's really anything to worry about.
The recaching already happens since the lark version is part of what get's hashed.
Even better.
@wilrodriguez Thanks for contributing!
Small comment on this. I'm +1 for anything modern over md5, and sha256 on modern machines is faster than md5. I'm not sure how much speed is a factor here, but there are hashers that are faster than sha256 - especially if we don't need them to be cryptographic. However some of these faster hashers are not FIPS validated (although blake3 probably should be):
Here is a small benchmark:
import ubelt as ub
import timerit
ti = timerit.Timerit(100_000, bestof=100, verbose=2)
hashers = ['md5', 'sha256', 'blake3', 'xxh64']
data = 'foobar' * 10_000
for hasher in hashers:
for timer in ti.reset(f'{hasher=}'):
with timer:
ub.hash_data(data, hasher=hasher)
mean_rankings = ub.udict(ti.rankings['mean']).sorted_values()
print(ub.urepr(mean_rankings, nl=2, align=':'))
And the larger benchmark that produced the image is here: https://github.com/Erotemic/timerit/blob/dev/1.0.2/examples/benchmark_hashers.py
Likely not a big deal, but I thought I'd chime in.
Fixes #1250
This change is intended to replace the md5 hashing currently utilized to validate grammar caches with sha256 hashing, which will help mitigate situations where md5 is being blocklisted, such as by FIPS-mode