podaac / data-subscriber

Subscribe and bulk download collections of data at PO.DAAC
Apache License 2.0

Fix support for SHA-256 and SHA-512 algorithms #83

Closed wveit closed 2 years ago

wveit commented 2 years ago

Addresses #82

Added code to remove hyphens from the CMR hash algorithm name (it was already being lowercased) before looking up the hash function in Python's hashlib.

As mentioned in issue #82, the possible algorithm types in CMR are ["Adler-32", "BSD checksum", "Fletcher-32", "Fletcher-64", "MD5", "POSIX", "SHA-1", "SHA-2", "SHA-256", "SHA-384", "SHA-512", "SM3", "SYSV"].

With this change, the following will be supported: MD5, SHA-1, SHA-256, SHA-384, SHA-512.
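For readers following along, the normalization described above can be sketched roughly as follows. This is an illustration only, not the code from this PR: the helper names `normalize_algorithm_name` and `compute_checksum` are hypothetical, and restricting lookups to `hashlib.algorithms_guaranteed` is an assumption made here to keep behavior the same across platforms.

```python
import hashlib


def normalize_algorithm_name(cmr_algorithm: str) -> str:
    """Turn a CMR algorithm name such as "SHA-256" into a hashlib name ("sha256")."""
    return cmr_algorithm.lower().replace("-", "")


def compute_checksum(data: bytes, cmr_algorithm: str) -> str:
    """Return the hex digest of data using the hashlib equivalent of a CMR algorithm.

    Hypothetical helper: raises ValueError for names hashlib does not guarantee,
    such as "SHA-2" (a family name, not a concrete algorithm) or "Adler-32".
    """
    name = normalize_algorithm_name(cmr_algorithm)
    if name not in hashlib.algorithms_guaranteed:
        raise ValueError(f"Unsupported checksum algorithm: {cmr_algorithm!r}")
    return hashlib.new(name, data).hexdigest()
```

With this approach "MD5", "SHA-1", "SHA-256", "SHA-384" and "SHA-512" all resolve to a hashlib constructor, while the remaining CMR names fail loudly instead of silently skipping verification.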

wveit commented 2 years ago

About the other algorithms that aren't supported yet:

I'm not sure which algorithm to map SHA-2 to. It looks like this is the name for the family of algorithms (including SHA-256, SHA-384, SHA-512, etc), but does not have a defined bit length itself.

SM3 is listed under hashlib.algorithms_available on my machine, but not under hashlib.algorithms_guaranteed. So it would work on my computer, but I'm not sure which platforms it would fail on. I left out a test for it so that the test suite wouldn't be flaky.
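The distinction between the two hashlib sets can be checked at runtime. A minimal sketch, assuming a hypothetical helper named `classify_algorithm` that takes an already-normalized hashlib name:

```python
import hashlib


def classify_algorithm(name: str) -> str:
    """Report how portable a hashlib algorithm name is (hypothetical helper)."""
    if name in hashlib.algorithms_guaranteed:
        # Present in every Python build on every platform.
        return "guaranteed"
    if name in hashlib.algorithms_available:
        # Provided by the OpenSSL build on this machine only.
        return "platform-dependent"
    return "unavailable"


for candidate in ("sha256", "sm3"):
    print(candidate, "->", classify_algorithm(candidate))
```

On any machine "sha256" comes back as guaranteed; "sm3" comes back as either platform-dependent or unavailable depending on the local OpenSSL build, which is why a test that depends on it could be flaky.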

Adler-32 support is available in Python's built-in zlib module, although it requires some additional [small changes]. Given that there are other algorithms that I have questions about, I held off on making that change for now.
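As a rough idea of what the Adler-32 path might look like: `zlib.adler32` returns a plain integer rather than a hashlib-style object, so it needs its own small wrapper. The function name `adler32_hexdigest` and the 8-digit lowercase hex output are assumptions here; the format CMR actually stores for Adler-32 checksums would need to be confirmed.

```python
import zlib
from typing import Iterable


def adler32_hexdigest(chunks: Iterable[bytes]) -> str:
    """Compute Adler-32 over a stream of byte chunks (hypothetical helper).

    Returns an 8-character lowercase hex string; that output format is an
    assumption, not something confirmed against CMR metadata.
    """
    value = 1  # Adler-32 starts from 1, which is also zlib's default
    for chunk in chunks:
        # Passing the running value lets the checksum be built incrementally.
        value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


print(adler32_hexdigest([b"Wiki", b"pedia"]))  # -> 11e60398
```

Feeding the data in chunks keeps memory use flat for large granule files, the same way a hashlib object is updated piece by piece.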

For the Fletcher, POSIX, BSD and SYSV algorithms, I didn't see anything in Python's built-in libraries, and I'm still looking at solutions for these.

I figured it's best to go ahead and submit the changes for the most important algorithms (SHA-256 and SHA-512) now, and follow up on the other algorithms in a separate issue/PR. Let me know what you think.

mike-gangl commented 2 years ago

i agree that fixing the one we know is a problem is most important. i'm not sure how to get a good view into what checksums we actually use- i doubt it's anything other than MD5 and SHA...