nsidc / earthaccess

Python Library for NASA Earthdata APIs
https://earthaccess.readthedocs.io/
MIT License

Checksum verification of downloaded granules #455

Open rupesh2 opened 7 months ago

rupesh2 commented 7 months ago

Checksums are available as a part of UMM-G records for some datasets (e.g., Daymet provides SHA-256; GHRSST provides MD5).

earthaccess.download() should verify the integrity of the downloaded granules against the checksum hashes, where available. This work will add such validations for downloaded files.
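
The core check is straightforward with the standard library. A minimal sketch (the function name and signature are illustrative, not existing earthaccess API); the `algorithm` string follows UMM-G naming such as "SHA-256" or "MD5":

```python
import hashlib

def file_checksum(path: str, algorithm: str = "SHA-256",
                  chunk_size: int = 1024 * 1024) -> str:
    """Compute the hex digest of a file, reading it in chunks."""
    # hashlib uses lowercase names without hyphens ("sha256", "md5")
    h = hashlib.new(algorithm.replace("-", "").lower())
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Reading in fixed-size chunks keeps memory flat even for multi-gigabyte granules.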

mfisher87 commented 7 months ago

When checksums are available, what do we think the behavior should be?

In my weak opinion, by default, earthaccess should verify and print a warning if verification fails. We can provide arguments to disable verification or to upgrade those warnings to errors.

rupesh2 commented 7 months ago

Thanks @mfisher87 ! Printing a warning when the verification fails would be a good start.

rupesh2 commented 5 months ago

Some examples of DAACs using checksums:

| Organization | Algorithm | Example |
| --- | --- | --- |
| ORNLDAAC | SHA-256 | https://cmr.earthdata.nasa.gov/search/concepts/G2625060389-ORNL_CLOUD.umm_json<br>`"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"e15a43eb6914bf594833ff40d9c849adf08acdfa13b67e343308cceb5901b462","Algorithm":"SHA-256"}}}` |
| PODAAC | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2857127720-POCLOUD.umm_json<br>`"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"210130f6e8f61d7976f5405f9e925f98","Algorithm":"MD5"}}}` |
| ASF* | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895561045-ASF.umm_json<br>`"AdditionalAttributes":[{"Name":"MD5SUM","Values":["764bf6dbe12eaf73f8e316924b409ded"]}]` |
| LAADS | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895709317-LAADS.umm_json<br>`"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"27504ce476722f8c6f55551d9dc59455","Algorithm":"MD5"}}}` |
| LARC | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2829371222-LARC_CLOUD.umm_json<br>`"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"92df1ae596bf28bd0b966145ba76599b","Algorithm":"MD5"}}}` |
| LANCE | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895741728-LANCEMODIS.umm_json<br>`"DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"7987f2d56f15da34101dedc671715704","Algorithm":"MD5"}}}` |
| GHRC | - | - |
| GESDISC | - | - |
| LPDAAC | - | - |
| NSIDC | - | - |
| CDDIS | - | - |
| SEDAC | - | - |

*Checksums are not available for all datasets
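
Pulling the digest out of a record could look roughly like this. It's a sketch based on the field paths in the snippets above (`granule_checksum` is a hypothetical helper, not earthaccess API); note that full UMM-G records carry `ArchiveAndDistributionInformation` as a list of file entries, and ASF-style records publish the digest under `AdditionalAttributes` instead:

```python
from typing import Optional, Tuple

def granule_checksum(umm: dict) -> Optional[Tuple[str, str]]:
    """Return (value, algorithm) from a UMM-G record, or None if absent."""
    info = umm.get("DataGranule", {}).get("ArchiveAndDistributionInformation")
    # Tolerate both a single object (as in the simplified snippets above)
    # and the list form used by full UMM-G records.
    entries = info if isinstance(info, list) else [info] if info else []
    for entry in entries:
        checksum = entry.get("Checksum")
        if checksum:
            return checksum["Value"], checksum["Algorithm"]
    # ASF-style records: digest stored as an additional attribute
    for attr in umm.get("AdditionalAttributes", []):
        if attr.get("Name") == "MD5SUM" and attr.get("Values"):
            return attr["Values"][0], "MD5"
    return None
```

Returning `None` when nothing is found gives the caller a clean hook for the "warn when checksums aren't available" behavior discussed below.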

mfisher87 commented 5 months ago

Since checksums are not available for all datasets, I'm thinking we should print a warning when we try to verify and checksums aren't available? What do you think?

rupesh2 commented 5 months ago

I was thinking:

mfisher87 commented 5 months ago

Good thinking, I like that too. We can also always make that behavior more configurable with feature flags going forward if users want to be able to customize it.

mfisher87 commented 3 months ago

cc @Sherwin-14

Sherwin-14 commented 3 months ago

@mfisher87 I am thinking of implementing the solution discussed by @rupesh2. Do you have any specific opinions regarding this or should I proceed forward?

mfisher87 commented 3 months ago

I think Rupesh's design sounds like a great path forward.

Next steps after that should probably be flags on earthaccess.download():

- `disable_checksum_validation` to skip verification entirely
- `raise_on_checksum_validation_failure` to upgrade the warning to an error

(I'm sure someone can come up with better argument names than me :laughing:)

Perhaps these should be tackled as separate issues? No strong feelings here :)

chuckwondo commented 3 months ago

I'd prefer something a bit more unified, perhaps a single parameter that is not boolean, particularly since `disable_checksum_validation=True` means that `raise_on_checksum_validation_failure` has no meaning.

Perhaps a single parameter named `validation` that is an Enum: `WARN` (default), `FAIL`, or `SKIP`. I'm not sure I'm totally loving that, but that's the sort of direction I'd suggest.
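
A sketch of that direction; the enum and handler names are hypothetical, not an agreed API:

```python
import warnings
from enum import Enum

class ChecksumValidation(Enum):
    """Modes for a single, non-boolean `validation` parameter."""
    WARN = "warn"  # default: report mismatches but keep going
    FAIL = "fail"  # raise on mismatch
    SKIP = "skip"  # don't validate at all

def handle_mismatch(message: str, mode: ChecksumValidation) -> None:
    """React to a checksum mismatch according to the selected mode."""
    if mode is ChecksumValidation.SKIP:
        return
    if mode is ChecksumValidation.FAIL:
        raise ValueError(message)
    warnings.warn(message)  # WARN
```

One enum removes the contradictory combination of flags: `SKIP` simply makes the failure mode irrelevant by construction.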

mfisher87 commented 3 months ago

That's a great point, I like your idea :)

briannapagan commented 3 weeks ago

Super interesting thread - it looks like we never published checksums in CMR at GES DISC, but we have records of them internally and validate using checksums during our migration to the cloud - will bring it up internally.

mfisher87 commented 3 weeks ago

Amazing! Thanks, Brianna :)

hailiangzhang commented 3 weeks ago

> Super interesting thread - it looks like we never published checksums in CMR at GES DISC, but we have records of them internally and validate using checksums during our migration to the cloud - will bring it up internally.

Brianna is correct. At GES DISC, for the granules we previously migrated to the cloud from on-prem, the checksum was not published to CMR due to certain reasons. However, for the further ingest from another cloud data provider, the checksum will be published to CMR if provided by the provider.

Now if earthaccess.download() can validate granules based on the checksum if any, we should consider adding checksums to our already migrated granules so that earthaccess users can benefit from this feature when getting our data. This would require some effort though, and we will have some internal discussion on that...

mfisher87 commented 3 weeks ago

> This would require some effort though, and we will have some internal discussion on that...

Thanks so much for having this conversation! :bow:

simonff commented 3 weeks ago

Hi folks - I'm the TL of the Google Earth Engine Data team.

We mirror a lot of datasets. The most common problem we run into is missing assets/files. The second, much rarer one, is truncated files. Truncated files are easily fixed by making sure the jobs do atomic copies, but catching missing files can be hard when the dataset listings are massive and continuously updated.

To be honest, in 15 years I have never seen a download problem that would be caught only by verifying checksums. They have their value - e.g., we use them to verify data conversion - but we checksum the actual data bytes, not just files, because tiny changes in file formats would change file-level checksums.

So I'd recommend weighing the effort of maintaining and verifying checksums against the usefulness of such checks. I would be much more interested in more robust file listings (e.g., CMR is not easy to scan for huge datasets).