Open rupesh2 opened 7 months ago
When checksums are available, what do we think the behavior should be?
In my weak opinion, by default, earthaccess should verify and print a warning if verification fails. We can provide arguments to disable verification or to upgrade those warnings to errors.
Thanks @mfisher87 ! Printing a warning when the verification fails would be a good start.
Some examples of DAACs using checksums:

Organization | Algorithm | Examples |
---|---|---|
ORNLDAAC | SHA-256 | https://cmr.earthdata.nasa.gov/search/concepts/G2625060389-ORNL_CLOUD.umm_json "DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"e15a43eb6914bf594833ff40d9c849adf08acdfa13b67e343308cceb5901b462","Algorithm":"SHA-256"}}} |
PODAAC | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2857127720-POCLOUD.umm_json "DataGranule":{"ArchiveAndDistributionInformation":{"Checksum":{"Value":"210130f6e8f61d7976f5405f9e925f98","Algorithm":"MD5"}}} |
ASF* | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895561045-ASF.umm_json "AdditionalAttributes":[{"Name":"MD5SUM","Values":["764bf6dbe12eaf73f8e316924b409ded"]}] |
LAADS | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895709317-LAADS.umm_json "DataGranule":{"ArchiveAndDistributionInformation":{{"Value":"27504ce476722f8c6f55551d9dc59455","Algorithm":"MD5"}}} |
LARC | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2829371222-LARC_CLOUD.umm_json "DataGranule":{"ArchiveAndDistributionInformation":{{"Value":"92df1ae596bf28bd0b966145ba76599b","Algorithm":"MD5"}}} |
LANCE | MD5 | https://cmr.earthdata.nasa.gov/search/concepts/G2895741728-LANCEMODIS.umm_json "DataGranule":{"ArchiveAndDistributionInformation":{{"Value":"7987f2d56f15da34101dedc671715704","Algorithm":"MD5"}}} |
GHRC | - | |
GESDISC | - | |
LPDAAC | - | |
NSIDC | - | |
CDDIS | - | |
SEDAC | - | |
*Checksums are not available for all datasets
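To make the two record layouts in the table concrete, here is a minimal sketch of pulling a checksum out of a UMM-G record that has already been parsed into a Python dict. The function name `find_checksum` and the algorithm mapping are hypothetical and not part of earthaccess. It also assumes `ArchiveAndDistributionInformation` may appear either as a single object (as abbreviated in the table) or as a list of per-file objects.

```python
from typing import Any, Dict, Optional, Tuple

# Map the algorithm labels seen in the table to hashlib algorithm names.
_HASHLIB_NAMES = {"SHA-256": "sha256", "MD5": "md5"}


def find_checksum(umm: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (hashlib_algorithm, hex_value) for a UMM-G record, or None."""
    info = umm.get("DataGranule", {}).get("ArchiveAndDistributionInformation", [])
    # Assumption: accept both a single object and a list of objects.
    entries = [info] if isinstance(info, dict) else info
    for entry in entries:
        checksum = entry.get("Checksum")
        if checksum and checksum.get("Algorithm") in _HASHLIB_NAMES:
            return _HASHLIB_NAMES[checksum["Algorithm"]], checksum["Value"].lower()

    # ASF-style fallback: MD5 stored as an additional attribute.
    for attr in umm.get("AdditionalAttributes", []):
        if attr.get("Name") == "MD5SUM" and attr.get("Values"):
            return "md5", attr["Values"][0].lower()

    # No checksum published for this granule.
    return None
```

Returning `None` keeps the "no checksum available" case distinct from a failed verification, so the caller can decide whether to warn about it.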
Since checksums are not available for all datasets, I'm thinking we should print a warning when we try to verify and checksums aren't available? What do you think?
I was thinking:
Good thinking, I like that too. We can also always make that behavior more configurable with feature flags going forward if users want to be able to customize it
cc @Sherwin-14
@mfisher87 I am thinking of implementing the solution discussed by @rupesh2. Do you have any specific opinions regarding this or should I proceed forward?
I think Rupesh's design sounds like a great path forward.
Next steps after that should probably be flags on `earthaccess.download()`:

- `disable_checksum_validation: bool = False`: Opt out of the validation
- `raise_on_checksum_validation_failure: bool = False`: Opt in to raising an exception (instead of logging a warning) when validation fails, to enable programmatic handling by the user

(I'm sure someone can come up with better argument names than me :laughing:)
Perhaps these should be tackled as separate issues? No strong feelings here :)
I'd prefer something a bit more unified, perhaps by using a single parameter that is not boolean, particularly since `disable_checksum_validation=True` means that `raise_on_checksum_validation_failure` has no meaning.
Perhaps a single parameter named `validation` that is an Enum: `WARN` (default), `FAIL`, or `SKIP`. I'm not sure I'm totally loving that, but that's the sort of direction I'd suggest.
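A rough sketch of how a single enum-valued parameter could replace the two booleans is below. None of these names exist in earthaccess: `Validation`, `ChecksumMismatchError`, and `apply_policy` are hypothetical and used only for illustration.

```python
import logging
from enum import Enum

logger = logging.getLogger(__name__)


class Validation(Enum):
    """Hypothetical policy for handling checksum validation."""

    WARN = "warn"  # log a warning on mismatch (proposed default)
    FAIL = "fail"  # raise an exception on mismatch
    SKIP = "skip"  # do not validate at all


class ChecksumMismatchError(Exception):
    """Hypothetical error raised when validation fails under FAIL."""


def apply_policy(
    path: str,
    expected: str,
    actual: str,
    validation: Validation = Validation.WARN,
) -> bool:
    """Return True if the file is accepted, False if a mismatch was only logged."""
    if validation is Validation.SKIP:
        return True  # validation disabled, nothing is compared
    if expected.lower() == actual.lower():
        return True
    message = f"Checksum mismatch for {path}: expected {expected}, got {actual}"
    if validation is Validation.FAIL:
        raise ChecksumMismatchError(message)
    logger.warning(message)
    return False
```

With one parameter there is no combination of arguments that is meaningless, which is the problem the two-boolean design runs into.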
That's a great point, I like your idea :)
Super interesting thread - it looks like we never published checksums in CMR at GES DISC, but we have records of them internally and validate our migration to the cloud using checksums - will bring it up internally.
Amazing! Thanks, Brianna :)
Brianna is correct. At GES DISC, for the granules we previously migrated to the cloud from on-prem, the checksum was not published to CMR for certain reasons. However, for further ingests from other cloud data providers, the checksum will be published to CMR if the provider supplies it.
Now, if `earthaccess.download()` can validate granules based on the checksum, if any, we should consider adding checksums to our already migrated granules so that earthaccess users can benefit from this feature when getting our data. This would require some effort though, and we will have some internal discussion on that...
Thanks so much for having this conversation! :bow:
Hi folks - I'm the TL of the Google Earth Engine Data team.
We mirror a lot of datasets. The most common problem we run into is missing assets/files. The second, much more rare one, is truncated files. Truncated files are easily fixed by making sure the jobs are doing atomic copies, but catching missing files can be hard when the dataset listings are massive and continuously updated.
To be honest, I never saw in 15 years a download problem that would be caught only by verifying checksums. They have their value - e.g., we use them to verify data conversion, but we checksum actual data bytes, not just files, because tiny changes in file formats would make checksums change.
So I'd recommend weighing the effort of maintaining and verifying checksums against the usefulness of such checks. I would be much more interested in more robust file listings (e.g., CMR is not easy to scan for huge datasets).
Checksums are available as a part of UMM-G records for some datasets (e.g., Daymet provides `SHA-256`; GHRSST provides `MD5`). `earthaccess.download()` should verify the integrity of the downloaded granules against the checksum hashes, where available. This work will add such validations for downloaded files.
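The verification step itself could look something like the sketch below, which uses only the standard library `hashlib`. The helper names `file_digest` and `matches_checksum` are hypothetical, and this is not how earthaccess implements (or will implement) the check. The file is read in chunks so that large granules do not have to fit in memory.

```python
import hashlib
from pathlib import Path
from typing import Union


def file_digest(
    path: Union[str, Path], algorithm: str, chunk_size: int = 1024 * 1024
) -> str:
    """Return the hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.new(algorithm)  # e.g. "md5" or "sha256"
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(
    path: Union[str, Path], algorithm: str, expected_hex: str
) -> bool:
    """Compare a file's digest with the expected value, ignoring case."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches_checksum("granule.nc", "sha256", expected)` would return `False` for a truncated download, where `granule.nc` and `expected` stand in for a downloaded file and the value published in its UMM-G record.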