opencadc / caom2

Common Archive Observation Model
GNU Affero General Public License v3.0

RFE: add checksum to Artifact #15

Closed by pdowler 7 years ago

pdowler commented 7 years ago

Proposal from the HST Archive Coordination Meeting: add a checksum (probably MD5) to the artifact so that sharing CAOM observation metadata gives partners enough information to figure out which data files they need to download. For new artifacts, the partner won't have the file (denoted by the Artifact.uri) at all. For changed files, they will detect the change via the checksum. For changed artifact metadata, the partner would examine the artifact because of the timestamp change but can determine from the checksum that they do not need to download the data again.
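The three cases above can be sketched as a small comparison between a partner's local holdings and the harvested metadata. This is only an illustration: `plan_downloads`, the dictionary layout, and the example URIs and checksum strings are hypothetical and are not part of any CAOM library.

```python
def plan_downloads(remote, local):
    """Decide which artifacts a partner needs to fetch.

    remote, local: dicts mapping an artifact URI to its checksum string
    (hypothetical layout, assumed for this sketch only).
    Returns a list of (uri, reason) tuples in the order of `remote`.
    """
    to_download = []
    for uri, checksum in remote.items():
        if uri not in local:
            # New artifact: the partner does not have the file at all.
            to_download.append((uri, "new"))
        elif local[uri] != checksum:
            # Same URI, different checksum: the file content changed.
            to_download.append((uri, "changed"))
        # Same checksum: only metadata changed, so no download is needed.
    return to_download
```

An artifact whose timestamp changed but whose checksum is identical never appears in the returned list, which is the saving the proposal is after.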

timj commented 7 years ago

Checksums are really important (how else can you check that the file on your disk has not become corrupt?). I think MD5 is probably fine given that you aren't worrying about a bad actor replacing a FITS file with a different file that has the same checksum in order to deliberately compromise a data set. Comparing the shasum and md5 commands, I see that the SHA512 calculation can take twice as long as MD5, but SHA1 is almost as fast. You should consider using SHA1 though, to reduce the chance of randomly getting a duplicate checksum (although that's probably irrelevant if you are never treating the checksum independently of the file it's associated with).
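Relative speeds depend heavily on the CPU and the library build, so it is worth measuring locally. A rough sketch using Python's standard `hashlib`, rather than the shasum and md5 commands mentioned above; `time_digests` is a hypothetical helper and the payload size is arbitrary.

```python
import hashlib
import time


def time_digests(data, algorithms=("md5", "sha1", "sha512"), repeats=3):
    """Return {algorithm: best wall-clock seconds} for hashing `data`."""
    timings = {}
    for name in algorithms:
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            hashlib.new(name, data).digest()
            best = min(best, time.perf_counter() - start)
        timings[name] = best
    return timings


if __name__ == "__main__":
    payload = bytes(8 * 1024 * 1024)  # 8 MiB of zeros, in memory
    for name, seconds in time_digests(payload).items():
        print(f"{name:>7}: {seconds * 1000:.1f} ms")
```

This hashes an in-memory buffer, so it measures CPU cost only; for real files, disk I/O may dominate.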

wlandry commented 7 years ago

The checksum should either be a non-cryptographic but fast checksum (e.g. CRC32) or a real cryptographic checksum. MD5 is neither. I vote for SHA512. It has been around a long time and has proven resistant to cryptanalysis (unlike SHA-1). SHA512 is faster than SHA256 for files larger than 16 bytes.

http://crypto.stackexchange.com/questions/26336/sha512-faster-than-sha256

mdolensk commented 7 years ago

Good call. The SKA processing and storage system design foresees a checksumming capability for data integrity purposes without mandating a particular one at this stage. MD5 is OK from that PoV. NGAS currently uses CRC32 and is part of numerous large-volume mission archives such as ALMA, ASKAP, ESO, FAST, MWA and VLA. Support for CRC32C was added recently. The advantages of CRC32C over CRC32 are hardware support and fewer hash collisions. Hardware support is becoming increasingly important. On an XFS-based NGAS disk storage array the checksumming is no longer I/O bound, but can saturate the CPU at a few GB/s throughput. There's obviously the issue of incompatible checksums for existing data.
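For reference, a streaming CRC32 over chunks of data takes only a few lines with Python's standard `zlib` module. This is a generic sketch and not how NGAS implements it; `crc32_stream` is a hypothetical helper. CRC32C is not in the Python standard library, so it is not shown.

```python
import zlib


def crc32_stream(chunks):
    """Compute the CRC32 of a sequence of byte chunks without joining them."""
    value = 0
    for chunk in chunks:
        # zlib.crc32 accepts the running value as its second argument.
        value = zlib.crc32(chunk, value)
    return value & 0xFFFFFFFF  # unsigned 32-bit result


if __name__ == "__main__":
    print(f"{crc32_stream([b'1234', b'56789']):08x}")
```

Feeding the data in chunks gives the same result as a single call on the concatenated bytes, so large files can be checksummed while they are read.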

pdowler commented 7 years ago

Since different implementers of CAOM may want to use different checksum algorithms, I will add this so that both the algorithm and value can be specified. A URI where the scheme specifies the algorithm and the value is the string serialisation would be compact, flexible, and a single field, e.g.:

md5:b1724b797bb59c299a3b78fb3eb0e7d6
sha256:aef0785c69b45710e284a6574eafd9680c41f1c9acaeb6bff93f7eab0c1d14f6
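A sketch of how a consumer might produce and check such a value, assuming the scheme is a name that Python's `hashlib.new` accepts and the value is a hex digest. The helpers `make_checksum_uri` and `verify_checksum` are hypothetical and not part of the CAOM code base.

```python
import hashlib


def make_checksum_uri(data: bytes, algorithm: str = "md5") -> str:
    """Serialise a digest as '<algorithm>:<hex digest>'."""
    return f"{algorithm}:{hashlib.new(algorithm, data).hexdigest()}"


def verify_checksum(data: bytes, checksum_uri: str) -> bool:
    """Check `data` against a '<algorithm>:<hex digest>' string."""
    algorithm, sep, expected = checksum_uri.partition(":")
    if not sep or not algorithm or not expected:
        raise ValueError(f"not a checksum URI: {checksum_uri!r}")
    # hashlib.new raises ValueError for an unsupported algorithm name.
    actual = hashlib.new(algorithm, data).hexdigest()
    return actual == expected.lower()
```

Because the algorithm travels with the value, a partner can verify a file without knowing in advance which algorithm the publishing archive chose.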

See https://github.com/opencadc/caom2/issues/22

To be included in CAOM-2.3

pdowler commented 7 years ago

Added in CAOM-2.3