spdx / spdx-3-model

The model for the information captured in SPDX version 3 standard.
https://spdx.dev/use/specifications/

Change the name of Hash and hashAlgorithm back to Checksum and checksumAlgorithm #90

Closed: goneall closed this issue 1 year ago

goneall commented 1 year ago

From reviewing various articles on the differences between Hash and Checksum, it would appear that Checksum is intended for verification purposes and is more appropriate to our usage here in SPDX. For example, see the Baeldung article on Hash vs. Checksum.

Not to mention, SPDX 1.0 and later versions use Checksum and checksumAlgorithm. Changing to Hash and hashAlgorithm would introduce unnecessary migration efforts.

seabass-labrax commented 1 year ago

I believe that the Baeldung article is incorrect in its claim that checksums are "integrity-based hashing functions". There are plenty of integrity checking mechanisms that are not hash functions: Hamming codes, Cyclic Redundancy Checks, Reed-Solomon codes, etc. None of these are in SPDX, where the 'checksum algorithm' choices are exclusively hash functions; in particular, they are all cryptographic hash functions.

There are a number of SPDX stakeholders who wish to use SPDX for security purposes, ensuring that supply chain artefacts haven't been tampered with. Non-cryptographic checksums and hash functions can't detect malicious and deliberate modification, making them inadequate for this use-case.
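To make the tamper-detection point concrete, here is a minimal Python sketch. The `additive_checksum` function and the example messages are hypothetical, invented only for illustration, and are not part of SPDX. The sketch shows a deliberate edit that a toy byte-sum checksum cannot see but that SHA-256 does.

```python
import hashlib


def additive_checksum(data: bytes) -> int:
    """Toy 16-bit checksum: sum of all bytes modulo 2**16 (illustrative only)."""
    return sum(data) % 65536


original = b"pay 100 to alice"
# A deliberate edit that only reorders bytes, so the byte sum is unchanged.
tampered = b"pay 001 to alice"

same_checksum = additive_checksum(original) == additive_checksum(tampered)
same_sha256 = hashlib.sha256(original).digest() == hashlib.sha256(tampered).digest()

print(same_checksum)  # True: the toy checksum misses the change
print(same_sha256)    # False: the cryptographic hash detects it
```

Real non-cryptographic checksums such as CRC-32 are harder to fool by accident than this toy, but they are likewise not designed to resist an attacker who chooses the modification.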

All in all, I would say that calling these values in SPDX 3+ 'hashes' is more accurate, and in my opinion isn't going to introduce much in additional migration difficulty considering the structural changes.

davaya commented 1 year ago

The definition of a hash function is that it converts an arbitrary-length input to a fixed-length output, which is also the definition of a checksum algorithm. I don't have a strong preference which term is used, but I object to the implication that a function that is a hash is not a checksum, or vice versa.
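As a rough illustration of that shared property, the Python sketch below feeds inputs of different lengths to CRC-32 (via `zlib.crc32`) and to SHA-256 (via `hashlib.sha256`). The helper names `crc32_hex` and `sha256_hex` are made up for this example. In both cases the output width stays fixed regardless of input size.

```python
import hashlib
import zlib


def crc32_hex(data: bytes) -> str:
    """CRC-32 of data as 8 hex digits (always 32 bits)."""
    return f"{zlib.crc32(data):08x}"


def sha256_hex(data: bytes) -> str:
    """SHA-256 of data as 64 hex digits (always 256 bits)."""
    return hashlib.sha256(data).hexdigest()


inputs = [b"", b"a", b"SPDX" * 10_000]

for data in inputs:
    # Input length varies; both output lengths stay constant.
    print(len(data), len(crc32_hex(data)), len(sha256_hex(data)))
```

What separates the two is not the shape of the mapping but the security properties: only SHA-256 is designed to make collisions infeasible to find on purpose.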

"Cryptographic checksum" and "cryptographic hash" refer to the identical thing: functions of data that allow data integrity to be verified. It is incorrect to use "checksum" to mean something that is not resistant to cryptographic attack and "hash" to mean something that is. "Hash" includes non-cryptographic hashes in languages both old (Perl) and new (Rust, where the hash value is short (64 bit, subject to birthday collisions) and distinct from cryptographic hashes by being keyed).

Changing the name doesn't introduce significant complexity, as Sebastian says, but if it ain't broke in 2.3, there's no need to fix it. Non-cryptographic hashes and checksums are one thing, cryptographic hashes and checksums are another, and as technology advances, cryptographic-quality algorithms will become non-cryptographic with no change to the algorithm itself.

goneall commented 1 year ago

William and Gary will sync up on the topic.

goneall commented 1 year ago

After sync'ing with William, I now agree with the name change based on a common understanding of the difference in purpose of a checksum and hash.

I updated the migration guide with a justification for the change.

If anyone feels like we should keep the checksum name, please open a new issue.