Hash/checksum function - Githubissues

Arithmeticus commented 12 months ago

I propose a new function for the core XPath functions, here called fn:hash() for the sake of discussion. The goal is to give XPath users access to CRC, checksum, and cryptographic hash functions.

Rationale

Simple checksums functions, such as those from the Fletcher family, are relatively easy to write in a host language such as XSLT. More complex ones are far more challenging to write, and may incur serious performance penalties. For example, from the TAN function library, see the MD5 checksum/hash functions. (Yes, one day I thought it would be fun to try to implement the MD5 algorithm in XSLT.) Most programming languages in which an implementation is written have access to cryptographic libraries that are highly performative.

Hash functions are widely used, and certainly important in XML-based workflows, whether as filenames, database fields, etc.

The closest comparable existing functions is generate-id(), but this was designed as an identifier for nodes.

In short, I believe there is a significant need that outstrips current functionality.

In the draft below, I have adopted only the MD-5 algorithm as a core requirement, to catalyze discussion. I have assumed the user wants the string form of the output, not the raw bits. I have not tried to flesh out prose that would warn users away from security complacency.

For discussion, here is a list of relevant algorithms. I look forward to community feedback.

`fn:hash` Draft Specification

Summary

This function takes as input a string or octet sequence and returns a string representation of the results from a specified hash, checksum, or cyclic redundancy check function.

Signature

fn:hash(
   $value as union(xs:string, xs:hexBinary, 
                   xs:base64Binary)?   := fn:string(.),
   $algorithm as xs:string             := "md5"
) as xs:string?

Properties

The zero-argument form of this function is deterministic, context-dependent, and focus-dependent.

The one- and two-argument form of this function is deterministic, context-independent, and focus-independent.

Rules

If the zero-argument version of the function is used, the result is the same as calling the one-argument version, with $value set to fn:string(.).

If the one-argument version of the function is used, the result is the same as calling the two-argument version, with $algorithm set to "md5".

The effective value of $algorithm is the value of the expression fn:lower-case(fn:replace($function, '\W+', '')).

If $value is the empty sequence, or a string of zero length, the function returns the empty sequence.

If $value is an instance of xs:string it is cast to xs:hexBinary on the basis of UTF-8 encoding. If $value is an instance of xs:base64Binary it is cast to xs:hexBinary.

The function returns an xs:string representation of the bytes returned by passing the xs:hexBinary value of $value as an octet sequence through the specified hash or checksum function.

Conforming implementations MUST support md5 and the associated MD5 Message-Digest algorithm defined by RFC 6151 (update to RFC 1321). They MAY support other checksum and hash functions with implementation-defined semantics.

Error Conditions

A dynamic error is raised [err:XXXXXXX] if the effective value of $algorithm is not one of the values supported by the implementation.

Notes

The MD5 algorithm is normally not used for cryptographic purposes. [More cautionary prose about not assuming that something can be trusted as secure.]

Examples

Expression	Result
fn:hash("abc")	900150983cd24fb0d6963f7d28e17f72
fn:hash("ABC")	902fbdd2b1df0c4f70b4a5d23525e932

ChristianGruen commented 12 months ago

Makes sense (similar: https://docs.basex.org/wiki/Hashing_Module#hash:hash).

ndw commented 12 months ago

I think this makes sense too. I wonder if we should require at least MD5, SHA-1, and SHA-256 algorithms, or if implementations are likely to simply provide everything the relevant library provides so it isn't really necessary.

ChristianGruen commented 12 months ago

:+1: for at least MD5, SHA-1, and SHA-256.

dnovatchev commented 12 months ago

Great proposal!

About the hashing algorithm, we need to warn people that many of the well-known hashing algorithms have been comromised as per Wikipedia:

"A successful, practical attack broke MD5 used within certificates for Transport Layer Security in 2008.[27]

Many cryptographic hashes are based on the Merkle–Damgård construction. All cryptographic hashes that directly use the full output of a Merkle–Damgård construction are vulnerable to length extension attacks. This makes the MD5, SHA-1, RIPEMD-160, Whirlpool, and the SHA-256 / SHA-512 hash algorithms all vulnerable to this specific attack."

"SHA-3, BLAKE2, BLAKE3, and the truncated SHA-2 variants are not vulnerable to this type of attack."

Last year I had to choose and decided to use an available implementation of the Blake3 algorithm -- written in Rust and very efficient.

Also, we need to support "salting" in addition to hashing. For examples of hashes of compromised passwords, see CrackStation. Anyone who has seen this will understand that support for "salting" is critical to avoid such exploits.

dnovatchev commented 12 months ago

👍 for at least MD5, SHA-1, and SHA-256.

All of these have been compromised.

According to Wikipedia:

"SHA-3, BLAKE2, BLAKE3, and the truncated SHA-2 variants are not vulnerable to this type (length extension) of attack."

ndw commented 12 months ago

This isn't about cryptographic security. Don't roll your own crypto.¹

MD5, SHA-1, and SHA-256 are useful because they're often used in APIs and protocols for checksums or as quick "is this (exceptionally likely to be) the same file" checks. All those git hashes are SHA-1 checksums of the files in the commit, for example.

¹ That said, I'm all for implementations supporting longer, more robust algorithms.

michaelhkay commented 12 months ago

Do we need something similar for XML, as distinct from strings? One way to achieve that is to do a canonical serialization and then checksum the serialization. But we don't currently have a function to do a canonical serialization. And if we're going to do that, we should probably implement XML Signature.

ndw commented 12 months ago

I have tried on several occasions to get my head around XML Signature. I have never come close to succeeding and I've never actually encountered it "in the wild." (But I accept that may say more about the kind of wilderness I tromp around in than anything else.)

So, no, I don't think XML Signature is necessary in QT4. I could probably be persuaded to have a serialization option to do canonicalization, especially if it was optional.

I've implemented MD5/SHA hash algorithms as extension functions several times and, AFAICT, the hash strings or binary blobs use cases cover almost everything.

Arithmeticus commented 12 months ago

I'm strongly in favor of exposing a function for canonical serialization, which would greatly assist in regression testing.

dnovatchev commented 12 months ago

Signature

fn:hash(
   $value as union(xs:string, xs:hexBinary, 
                   xs:base64Binary)?   := fn:string(.),
   $algorithm as xs:string             := "md5"
) as xs:string?

As proposed earlier in this thread, let us integrate support for a [salt-function](https://en.wikipedia.org/wiki/Salt(cryptography))_. The default could be fn:identity. or a function that performs an XOR operation on the result-hash with a string of all-one-bits. BTW, it seems we still don't have an XOR operator in the language?

Signature

fn:hash(
   $value as union(xs:string, xs:hexBinary, 
                   xs:base64Binary)?   := fn:string(.),
   $algorithm as xs:string             := "md5".
   $salt-with as function($value as union(xs:string, xs:hexBinary) as xs:string := fn:identity
) as xs:string?

Arithmeticus commented 11 months ago

@dnovatchev I appreciate the value of a salt, but there are a number of different salting strategies, and accommodating all major strategies could be challenging. It also seems to me that ideally, people should work on the salt before approaching the hash function, at which point it's a simple concatenation operation, e.g., hash($base || $salt).

Do you see any compelling arguments for making salting an intrinsic part of a proposed fn:hash function? I see a clear one against: All good functions do only one thing (and do it well). A hash function with a salting option does not do only one thing. Therefore a hash function with a salting option is not a good function.

dnovatchev commented 11 months ago

@dnovatchev I appreciate the value of a salt, but there are a number of different salting strategies, and accommodating all major strategies could be challenging. It also seems to me that ideally, people should work on the salt before approaching the hash function, at which point it's a simple concatenation operation, e.g., hash($base || $salt).

Do you see any compelling arguments for making salting an intrinsic part of a proposed fn:hash function? I see a clear one against: All good functions do only one thing (and do it well). A hash function with a salting option does not do only one thing. Therefore a hash function with a salting option is not a good function.

@Arithmeticus We can disagree on this.

If this were true, then any function composition or, in fact, any higher-order function, and the well-known Strategy and Visitor patterns would be "not good".

Making the user separately apply salt to a string and then hashing the result is not only more "inconvenient" for the user, but even worse, it leaves the possibility that the user may forget or neglect to use salt at all.

By providing as parameter to the hashing function a separate salting function, we remind the user that salt is really important and should not be missed.

Here the reasoning that "a function that does more than one thing is not a good function" does not hold ground, because hashing and salting a closely related and hashing without using salt is proven to open doors to attackers. On the contrary, doing closely related things in one place is highly recommended as this leads to high cohesion, and this is in fact the goal of the Single Responsibility Principle (the S in SOLID).

Moreover, the hashing function doesn't at all engage in implementing the salting - this is passed as an argument-function and is implemented entirely within this function. Thus, the Single Responsibility Principle is not violated at all - the code implementing the salting is not within the hashing function.

Also, it is both powerful and secure that the hashing function doesn't know at all the internals/semantics of the salting function that is passed to it. Thus different salting strategies may be passed on each call, and we are not revealing any particular salting implementation.

michaelhkay commented 11 months ago

I'm not sure we have a consensus on the intended usage of this function. For simple checksumming applications, salting seems to be unnecessary.

dnovatchev commented 11 months ago

I'm not sure we have a consensus on the intended usage of this function. For simple checksumming applications, salting seems to be unnecessary.

@michaelhkay Completely agree with this statement.

However we are rarely successful in predicting the exact future usage of a particular feature by the unknown/hypothetical user. A good example is the decision of the then W3C WG on XPath and XQuery not to provide an XPath construct that facilitates the creation of recursive function items: https://www.w3.org/Bugs/Public/show_bug.cgi?id=8662#c9

It was thought at that time that XPath must be intentionally restricted not to be too-powerful, and what better way than to make recursion "impossible" ...

Could they anticipate that this so called "obstacle" would actually become the challenging driving force for actually showing that such recursion, and even mutual recursion, is indeed possible?

Given the fact that the proposed hashing function has an $algorithm argument, whose default value is "md5", but its stated goal is:

to give XPath users access to CRC, checksum, and cryptographic hash functions.

means that the potential use of this function for cryptographic purposes has not been missed by the author, and even is stated as one of the desired results.

Also, if we take into account the fact that the proposed default value for the salting function is fn:identity - a de-facto no-op, this means that the main (default) case of using this function would indeed remain in simple checksumming applications, thus only a knowledgeable user can intentionally use this function for cryptographic-related goals.

I personally find this the perfect tradeoff between simplicity and power and wish we had more standard functions that are easy to use in the common case for the general/casual user, but at the same time provide enough powerful features (not turned on by default) to the powerful user in specific cases.

ChristianGruen commented 11 months ago

In our implementation, we have both a Hash Module and a Cryptographic Module (based on the EXPath spec). Both serve different purposes. If we add support for more cryptographic features, I think it would be a bad idea to include those in the basic hash functions.

benibela commented 11 months ago

In our implementation, we have both a Hash Module and a Cryptographic Module (based on the EXPath spec). Both serve different purposes. If we add support for more cryptographic features, I think it would be a bad idea to include those in the basic hash functions.

that makes more sense as modules than as a standard function

Then the user can just call a function like hash:sha-1, and know what he gets. Rather than setting an algorithm argument that might be supported or not

ChristianGruen commented 8 months ago

Closed (added to the spec).

qt4cg / qtspecs

Hash/checksum function #779

Rationale

`fn:hash` Draft Specification

Summary

Signature

Properties

Rules

Error Conditions

Notes

Examples

Signature

Signature

qt4cg / qtspecs

Hash/checksum function #779

Rationale

fn:hash Draft Specification

Summary

Signature

Properties

Rules

Error Conditions

Notes

Examples

Signature

Signature

`fn:hash` Draft Specification