srl295 / srl-unicode-proposals

Unicode proposals

Algorithmic Agility #9

Open · indolering opened this issue 7 years ago

indolering commented 7 years ago

You appear to be hardcoding SHA-2. This is a really bad idea: SHA-2 might be broken at some point, and others may prefer a faster function such as BLAKE2. You should follow the standard procedure and parameterize the choice of the function itself. Support for an arbitrary number of functions may be appreciated as well.

keithw commented 7 years ago

Thanks for the suggestion! How would you propose that the encoder choose a hash function that it knows that the decoders can support? Would the Unicode committee specify a list of mandatory-to-implement hash functions for decoders, and the encoder would have a choice of them? Given a particular string of code points, how would the decoder know what hash function to try (to find a matching image in its local font or collection of pictures)?

I think if you look at systems that actually use hashes as a global source of identity in offline encoding (example: Git, Bitcoin), it is absolutely (and perhaps unfortunately) not the case that agility in this respect is the "standard procedure."

indolering commented 7 years ago

> How would you propose that the encoder choose a hash function that it knows that the decoders can support?

You can't do protocol negotiation, so the best you can do is upgrade the hash function as time goes by.

> Would the Unicode committee specify a list of mandatory-to-implement hash functions for decoders, and the encoder would have a choice of them?

Sure? SHA-2 and SHA-3 are good places to start; BLAKE2 and KangarooTwelve are the favorites for fast hashing. More than that and it becomes a PITA to implement.
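
As a sketch (not part of any proposal), a mandatory-to-implement set like that could be exposed as a small registry. The names here are assumptions following the multiformats naming convention, and KangarooTwelve is omitted because the Python stdlib has no binding for it:

```python
import hashlib

# Illustrative mandatory-to-implement set; names and selection are assumptions.
# KangarooTwelve is omitted since the Python stdlib has no binding for it.
MANDATORY = {
    "sha2-256":    hashlib.sha256,
    "sha3-256":    hashlib.sha3_256,
    "blake2b-256": lambda data=b"": hashlib.blake2b(data, digest_size=32),
}

def digest(alg: str, data: bytes) -> bytes:
    return MANDATORY[alg](data).digest()
```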

> Given a particular string of code points, how would the decoder know what hash function to try (to find a matching image in its local font or collection of pictures)?

You could use something like multihash to signal in-band. However, that is brand new and non-standard; I'm planning on digging into it later.
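
For concreteness, here is a minimal sketch of multihash-style framing in Python, assuming the published multicodec values for sha2-256 (0x12) and sha3-256 (0x16). A real implementation would need proper varint parsing for multi-byte codes:

```python
import hashlib

# Minimal multihash-style framing: <varint fn code><varint digest len><digest>.
# Codes are from the multicodec table: 0x12 = sha2-256, 0x16 = sha3-256.
# Single-byte "varints" suffice here because both values are < 0x80.
CODES = {0x12: hashlib.sha256, 0x16: hashlib.sha3_256}

def encode(code: int, data: bytes) -> bytes:
    digest = CODES[code](data).digest()
    return bytes([code, len(digest)]) + digest

def decode(mh: bytes) -> tuple[int, bytes]:
    code, length = mh[0], mh[1]
    digest = mh[2:2 + length]
    if len(digest) != length:
        raise ValueError("truncated multihash")
    return code, digest

def verify(mh: bytes, data: bytes) -> bool:
    code, digest = decode(mh)  # the hash choice is signaled in-band
    return CODES[code](data).digest() == digest
```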

> I think if you look at systems that actually use hashes as a global source of identity in offline encoding (example: Git, Bitcoin), it is absolutely (and perhaps unfortunately) not the case that agility in this respect is the "standard procedure."

Bitcoin uses redundant hashing, but yeah, it's a depressing state of affairs. Everyone thinks that the lack of attacks on SHA-2 means that it's probably safe, which is delusional. I'm working on a blog post arguing for agility (and redundancy if you can't manage that).
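
One reading of "redundancy" here, sketched under that assumption (note this is not Bitcoin's actual scheme, which applies SHA-256 twice rather than combining independent functions): identify content by the concatenation of digests from two unrelated hash families, so a forgery requires breaking both at once.

```python
import hashlib

# Redundant-hashing sketch (an assumption, not Bitcoin's actual scheme):
# concatenate digests from two unrelated families, so producing a
# collision requires breaking SHA-2 and SHA-3 simultaneously.
def redundant_hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest() + hashlib.sha3_256(data).digest()

def verify(identifier: bytes, data: bytes) -> bool:
    return identifier == redundant_hash(data)
```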

srl295 commented 6 years ago

multihash above seems like it could be a good path. It looks simple enough to implement. An implementation recommendation could specify a suggested hash.

The hashed contents are basically a matter of agreement between the sender and receiver anyway, and so a particular repository/delivery mechanism might support a certain set of algorithms, support hash upgrading in some form, etc.

The downside here is that it potentially means multiple byte sequences which produce the same image: a set of sequences that grows over time as more hash types are added/supported. A particular client may or may not be able to determine which sequences are identical, depending on which hashes it supports.

However, the nature of arbitrary images is that confusable images are easily produced if the image set is open-ended: some graphic payload that looks something like an r could be displayed. Arbitrary images are not necessarily optimized for searching and sorting, etc. They should absolutely be excluded from IDNA, usernames, passwords, and the like.

So, in summary, I think the ambiguous-sequence issue is exacerbated by having multiple hash algorithms, but it is not entirely caused by them. An implementation could simply allow only one hash for a particular entity, and then forever use only that hash and never upgrade it. If a future image were sent with an otherwise-colliding hash, the hash algorithm could be upgraded. A hypothetical sketch of that policy follows.
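
This sketch is my illustration of the "one hash per entity, upgrade only on collision" idea, not anything specified in the proposal. The CODES values are the multicodec codes for sha2-256 and sha3-256; the upgrade order is an assumption:

```python
import hashlib

# Hypothetical "one pinned hash, upgrade only on collision" policy.
CODES = {0x12: hashlib.sha256, 0x16: hashlib.sha3_256}  # multicodec values
UPGRADE = {0x12: 0x16}  # assumed next algorithm to try after a collision

class ImageRepository:
    def __init__(self, code: int = 0x12):
        self.code = code                      # the single pinned algorithm
        self.images: dict[bytes, bytes] = {}  # digest -> image payload

    def register(self, image: bytes) -> bytes:
        digest = CODES[self.code](image).digest()
        existing = self.images.get(digest)
        if existing is not None and existing != image:
            # An otherwise-colliding hash arrived for a different image:
            # upgrade the algorithm and re-key every stored image under it.
            self.code = UPGRADE[self.code]
            self.images = {CODES[self.code](img).digest(): img
                           for img in self.images.values()}
            digest = CODES[self.code](image).digest()
        self.images[digest] = image
        return bytes([self.code]) + digest    # multihash-style key
```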