ivilata / pymultihash

Python implementation of the multihash specification
19 stars 8 forks source link

Consider adding the identity hash function #6

Closed Stebalien closed 1 year ago

Stebalien commented 5 years ago

go-multihash now supports using the identity "hash" function and go-ipfs uses it to embed small blocks directly into CIDs (content identifiers).

Note: The identity function is special in that it doesn't support truncation (for obvious reasons). Note 2: The identity function is obviously not suitable as a one-way function.

ivilata commented 1 year ago

Hi @Stebalien, sorry for handling this so late, I've been busy with other stuff.

I started adding support for the identity hash when I found this issue:

>>> import multihash
>>> mh = multihash.digest(b'foo', 'identity')
>>> mh.verify(b'baz')
False
>>> mh.verify(b'foo')
True
>>> mh.verify(b'foobar')
True

The weird last result is because, as indicated in the package's docstring:

If you need a shorter multihash, you may truncate it while keeping the initial bytes of the raw hash digest. A byte string validates against a truncated multihash if its digest bytes match the initial bytes of the string's hash:

This behavior is implemented in Multihash.verify().

What I should probably do is changing verify() to just compare that the digest of the given data exactly matches that stored in the instance, which implies that truncated multihashes would be useless for verifying data (though still useful for parsing and producing shortened encoded representations).

However I'm worried that I made verify() behave like that because of some requirement in multihash specs or elsewhere. I haven't been able to find anything like that, but then I haven't found anything about truncation either, so I'd like to ask you first if it'd be ok to change the behavior, before breaking things.

Thanks! :vulcan_salute:

Stebalien commented 1 year ago

We've dealt with this problem in go-multihash by refusing to truncate in multihash.digest for the identity hash function.

ivilata commented 1 year ago

Thanks @Stebalien for the note! Still, even if I forbid truncating identity digests, the current behavior of pymultihash would cause the problem shown above. So I guess I must forbid truncation of the identity hash and compare the multihash literally (and not as a prefix) for verification (i.e. change this to return digest == self.digest).

Stebalien commented 1 year ago

Yes? Don't you verify by hashing the content?

ivilata commented 1 year ago

Yes, Multihash.verify(data) does compute the digest of data according to the function indicated by the multihash instance, then it compares it against the digest in the instance. But currently it just compares that the multihash digest is a prefix of the data digest, so that truncated multihashes can still be used for veifying data.

What I guess I should do instead is just a normal comparison of the multihash digest with the data digest, thus a truncated multihash would fail to validate the data which was used to create its original, full version. This seems to be what go-multihash does indeed (except for identity hashes, which it refuses to truncate as you mentioned).

Stebalien commented 1 year ago

Yes. Specifically, if you were using go-multihash here, you would:

  1. Get the length+codec of the multihash to be verified.
  2. Compute a new hash of the data with that length/codec.
  3. Compare.

In the case of identity hashes, step 2 would fail because the identity hash function would refuse to truncate the data.

ivilata commented 1 year ago

Thanks for the info! A few notes on the behavior of go-multihash for myself:

$ echo -n foo | go run multihash/main.go
QmRJzsvyCQyizr73Gmms8ZRtvNxmgqumxc2KUp71dfEmoj
$ echo -n foo | go run multihash/main.go -c QmRJzsvyCQyizr73Gmms8ZRtvNxmgqumxc2KUp71dfEmoj
OK checksums match (-q for no output)
$ echo -n bar | go run multihash/main.go -c QmRJzsvyCQyizr73Gmms8ZRtvNxmgqumxc2KUp71dfEmoj
error: computed checksum did not match
exit status 1
$ echo -n foo | go run multihash/main.go -l 128
kTLD3bg3qrtRoGHJ78fxaNAX
$ echo -n foo | go run multihash/main.go -c kTLD3bg3qrtRoGHJ78fxaNAX
error: computed checksum did not match
exit status 1
echo -n foo | go run multihash/main.go -c kTLD3bg3qrtRoGHJ78fxaNAX -l 128
OK checksums match (-q for no output)
$ echo -n foo | go run multihash/main.go -a identity
163NSr
$ echo -n foo | go run multihash/main.go -a identity -c 163NSr
OK checksums match (-q for no output)
$ echo -n foo | go run multihash/main.go -a identity -l 16
error: the length of the identity hash (0) must be equal to the length of the data (3)
exit status 1
$ echo -n foo | go run multihash/main.go -a identity -l 24
error: the length of the identity hash (0) must be equal to the length of the data (3)
exit status 1
ivilata commented 1 year ago

So I split the solution in two commits: one to have two different methods of verification (strict comparison of digests vs. prefix comparison) in commit f91f31da, which the user needs to explicitly choose, and 0ed71cca for the proper addition of the identity hash function (0x00 as per multicodec/table.csv). And identity multihashes can't neither be truncated nor verify by prefix.

See examples in the doctests for Multihash.{verify,verify_truncated,truncate}.

I hope that this does the job, please reopen otherwise. Thanks!