amstocker opened 9 years ago
multicodec, too. ipld as well.
On Thu, Nov 19, 2015 at 8:02 PM Andrew Stocker notifications@github.com wrote:
Before moving on to the rest of libp2p and py-ipfs, we should look at the required generic modules and determine which existing implementation we want to use, or if there isn't one, who wants to work on it. Here is what we need (see #21 (comment) https://github.com/ipfs/py-ipfs/issues/21#issuecomment-158019175 for more info):
- PeerID
- PeerInfo
- multihash
- multihashing
- multiaddr
  - specs: https://github.com/jbenet/multiaddr
  - go impl: https://github.com/jbenet/go-multiaddr
  - js impl: https://github.com/jbenet/js-multiaddr
  - python impl: https://github.com/amstocker/python-multiaddr
- multistream
I know we already have several multihash implementations https://github.com/ipfs/py-ipfs/issues/13, and I hacked together a multiaddr implementation https://github.com/amstocker/python-multiaddr that could use some love. If anyone has already worked on any of these or is interested in working on any of them, please comment here. Also, if anyone has any more links to references or specs for the generic modules, post them here.
@jbenet thanks
Regarding the multihash/multihashing split, it appears our discussion in #13 is now obsolete. We agreed on using a multihash implementation that followed the Python hashlib API, but the current favourite, JulienPalard's, is more of a multihashing implementation: it exposes a high level Python-hashlib-like interface to create multihash values and hexdigests, but doesn't implement the low level go-multihash/js-multihash interface.
The multihash/multihashing split also seems very useful if we are to share as much of the testing as possible, maybe through sharness or some other platform-independent system.
This is my suggestion:

- Multihashing: digest() and hexdigest(), but we could also implement utility functions base58btc_digest(), etc.
- Multihash: encode(), decode(), check(), etc. should be just that, module functions. At most the decoded multihash could be a namedtuple, for attribute access of "hashcode, name, length, digest", but the output of multihashing.hash('sha1', 'hello') should be a binary string; multihash would encode(), decode() and check() such a value.

If people agree to the above points, I could do this work myself. I'm already doing it in Elixir.
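For concreteness, a minimal sketch of the split suggested above could look like this. The module/function names follow the suggestion; everything else (the single-byte prefix, the tiny code table, the helper names) is illustrative only, not an existing API:

```python
import hashlib
from collections import namedtuple

# Decoded multihash as a namedtuple, as suggested above.
DecodedMultihash = namedtuple('DecodedMultihash', 'hashcode name length digest')

# Tiny excerpt of the multihash function table (hashlib names used for simplicity).
FUNC_CODES = {'sha1': 0x11, 'sha256': 0x12}
FUNC_NAMES = {code: name for name, code in FUNC_CODES.items()}

# --- "multihash": plain module functions, no hashing involved ---
def encode(hashcode, length, digest):
    return bytes([hashcode, length]) + digest[:length]

def decode(value):
    hashcode, length, digest = value[0], value[1], value[2:]
    if len(digest) != length:
        raise ValueError('digest length does not match the declared length')
    return DecodedMultihash(hashcode, FUNC_NAMES.get(hashcode), length, digest)

def check(value):
    try:
        decode(value)
        return True
    except (ValueError, IndexError):
        return False

# --- "multihashing": hashes the data, then packs the digest with the above ---
def multihash_hash(name, data):
    digest = hashlib.new(name, data).digest()
    return encode(FUNC_CODES[name], len(digest), digest)  # a plain binary string

assert decode(multihash_hash('sha1', b'hello')).name == 'sha1'
```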
@amstocker: I'll look into your multiaddr starter over the weekend. Maybe we can meet on IRC after the libp2p hangout, which I plan to lurk on.
The rationale for a drop-in substitution only works one way anyway: that is, a multihash should be able to be used anywhere that a hashlib digest is, but not in reverse, because a naked digest will never be a legal multihash. This means multihashing should implement digest() and hexdigest(), but we could also implement utility functions base58btc_digest(), etc.
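As a rough illustration of that one-way substitution (plain stdlib only; the b'\x11\x14' prefix is the multihash header for a 20-byte sha1 digest):

```python
import hashlib

def store(key_bytes, blob, db):
    """Any bytes key is accepted, so a multihash can go wherever a raw digest does."""
    db[key_bytes] = blob

raw = hashlib.sha1(b'foo').digest()   # 20 bytes, no prefix
mh = b'\x11\x14' + raw                # same digest, multihash-prefixed

db = {}
store(mh, b'foo', db)                 # fine: a multihash value is just bytes
store(raw, b'foo', db)                # also accepted as bytes...
# ...but the naked digest is not a legal multihash: it carries no
# (code, length) prefix, so a multihash decoder has to reject it.
```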
I can't completely agree: if we're diverging from hashlib by adding methods like base58_btc_digest, we're in fact completely dropping the drop-in replacement of hashlib, because another drop-in replacement of hashlib will probably not name the extended method the same way (base58_digest instead, say). They're no longer drop-in replaceable.
On the other hand, I don't think it's the responsibility of a hashlib to encode (in any base, like base58). I mean, one may want to use hashlib but not care about encoding in base58, and one may use a base58 encoder without needing hashlib.
I don't think you're limited to one "drop in replacement" interface. What I often do is write the interface I want, and then write another object that wraps it to provide some other useful interface.
from multihashing.hashlib import multihash
Or something more concise :)
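A rough sketch of that wrap-the-interface-you-want idea, assuming the third-party base58 package for the extra encoding (the class and method names are made up for illustration):

```python
import hashlib
import base58  # third-party; only needed for the extra digest flavour

SHA1_CODE, SHA1_LEN = 0x11, 20

class MultihashSHA1:
    """Wraps a plain hashlib.sha1 object and adds multihash-flavoured digests."""

    def __init__(self, data=b''):
        self._h = hashlib.sha1(data)

    def update(self, data):              # the hashlib-compatible surface...
        self._h.update(data)

    def digest(self):
        return bytes([SHA1_CODE, SHA1_LEN]) + self._h.digest()

    def hexdigest(self):
        return self.digest().hex()

    def base58btc_digest(self):          # ...plus the extra interface on top
        return base58.b58encode(self.digest())  # bytes or str, depending on the package version
```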
@candeira that sgtm.
If the two-module thing sounds cumbersome, another option is to use subpackages in Python. I prefer super modular things (hence breaking it up into two modules), but the py and go communities in general don't.
@JulienPalard: I don't know that we are talking about the same thing when we mention encoding. It comes down to whether a separation is needed between the functionality of multihash and that of multihashing.
multihash would be the low level library that doesn't know anything about crypto, but only about the multihash data structure. Thus, multihash.encode() is just a packer of a (hashcode, length, digest) into a single binary struct/string, after validation of inputs. It doesn't do any hashing of the content data, because its input is already the binary digest from a hash function.
multihashing would be the higher level library that takes some data and the name of a hash function, and returns a digest packed into a multihash value. This is the one that you want to be compatible with Python's hashlib, a goal I agreed with at first, now not so much.
Your multihash package is actually an implementation of multihashing that doesn't consider a standalone multihash necessary, and implements the necessary functionality of multihash without exposing the low level API.
However, after reading about the difference between multihash and multihashing, there are several reasons why I think it's important to expose both:

- packing and unpacking the multihash value is a responsibility of multihash, not of multihashing;
- while multihashing can implement common standard hashes, there is a place in the multihash spec for application implementors to just add their own; they would use multihash to pack the multihash value;
- consumers should decode a multihash through multihash instead of just slicing the value mh[2:].

As to the necessity of a base58btcdigest() encoder in multihashing, it's the standard human readable string encoder for /ipfs/<multihash> URIs, so it should be at least as important and necessary as hexdigest() was considered to be in hashlib when it was first designed. However, this is not as important as having multihashing.verify(), for instance, to be able to easily verify that a blob retrieved via a multihash key indeed hashes to that value.

In short, having a hashlib compatible API is useful, but restricting ourselves to only the hashlib compatible API denies our users needed functionality. After all, hashes will be compared and verified much more often than created.
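A small, self-contained sketch of what such a verify() could do, assuming the single-byte (code, length) prefix and a tiny excerpt of the function table:

```python
import hashlib

FUNC_NAMES = {0x11: 'sha1', 0x12: 'sha256'}  # illustrative excerpt only

def verify(mh, data):
    """Decode the multihash, re-hash the blob and compare, honouring truncation."""
    if len(mh) < 3:
        return False
    code, length, digest = mh[0], mh[1], mh[2:]
    if len(digest) != length or code not in FUNC_NAMES:
        return False
    fresh = hashlib.new(FUNC_NAMES[code], data).digest()
    return fresh[:length] == digest

assert verify(b'\x11\x14' + hashlib.sha1(b'foo').digest(), b'foo')
```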
I think there is a design that can please everyone, and I plan to offer an implementation soon. Here it is:
multihashing
├── __init__.py
├── multihashing.py # JulienPalard's high level library, refactored to use the low level API, and extended with verify(), truncation, etc.
├── multihash.py # tehmaze's low level multihash packing library
└── hashlib.py # multihashing, restricted to expose a cloned API of stdlib's hashlib
This fulfills everybody's needs and requirements:
# multihashing.multihashing is the high level utility module
# it allows to create default hashes (including base58 encoded ones),
# truncated hashes, verify blobs against hashes, etc.
>>> from multihashing import multihashing
>>> m1 = multihashing.sha1()
>>> m1.update(b'foo')
>>> m1.digest()
b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'
>>> m1.digest(length=10) # we make truncated hashes by passing a shorter length than default digest length
b'\x11\x0a\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]'
>>> m1.hexdigest()
11140beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
>>> m1.base58btcdigest()
5dqx43zNtUUbPj97vJhpHyUUPyrmXG
>>> multihashing.verify(b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3', b'foo')
True
>>> multihashing.verify('5dqx43zNtUUbPj97vJhpHyUUPyrmXG', b'foo', 'base58btc')
True
# multihashing.multihash is the low level pack-and-unpack library
>>> from multihashing import multihash
>>> m2 = multihash.encode('sha1', 20, b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3')
>>> m2_truncated = multihash.encode('sha1', 10, b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3') # we make truncated hashes by passing a shorter length than the default digest length; this is how multihashing truncates
>>> multihash.check(m2) # explicit checking returns boolean
True
>>> multihash.decode(m2) # implicit checking throws exception if not a well formed multihash
DecodedMultihash(code=17, name='sha1', length=20, digest=b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3')
# multihashing.hashlib exposes a hashlib compatible API and nothing else,
# and only creates hashes with default lengths and either bytestring or hex encoding for digests
>>> from multihashing import hashlib
>>> m3 = hashlib.sha1()
>>> m3.update(b'foo')
>>> print(m3.hexdigest())
11140beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
>>> print(m3.base58btcdigest())
AttributeError: etc.
>>> print(m3.digest(length=10))
TypeError: digest() takes no arguments (1 given)
# multihashing.hashlib and multihashing.multihashing objects can be compared
>>> m1.digest() == m2
True
>>> m3.digest() == m2
True
>>> m1 == m3
True
>>> m2_truncated == m1.digest(length=10)
True
@jbenet I guess you were using 'modules' in a Go sense. This proposal suggests having one single Python package with three independent Python modules, i.e. three .py files. I don't see any reason to split it further.
I still don't get the difference between multihash and multihashing, and I still think the names are badly chosen. However, I'm aware that my multihash implementation doesn't provide a way to unhash a hash (I mean, "parse it"). But if we have to implement it, why not call it "unmultihash"?
It looks like I missed your message https://github.com/ipfs/py-ipfs/issues/23#issuecomment-158641195, so I'm re-replying.
I still think that multihash and multihashing are very badly chosen names, and people will confuse them every time they need one. And I don't think that we need to choose bad names just because a JavaScript implementation chose bad names.
On the other hand, I'm aware that sticking to a pure hashlib compatible API is very restrictive.
But adding a base58encode method to multihash seems wrong to me; there's a problem of responsibility: it's not the hashing library's responsibility to provide human readable representations of hashes. If you go this way, is it also the responsibility of the hashing library to display them in a graphical user interface as a textbox, and also the responsibility of the hashing library to find a blob from its hash in a merkle tree? No.
I still agree that we need a base58 library, but not only for the hashing representation, probably for some other representations too.
It all boils down to choosing between:
import multihash
import base58
print(base58.encode(multihash.hash(...)))
and
import multihash
print(multihash.hash(...).base58encode())
As I understand it, multihashing may be a "human API" on top of multihash that provides whatever is convenient for working with it, like base58encode and some other useful methods. A facade, for humans. In this sense it's not "that bad", but the names are still very confusing and will cost a lot of time for everybody (I mean: "import multihash ... multihash. ... .base58encode() oh, it throws, it's multihashing, I did it again..."). And our points of view on the separation of responsibilities diverge clearly here.
A long message, to propose almost nothing better; that's bad, real bad, sorry. But before choosing names, we have to synchronize on what we want those two different libs to do (what responsibilities they get). We're not simply copying a JavaScript implementation.
I think that the first one makes a lot more sense to me, but I would change base58 to mirror Python's base64 module:
import multihash
from base58 import b58encode, b58decode
print(b58encode(multihash.hash(...)))
In general I think we should have these basic modules:
and then these modules for multihash:
All the multihash modules can be part of py-multihash, and I don't think the naming matters very much between multihash/multihashing.
For hashlib compatibility we only need to provide a single class, and in my opinion it still makes sense to provide some additional functionality related to multihash in the same module.
For instance (and taking much inspiration from the examples above), a single multihash.py module may be created that provides a hashlib-compatible class (hash) and a structured utility class (Multihash) in the same place:
>>> import multihash
>>> hash = multihash.hash('sha1') # hashlib-compatible object
>>> hash.update(b'foo')
>>> digest = hash.digest()
>>> digest
b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'
>>> import base58
>>> base58.b58encode(digest) # basic way to encode a digest
'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'
>>> mh = multihash.Multihash.from_digest(digest) # parse the digest
>>> mh
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'\x0b...')
>>> mh.encode('base58') # same result as above
'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'
# Truncating while encoding may be supported
# but it breaks ``Mh.decode(mh.encode(X), X) == mh``,
# thus I prefer creating a different Multihash explicitly.
>>> mh.encode('base58', length=10)
'KegLCXHeZVecdh4Y'
>>> mh_trunc = mh.truncate(10) # truncate to a new multihash
>>> mh_trunc
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=10, digest=b'\x0b...')
>>> mh_trunc.encode('base58') # same result as above
'KegLCXHeZVecdh4Y'
>>> mh_dec = multihash.Multihash.decode('KegLCXHeZVecdh4Y', 'base58') # decoding
>>> mh_dec
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=10, digest=b'\x0b...')
>>> mh_dec == mh_trunc
True
# No ``multihash.check()``, use ``from_digest()`` and catch ``ValueError``.
>>> mh.verify(b'foo') # verification
True
>>> mh_trunc.verify(b'foo')
True
The Multihash class above could well be implemented as a namedtuple for memory savings. I used an Enum for the hash function as a comfortable and compact way to have the function code and name together, although numbers and strings may be accepted as well:
>>> multihash.Multihash(multihash.Func.sha1, 20, b'...') # function by enum
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'...')
>>> multihash.Multihash(0x11, 20, b'...') # function by code
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'...')
>>> multihash.Multihash('sha1', 20, b'...') # function by name
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'...')
>>> multihash.Multihash(0x04, 16, b'...') # user-defined hashing function
multihash.Multihash(func=4, length=16, digest=b'...')
>>> multihash.Multihash(0x15, 16, b'...') # unknown hashing function
ValueError: ...
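One possible shape of the namedtuple-plus-Enum combination described above; the Func members and the coercion in __new__ are assumptions for illustration, not pymultihash's actual internals:

```python
from collections import namedtuple
from enum import Enum

class Func(Enum):
    sha1 = 0x11
    sha2_256 = 0x12

class Multihash(namedtuple('Multihash', 'func length digest')):
    __slots__ = ()

    def __new__(cls, func, length, digest):
        # Accept an enum member, a numeric code, or a function name.
        if isinstance(func, str):
            func = Func[func.replace('-', '_')]
        elif isinstance(func, int):
            try:
                func = Func(func)
            except ValueError:
                if not 0x00 <= func <= 0x0f:  # app-specific code range in the spec
                    raise ValueError('unknown hash function %#x' % func)
        return super().__new__(cls, func, length, digest)

Multihash('sha1', 20, b'...')   # -> Multihash(func=<Func.sha1: 17>, length=20, digest=b'...')
Multihash(0x04, 16, b'...')     # -> app-specific code, kept as a plain int
```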
Top-level utility functions may be defined to use encoded multihash strings straight away instead of Multihash (e.g. mh_digest == b'\x11\x14\x0b...' and mh_string == '5dqx...'):
multihash.encode(mh_digest, coding[, length]) -> mh_string
multihash.decode(mh_string, coding) -> mh_digest
multihash.verify(mh_string, coding, data) -> bool
But I don't much see the point of these last functions.
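If those helpers were wanted anyway, they could stay thin; a sketch handling only the 'base58' coding and leaning on the third-party base58 package (an assumption, as is the single-byte prefix scheme) might look like:

```python
import base58

def encode(mh_digest, coding, length=None):
    if coding != 'base58':
        raise ValueError('unsupported coding: %r' % coding)
    if length is not None:  # truncate before armoring, keeping the single-byte prefix
        code, digest = mh_digest[0], mh_digest[2:]
        mh_digest = bytes([code, length]) + digest[:length]
    return base58.b58encode(mh_digest)

def decode(mh_string, coding):
    if coding != 'base58':
        raise ValueError('unsupported coding: %r' % coding)
    return base58.b58decode(mh_string)
```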
After reviewing the previous proposals I've noticed that it may not make sense to implement multihash as a hashlib (PEP 247) compatible object, since it doesn't provide any new hashing method of its own. Multihash is just a way to encode the results of other hashes in a binary packaging; even the re-encoding part (with base64 and the like) would not be necessary, only comfortable to have. With that in mind I've come up with a leaner version of the proposal:
>>> import hashlib
>>> hash = hashlib.sha1() # use hashlib for proper hashing
>>> hash.update(b'foo')
>>> digest = hash.digest()
>>> digest
b'\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'
>>> import multihash
>>> mh = multihash.Multihash.from_hash(hash)
>>> mh
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=20, digest=b'\x0b...')
>>> mh.encode() # binary multihash encoding
b'\x11\x14\x0b\xee\xc7\xb5\xea?\x0f\xdb\xc9]\r\xd4\x7f<[\xc2u\xda\x8a3'
>>> import base58
>>> bytes(base58.b58encode(mh.encode())) # plain way to ASCII-armor
b'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'
>>> mh.encode('base58') # just a shortcut to the operation above
b'5dqx43zNtUUbPj97vJhpHyUUPyrmXG'
>>> mh_trunc = mh.truncate(10) # truncate to a new multihash
>>> mh_trunc
multihash.Multihash(func=<multihash.Func.sha1: 17>, length=10, digest=b'\x0b...')
>>> mh_trunc.encode('base58')
b'KegLCXHeZVecdh4Y'
# Truncating while encoding may be supported
# but ``decode(mh.encode(X, L)) != decode(mh.encode(X))``,
# thus I prefer creating a different Multihash explicitly.
>>> mh.encode('base58', length=10) # same result as above
b'KegLCXHeZVecdh4Y'
>>> mh_bin = multihash.decode(mh.encode()) # binary encode and decode
>>> mh == mh_bin
True
>>> mh_dec = multihash.decode(b'KegLCXHeZVecdh4Y', 'base58')
>>> mh_dec == mh_trunc
True
>>> mh.verify(b'foo')
True
>>> mh_trunc.verify(b'foo')
True
>>> multihash.hash(hashlib.sha1, b'foo') == mh
True
# no multihash.check(), use multihash.decode() and catch ValueError
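A guess at how from_hash() could be built on top of hashlib objects: map the object's .name attribute to a multihash function code and take its digest. The table and the single-byte encode() below are illustrative, not pymultihash's actual internals:

```python
import hashlib
from collections import namedtuple

CODE_BY_HASHLIB_NAME = {'sha1': 0x11, 'sha256': 0x12, 'sha512': 0x13}

class Multihash(namedtuple('Multihash', 'func length digest')):
    @classmethod
    def from_hash(cls, h):
        code = CODE_BY_HASHLIB_NAME[h.name]  # KeyError for unsupported hash functions
        digest = h.digest()
        return cls(code, len(digest), digest)

    def encode(self):
        return bytes([self.func, self.length]) + self.digest

h = hashlib.sha1(b'foo')
assert Multihash.from_hash(h).encode().hex().startswith('1114')
```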
Ongoing implementation of the proposal just above in https://github.com/ivilata/pymultihash.
We're overlapping, ivilata. My ongoing implementation is at https://github.com/candeira/py-multihash.
Hi @candeira, my implementation is based on my proposal above, which departs from your multihash/multihashing proposal to focus on something more pythonic (to my taste, of course!). It is currently only lacking some error checks and an introductory module docstring with some comprehensive usage examples beyond function doctests. I hope that the complete doc can offer a better feeling for the module and help us decide which parts to reuse, merge or drop. No worries if the whole thing is merged or dropped altogether. :)
Thanks!
I've uploaded pymultihash to PyPI. It's feature complete, with some testing (using doctests) and a pretty complete tutorial in the package docstring (which I'm working to put through Sphinx). You're very welcome to have a look at it and give your opinions, thanks! Edit: docs available on ReadTheDocs.
@ivilata this is really awesome!!
Really good! I'm still struggling to find time to finish mine. More to come.
Thanks! In general I'd like to make a case for not trying to mimic the Go or JS implementations, and instead going for something more pythonic where possible. :)
I found another multihash implementation: https://github.com/kalmi/py-multihash
@Fak3 YAYYYYYYYYYY ! #irony
@Fak3 there's a few out there but afaik the one made by @ivilata is the most fleshed out: https://github.com/ivilata/pymultihash
There is an additional multiaddr implementation by @sbuss present as well: https://github.com/sbuss/py-multiaddr.