Closed Stebalien closed 4 years ago
Somewhat related https://github.com/libp2p/go-libp2p-crypto/issues/51.
Very related: https://github.com/libp2p/go-libp2p/issues/484.
Current representation of PeerIDs is base58, which is case-sensitive and thus not Origin-safe. In the context of IPFS CIDs switching to base32 the time feels right to solve Origin problem for IPNS websites as well.
So the ask from Web Browsers WG is:
http://<keyid-b32>.ipns.dweb.link
To me having a peer ID that is anything more than the hash of a public key would feel very over-engineered.
To me having a peer ID that is anything more than the hash of a public key would feel very over-engineered.
So, at the very least, we need to support multiple hash functions. That's what this is about.
The idea behind using CIDs is that would bring this in-line with everything else. In addition to being able to upgrade the hash function, we'd:
Of course, the other side of this issue is that we'd be making libp2p depend on IPLD and I understand that the libp2p team may not be happy with that. So, let's try to not tie all of these together.
Concrete non-CID proposal (kill two birds with one stone):
Note 1: 5 is not a network-breaking change as these won't cross the network anyways.
My concern is that this does make switching to CID-backed peer IDs painful. Again, we don't need that to be a network-breaking change as we can continue using raw multihash CIDs on the network (for now) however, technically, we could run into issues if we ever bump the CID version number to something that's a valid multihash codec.
Concrete CID proposal (kill three birds with a planet):
This doesn't add full CID-peer-ID support. Instead, it just gives us a way to convert between them and allows us to use CIDs in text (and gives us some room to upgrade later).
Using CIDs has another benefit. As we move to accepting multiple keys, we will eventually need to think of a key as a file that might have multiple GBs and using the same identified that we use for content (therefore it being called Content Identified) will enable us to support that smoothly. More notes on https://github.com/ipfs/specs/issues/58
Concrete CIDs all the way proposal (that's no moon):
The primary reason to go with this over the "kill three birds with a planet" scheme is that that scheme only works as long as a CID version number never conflicts with a multihash code (other than 0). Once one does, we won't be able to distinguish between a PIDv0 and a CIDvN.
I don't think the text format of a peer ID should be relevant to the libp2p specs at all. This spec is about how things are transmitted over the network, and the peer ID is never transmitted in text format as far as I know.
Transmitting peer IDs as CIDs over the network would be wasteful, as you're only ever going to require raw data anyway if you don't want to run into tons of crazy corner cases.
If IPFS or IPLD or Filecoin want to format peer IDs as CIDs, they can do so and decode the ID into a multihash before using it in libp2p. However that would not be relevant to the libp2p specs.
We also need to mention that connecting to a peer whose public key reports a different hash function that the one we have should be treated as a key mismatch, even if the hash itself is correct.
Supporting multiple hash functions raises the question of the number of kbuckets in Kademlia, however I'm not an expert enough to know if this is actually a problem, or if it's attackable.
I very much want to preserve public key inlining in the identity, as it is a key property for pusbub message signing to be efficient. Can we make sure we have inline keys from the get-go with whatever scheme we end up adopting?
I don't think the text format of a peer ID should be relevant to the libp2p specs at all. This spec is about how things are transmitted over the network, and the peer ID is never transmitted in text format as far as I know.
Well, given that this is a system for humans, I'd have to strongly disagree. There would be no web without URLs/URIs.
These IDs do show up in text all the time. Usually in the form /p2p/...
(well, right now we usually use /ipfs/Qm...
but we're working on switching to /p2p/...
to remove the association with ipfs).
Transmitting peer IDs as CIDs over the network would be wasteful, as you're only ever going to require raw data anyway if you don't want to run into tons of crazy corner cases.
CIDs give us flexibility. With raw multihashes, we're stuck with the current key format and can't change it. With CIDs, we can introduce new formats unambiguously (because the CID itself states the key format).
Supporting multiple hash functions raises the question of the number of kbuckets in Kademlia, however I'm not an expert enough to know if this is actually a problem, or if it's attackable.
KAD re-hashes everything anyways using sha2-256.
We also need to mention that connecting to a peer whose public key reports a different hash function that the one we have should be treated as a key mismatch, even if the hash itself is correct.
That would be a mismatch. We'd be mapping two names (peer IDs) onto two different peers. We never want to end up in a situation where we assign more than one peer ID to a single, logical, identity.
Can we make sure we have inline keys from the get-go with whatever scheme we end up adopting?
We can keep support, we just need to avoid creating keys with inlined peer IDs by default.
These IDs do show up in text all the time. Usually in the form /p2p/... (well, right now we usually use /ipfs/Qm... but we're working on switching to /p2p/... to remove the association with ipfs).
They show up only if you use the IPFS, Filecoin or IPNS CLI. As far as I understand, libp2p is supposed to be purely a library. Things built on top of libp2p can decide whether to a multibase and convert that to a multihash before passing to libp2p.
CIDs give us flexibility. With raw multihashes, we're stuck with the current key format and can't change it. With CIDs, we can introduce new formats unambiguously (because the CID itself states the key format).
The peer ID is the hash of an arbitrary vector of bytes. Right now this underlying vector of bytes is a protobuf struct, but could be changed to be anything.
KAD re-hashes everything anyways using sha2-256.
What do you mean "re-hashes"? My concern is: if I perform a Kademlia FIND_NODE RPC query, and a remote send back a result containing SHA512 multihash, what is our local node supposed to do? Extend itself to 512 kbuckets?
What do you mean "re-hashes"?
We treat peer IDs as opaque keys and always hash them with sha256 (like we do with other keys). Basically, we make no assumptions about the peer ID format when figuring out where a peer ID fits in the DHT keyspace.
As far as I understand, libp2p is supposed to be purely a library. Things built on top of libp2p can decide whether to a multibase and convert that to a multihash before passing to libp2p.
Libp2p is a library but it's also an interoperable network protocol. Peer IDs need interoperable, human-readable forms just like IP addresses, MAC addresses, etc.
The peer ID is the hash of an arbitrary vector of bytes. Right now this underlying vector of bytes is a protobuf struct, but could be changed to be anything.
You'd have to somehow communicate how the this "arbitrary byte vector" should be decoded and how it should be understood as a public key. We currently use the codec in CIDs to communicate this kind of information.
Libp2p is a library but it's also an interoperable network protocol. Peer IDs need interoperable, human-readable forms just like IP addresses, MAC addresses, etc.
But why do they need need a human-readable form, since they are only ever transmitted as bytes over the network?
I'm not saying their human-readable form shouldn't be defined. I'm saying that the network protocol shouldn't be changed because of out-of-scope human-readable reasons.
We treat peer IDs as opaque keys and always hash them with sha256 (like we do with other keys).
Does that mean that if a remote says that its public key should be hashed with sha512, then we hash it with sha256 anyway for DHT insertion purposes?
But then, if we receive a Kademlia RPC query, do we answer with sha256 hashes, or do we answer with sha512 hashes? Or both (the protocol can't do both at the moment)? If we answer with the sha256 hash, then we have the problem of multiple peer IDs pointing to the same thing. If we answer with the sha512 hash, then the remote has to connect to the peer first in order to determine its public key and hash it as sha256. If we answer with both, we need to modify the Kademlia protocol, and also the remote can't verify that the hashes actually correspond to the same public key.
Hey, can't quite pull up all the context to this at the moment, but my intuition in the past has been to make PeerIDs a subset of CIDs (so that the meaning of "CID" would change to "Cryptographic IDentifier"), and to do much of the same CIDv0/CIDv1 dance we did in content-land. It'd avoid duplication of concepts/protocols, and it'd be a migration that we already have some experience in.
It would also be super useful to do the same base32 thing that we're doing with content CIDs, so we can seamlessly use the peer CIDs in URLs.
I sense the discussion above is already going into that direction, and I can't add anything new to that at the moment, so I'll leave it at that :)
But why do they need need a human-readable form, since they are only ever transmitted as bytes over the network?
They get put into multiaddrs all the time (/p2p/<peerid>
or legacy /ipfs/<peerid>
), and any multiaddr protocol must have a human-readable representation. Addresses get logged in libp2p itself, and blessed libp2p tools have UIs as well (e.g. daemon, testing tools, etc.).
They get put into multiaddrs all the time (/p2p/
or legacy /ipfs/ ), and any multiaddr protocol must have a human-readable representation.
That's just moving the problem. Multiaddresses are also only ever transmitted as binary, and thus don't need to have their human-readable version defined as part of the network protocol.
I'm extremely confused as to why it's a good idea to make machine A decide in which format a peer id should be displayed on machine B, and not just let machine B decide.
Addresses get logged in libp2p itself, and blessed libp2p tools have UIs as well (e.g. daemon, testing tools, etc.).
That's not related to the wire protocol, which is what this is all about.
Some notes from skimming this:
@tomaka
To me having a peer ID that is anything more than the hash of a public key would feel very over-engineered.
What about having the peerID actually be the public key? I think that its unarguable how useful that is. How would you go about representing that distinction over the wire?
Kademlia hashes the binary encoded form of the peerID with sha256. The contents of the peer ID don't change this at all. The fact that peerIDs are currently the sha256 hash of the encoded public key has no effect on the fact that kademlia uses a sha256 hash. The keys here are just binary blobs.
Does that mean that you store both the peerID and the sha256 of the public key? I raised a few more questions in the comment above.
How would you go about representing that distinction over the wire?
I'm not sure I understand the question. We know whether we receive a peer ID or a public key based on the protocol. For example secio
and identify
send us a public key, while relay
sends us peer IDs.
fragmentation in the human readable format for addresses smells bad to me. I think everyone agrees about how to write a human readable ipv4 address, Different systems don't render those differently (to the best of my knowledge). I'm not saying different systems can't choose to do that, they most certainly can, but I think that choosing to do so would cause more confusion to users.
I agree that everything should have the same human-readable representation, but I don't see how it is a good solution to let a remote node on the network choose what the representation of their key should be.
I'm not saying their human-readable form shouldn't be defined. I'm saying that the network protocol shouldn't be changed because of out-of-scope human-readable reasons.
I completely agree. That's why I listed three options:
Why bundle these questions together? Unfortunately, wire format changes tend to affect display and parsing.
I agree that everything should have the same human-readable representation, but I don't see how it is a good solution to let a remote node on the network choose what the representation of their key should be.
That was never the intention here. The key's creator needs to be able to choose the hash function used in the peer ID so they don't end up with multiple peer IDs (not for display, for code) however, they don't get to, e.g., decide on the base encoding the user decides to use.
That was never the intention here. The key's creator needs to be able to choose the hash function used in the peer ID so they don't end up with multiple peer IDs (not for display, for code) however, they don't get to, e.g., decide on the base encoding the user decides to use.
I guess I misunderstood then. The CID specs describe a CID as having a multibase, which is what I was talking about all along: https://github.com/ipld/cid#how-does-it-work---protocol-description I guess you wouldn't have that multibase on the wire? (but then shouldn't it be called something else than a CID?)
I would still love to get some clarifications on how Kademlia would work after allowing hashes other than SHA-256.
I guess you wouldn't have that multibase on the wire? (but then shouldn't it be called something else than a CID?)
Technically, it does. However, we don't actually do this anywhere (well, except for the CBOR format) because it's redundant (and I'm certainly not going to argue for it).
I would still love to get some clarifications on how Kademlia would work after allowing hashes other than SHA-256.
Kademlia uses sha-256 and only sha-256 for all keys. Given a peer ID ID = SOME_HASH(publicKey)
, we map this into kademlia's key-space with SHA256(ID)
.
Note: This does mean that SOME_HASH
needs to be deterministic for a given key (otherwise the DHT position changes). That's why this proposal adds a HashFunction
field to the public key datastructure.
@tomaka
We know whether we receive a peer ID or a public key based on the protocol. For example secio and identify send us a public key, while relay sends us peer IDs.
No, thats what i'm saying. In some cases, the peer ID is the public key. its not sent separately.
Also, on kademlia:
As @Stebalien says, no matter what kind of peer ID you have, kademlia will take the sha256 hash of it. Think of it like a normal hash table, where the peer ID is the 'key'. If i have a hash table, the hash table doesnt care if my key is a string of zeros, a blake2b hash, my name, or the first 32 byte of a picture of my cat. It's going to hash the key with its hash function, and place the value into the bucket (alongside the key itself). So whether or not a peerID is sha256, or blake2, or a raw public key, kademlia will apply a sha256 hash over the top of it in all cases.
Kademlia uses sha-256 and only sha-256 for all keys. Given a peer ID ID = SOME_HASH(publicKey), we map this into kademlia's key-space with SHA256(ID).
I see, thanks.
For what it's worth, my inner feeling loudly screams that using a CID is a very bad idea, but since this is purely a feeling I get from my general IT experience I'm going to stop arguing so that things can move forward.
Another hopefully last small question I have is related to this:
I very much want to preserve public key inlining in the identity, as it is a key property for pusbub message signing to be efficient.
Since each node generates their own public key, and therefore decides which hashing algorithm is to be used, how would you guarantee that they're using the identity hash?
You can't modify the hashing algorithm in the public key protobuf, otherwise you would no longer have peer IDs unity.
And what about modifying the pubsub protocols instead, in order to transmit the protobuf of the public key instead of its hashed version (which to me would make more sense and be much less intrusive to libp2p as a whole)?
Since each node generates their own public key, and therefore decides which hashing algorithm is to be used, how would you guarantee that they're using the identity hash? You can't modify the hashing algorithm in the public key protobuf, otherwise you would no longer have peer IDs unity.
Nodes would have to choose the appropriate hash function when they generate their key (recording the choice in the key itself). Changing the hash function will, effectively, change the peer's identity.
And what about modifying the pubsub protocols instead, in order to transmit the protobuf of the public key instead of its hashed version (which to me would make more sense and be much less intrusive to libp2p as a whole)?
We currently include both but you're right, we could just include the key without the ID. Really, for signatures, we can pretty much always do this.
Nodes would have to choose the appropriate hash function when they generate their key (recording the choice in the key itself). Changing the hash function will, effectively, change the peer's identity.
Again I'm not sure I understand the answer.
The main goal of this change is to be able to use the identity
hashing for public keys and make signatures verification more straight forward. But nothing guarantees that nodes will choose identity
and not whatever hash they want. If we enforce that nodes must always choose identity
, then there is no point in making in flexible.
@tomaka identity hashing for keys only works when the key is small enough to embed in a peer ID. RSA keys, and other types of quantum resistant key algorithms are way too big to be embedded.
If we enforce that nodes must always choose identity, then there is no point in making in flexible.
We currently enforce this if the key is < 42 bytes. However, that's causing issues for open bazaar. The proposal here is to add an explicit "hash function" field to let the creator of the key choose.
A peer may choose not to use the identity hash function with their tiny key. That's up to them. If they don't, we'll have to do things the old way.
However, that's causing issues for open bazaar.
I don't think I've seen any mention of what these issues exactly are. The only thing I see mentioned is that programmers were calling the wrong function and causing bugs, which to me is more an engineering issue than a specifications issue and not a valid reason to change the specs.
A peer may choose not to use the identity hash function with their tiny key. That's up to them. If they don't, we'll have to do things the old way.
Isn't that a concern? That a node's behaviour and performances depend on what potentially malicious remote node decides to do?
Just a note for this thread. We don't currently have anything running in production that uses inline ed25519 keys. We use the sha256 hash of an ed25519 key. Inline keys are something we'd like to switch to as failure to find the public key is a common problem when trying to send messages to nodes that haven't been online in a while.
I don't think I've seen any mention of what these issues exactly are.
The issue was that the peer IDs changed under their feet.
Timeline:
This proposal is to:
This also allows users to choose non-sha256/identity hash functions for their peer IDs.
Isn't that a concern? That a node's behaviour and performances depend on what potentially malicious remote node decides to do?
Not really. They could also just use a rsa8192 key.
this shouldn't have been closed!
As a quick discussion checkpoint: what are the next steps/blockers?
(Personally, I am mainly interested in unblocking https://github.com/ipfs/go-ipfs/issues/5287)
Step 1 is that someone on both the JS team and the GO needs to own this. I've added it to the libp2p team week planning document.
On the go side, it may be a bit tricky due to the current go-libp2p-peer and go-libp2p-crypto dependency inversion. The key (from go-libp2p-crypto) will need to remember the hash function to generate the peer. However, all the logic for generating peer IDs lives in go-libp2p-peer. We'll probably have to move it to go-libp2p-crypto and then re-export it from go-libp2p-peer.
I feel strongly about this, so I can own. I want thoz inlined keys!
What's the opinion on my alternative suggestion, which is to modify floodsub/gossipsub to transmit the non-hashed public key instead?
it already does that when the key is not inlined in the id.
Oh I see: https://github.com/libp2p/go-libp2p-pubsub/commit/d2f7a664d2f748867eb0d1f7465419a4887c45cd#diff-9b6fbbdd6fa2676b0d87d60d96f4b556 I was not aware of this change, and it looks like the JS version isn't up-to-date: https://github.com/libp2p/js-libp2p-floodsub/blob/355ab300cd3b5765b46fd17f96eb81187c0ba7e6/src/message/rpc.proto.js.
Why is inlining in the PeerId
necessary, then?
it's not necessary, it is desirable!
I'm still extremely confused. This change strongly appears to me like an XY problem, and I cannot figure out why this change is wanted at all. All along I thought that it was because there was no way of verifying the signature of a pubsub message, but now that you point out to me that this is not the reason, I can't see why.
we would like to avoid the overhead of having to transmit the key.
Why not transmit the key and remove the peer ID from the message instead? It can easily be recomputed from the key, and the overhead would be the same as inlining the key in the ID.
that's a backwards incompatible change, we can't do that.
You can have a new version of floodsub
and support both the old and new versions for a while.
This is the similar to having newer nodes sending you the inline key and old nodes sending you the hashed key.
IMO, the original motivations for embedding the key in the peer ID has now mostly been obsoleted by simply embedding the key in the record (which I believe is the "more correct" solution). We can make the peer ID optional and eventually stop including it in favor of simply deriving it from the key. However, we should still support alternative hash functions for peer IDs:
In the beginning, peer IDs were base58 encoded sha256 multihashes. However, being able to embed a key into a peer ID was rather tempting so we introduced a IDFromEd25519PublicKey function. This function created a different peer ID that embedded the public key.
Unfortunately,
IDFromPublicKey
function, they'd get a different peer ID for the same key. This obviously isn't good as our security transports derive peer IDs from public keys instead of validating peer IDs (this makes it impossible to mis-validate).Therefore, I submitted https://github.com/libp2p/go-libp2p-peer/pull/30 which:
Unfortunately, this last bit appears to be a problem... At the time, I was under the impression that nobody had actually gotten around to using ed25519 keys as inlining didn't really work (libp2p would still use the sha256 peer IDs). However, it turns out that OpenBazaar was using them anyways.
Second, in retrospect, this wasn't even the optimal solution. Really, we also need to support arbitrary hash functions but this only adds support for the identity hash function.
So, my proposal here is to:
Open question: Should we use this as an opportunity to switch to multibase-enabled peer IDs now? One significant issue with the current embedding method is that it makes it slightly harder to switch to multibase peer IDs (cc @lidel). I've previously proposed switching to CID-based peer IDs but that may be more work than it's worth (a motivation would be transfering keys over bitswap, easier to support large quantum-safe keys, format flexibility, etc.).