libp2p / specs

Technical specifications for the libp2p networking stack
https://libp2p.io
1.58k stars 275 forks source link

Specify peer ID hash function in public key #111

Closed Stebalien closed 4 years ago

Stebalien commented 5 years ago

In the beginning, peer IDs were base58 encoded sha256 multihashes. However, being able to embed a key into a peer ID was rather tempting so we introduced a IDFromEd25519PublicKey function. This function created a different peer ID that embedded the public key.

Unfortunately,

  1. Users had to remember to call this.
  2. If a user called the normal IDFromPublicKey function, they'd get a different peer ID for the same key. This obviously isn't good as our security transports derive peer IDs from public keys instead of validating peer IDs (this makes it impossible to mis-validate).

Therefore, I submitted https://github.com/libp2p/go-libp2p-peer/pull/30 which:

  1. Removed IDFromEd25519PublicKey.
  2. Changed inline peer IDs to just inline the protobuf (instead of introducing an entirely new format).
  3. Auto-inlined peer ids <= 42 bytes.

Unfortunately, this last bit appears to be a problem... At the time, I was under the impression that nobody had actually gotten around to using ed25519 keys as inlining didn't really work (libp2p would still use the sha256 peer IDs). However, it turns out that OpenBazaar was using them anyways.

Second, in retrospect, this wasn't even the optimal solution. Really, we also need to support arbitrary hash functions but this only adds support for the identity hash function.


So, my proposal here is to:

  1. Revert that change. I'm pretty sure OpenBazaar is the only ed25519 user in the wild so that should be safe.
  2. Introduce a new field in the public/private key protobuf that explicitly specifies the hash function (defaulting to "sha256" if missing).

Open question: Should we use this as an opportunity to switch to multibase-enabled peer IDs now? One significant issue with the current embedding method is that it makes it slightly harder to switch to multibase peer IDs (cc @lidel). I've previously proposed switching to CID-based peer IDs but that may be more work than it's worth (a motivation would be transfering keys over bitswap, easier to support large quantum-safe keys, format flexibility, etc.).

Stebalien commented 5 years ago

Somewhat related https://github.com/libp2p/go-libp2p-crypto/issues/51.

Very related: https://github.com/libp2p/go-libp2p/issues/484.

lidel commented 5 years ago

Current representation of PeerIDs is base58, which is case-sensitive and thus not Origin-safe. In the context of IPFS CIDs switching to base32 the time feels right to solve Origin problem for IPNS websites as well.

So the ask from Web Browsers WG is:

tomaka commented 5 years ago

To me having a peer ID that is anything more than the hash of a public key would feel very over-engineered.

Stebalien commented 5 years ago

To me having a peer ID that is anything more than the hash of a public key would feel very over-engineered.

So, at the very least, we need to support multiple hash functions. That's what this is about.

The idea behind using CIDs is that would bring this in-line with everything else. In addition to being able to upgrade the hash function, we'd:

Of course, the other side of this issue is that we'd be making libp2p depend on IPLD and I understand that the libp2p team may not be happy with that. So, let's try to not tie all of these together.

Stebalien commented 5 years ago

Concrete non-CID proposal (kill two birds with one stone):

  1. Remove (for now) support for embedding ed25519 keys.
  2. Add support for parsing multibase-prefixed peer IDs (in text).
  3. Add read support for multiple hash functions but tell people to not use other hash functions yet.
  4. :hourglass_flowing_sand: :hourglass: :hourglass_flowing_sand: :hourglass:
  5. Switch to formatting peer IDs using multibase by default.
  6. Expose support for multiple hash functions. All new ed25519 keys specify the "identity" hash function by default.

Note 1: 5 is not a network-breaking change as these won't cross the network anyways.


My concern is that this does make switching to CID-backed peer IDs painful. Again, we don't need that to be a network-breaking change as we can continue using raw multihash CIDs on the network (for now) however, technically, we could run into issues if we ever bump the CID version number to something that's a valid multihash codec.

Stebalien commented 5 years ago

Concrete CID proposal (kill three birds with a planet):

  1. Remove (for now) support for embedding ed25519 keys.
  2. Add support for parsing CID-v1 peer IDs (in text). In binary, we store them as multihashes.
  3. Add read support for multiple hash functions but tell people to not use other hash functions yet.
  4. :hourglass_flowing_sand: :hourglass: :hourglass_flowing_sand: :hourglass:
  5. Switch to formatting peer IDs as CIDs (in text) by default.
  6. Expose support for multiple hash functions. All new ed25519 keys specify the "identity" hash function by default.

This doesn't add full CID-peer-ID support. Instead, it just gives us a way to convert between them and allows us to use CIDs in text (and gives us some room to upgrade later).

daviddias commented 5 years ago

Using CIDs has another benefit. As we move to accepting multiple keys, we will eventually need to think of a key as a file that might have multiple GBs and using the same identified that we use for content (therefore it being called Content Identified) will enable us to support that smoothly. More notes on https://github.com/ipfs/specs/issues/58

Stebalien commented 5 years ago

Concrete CIDs all the way proposal (that's no moon):

  1. Remove (for now) support for embedding ed25519 keys.
  2. Add support for parsing CID-v1 peer IDs (in text).
  3. Add a concept of PIDv0. That is, all peer IDs for libp2p-crypto protobuf keys that don't specify their hash function get encoded to a multihash on the network. Others get encoded to PIDv1.
  4. Add read support for multiple hash functions but tell people to not use other hash functions yet.
  5. :hourglass_flowing_sand: :hourglass: :hourglass_flowing_sand: :hourglass:
  6. Switch to formatting peer IDs as CIDs by default.
  7. Add support for multiple hash functions.
  8. Eventually, start (a) specifying SHA256 in key protobufs by default (forcing them to use PIDv1). At the same time, switch to ed25519 by default because it's better in every way anyways.

The primary reason to go with this over the "kill three birds with a planet" scheme is that that scheme only works as long as a CID version number never conflicts with a multihash code (other than 0). Once one does, we won't be able to distinguish between a PIDv0 and a CIDvN.

tomaka commented 5 years ago

I don't think the text format of a peer ID should be relevant to the libp2p specs at all. This spec is about how things are transmitted over the network, and the peer ID is never transmitted in text format as far as I know.

Transmitting peer IDs as CIDs over the network would be wasteful, as you're only ever going to require raw data anyway if you don't want to run into tons of crazy corner cases.

If IPFS or IPLD or Filecoin want to format peer IDs as CIDs, they can do so and decode the ID into a multihash before using it in libp2p. However that would not be relevant to the libp2p specs.

tomaka commented 5 years ago

We also need to mention that connecting to a peer whose public key reports a different hash function that the one we have should be treated as a key mismatch, even if the hash itself is correct.

Supporting multiple hash functions raises the question of the number of kbuckets in Kademlia, however I'm not an expert enough to know if this is actually a problem, or if it's attackable.

vyzo commented 5 years ago

I very much want to preserve public key inlining in the identity, as it is a key property for pusbub message signing to be efficient. Can we make sure we have inline keys from the get-go with whatever scheme we end up adopting?

Stebalien commented 5 years ago

I don't think the text format of a peer ID should be relevant to the libp2p specs at all. This spec is about how things are transmitted over the network, and the peer ID is never transmitted in text format as far as I know.

Well, given that this is a system for humans, I'd have to strongly disagree. There would be no web without URLs/URIs.

These IDs do show up in text all the time. Usually in the form /p2p/... (well, right now we usually use /ipfs/Qm... but we're working on switching to /p2p/... to remove the association with ipfs).

Transmitting peer IDs as CIDs over the network would be wasteful, as you're only ever going to require raw data anyway if you don't want to run into tons of crazy corner cases.

CIDs give us flexibility. With raw multihashes, we're stuck with the current key format and can't change it. With CIDs, we can introduce new formats unambiguously (because the CID itself states the key format).

Supporting multiple hash functions raises the question of the number of kbuckets in Kademlia, however I'm not an expert enough to know if this is actually a problem, or if it's attackable.

KAD re-hashes everything anyways using sha2-256.

We also need to mention that connecting to a peer whose public key reports a different hash function that the one we have should be treated as a key mismatch, even if the hash itself is correct.

That would be a mismatch. We'd be mapping two names (peer IDs) onto two different peers. We never want to end up in a situation where we assign more than one peer ID to a single, logical, identity.

Can we make sure we have inline keys from the get-go with whatever scheme we end up adopting?

We can keep support, we just need to avoid creating keys with inlined peer IDs by default.

tomaka commented 5 years ago

These IDs do show up in text all the time. Usually in the form /p2p/... (well, right now we usually use /ipfs/Qm... but we're working on switching to /p2p/... to remove the association with ipfs).

They show up only if you use the IPFS, Filecoin or IPNS CLI. As far as I understand, libp2p is supposed to be purely a library. Things built on top of libp2p can decide whether to a multibase and convert that to a multihash before passing to libp2p.

CIDs give us flexibility. With raw multihashes, we're stuck with the current key format and can't change it. With CIDs, we can introduce new formats unambiguously (because the CID itself states the key format).

The peer ID is the hash of an arbitrary vector of bytes. Right now this underlying vector of bytes is a protobuf struct, but could be changed to be anything.

tomaka commented 5 years ago

KAD re-hashes everything anyways using sha2-256.

What do you mean "re-hashes"? My concern is: if I perform a Kademlia FIND_NODE RPC query, and a remote send back a result containing SHA512 multihash, what is our local node supposed to do? Extend itself to 512 kbuckets?

Stebalien commented 5 years ago

What do you mean "re-hashes"?

We treat peer IDs as opaque keys and always hash them with sha256 (like we do with other keys). Basically, we make no assumptions about the peer ID format when figuring out where a peer ID fits in the DHT keyspace.

As far as I understand, libp2p is supposed to be purely a library. Things built on top of libp2p can decide whether to a multibase and convert that to a multihash before passing to libp2p.

Libp2p is a library but it's also an interoperable network protocol. Peer IDs need interoperable, human-readable forms just like IP addresses, MAC addresses, etc.

The peer ID is the hash of an arbitrary vector of bytes. Right now this underlying vector of bytes is a protobuf struct, but could be changed to be anything.

You'd have to somehow communicate how the this "arbitrary byte vector" should be decoded and how it should be understood as a public key. We currently use the codec in CIDs to communicate this kind of information.

tomaka commented 5 years ago

Libp2p is a library but it's also an interoperable network protocol. Peer IDs need interoperable, human-readable forms just like IP addresses, MAC addresses, etc.

But why do they need need a human-readable form, since they are only ever transmitted as bytes over the network?

I'm not saying their human-readable form shouldn't be defined. I'm saying that the network protocol shouldn't be changed because of out-of-scope human-readable reasons.

tomaka commented 5 years ago

We treat peer IDs as opaque keys and always hash them with sha256 (like we do with other keys).

Does that mean that if a remote says that its public key should be hashed with sha512, then we hash it with sha256 anyway for DHT insertion purposes?

But then, if we receive a Kademlia RPC query, do we answer with sha256 hashes, or do we answer with sha512 hashes? Or both (the protocol can't do both at the moment)? If we answer with the sha256 hash, then we have the problem of multiple peer IDs pointing to the same thing. If we answer with the sha512 hash, then the remote has to connect to the peer first in order to determine its public key and hash it as sha256. If we answer with both, we need to modify the Kademlia protocol, and also the remote can't verify that the hashes actually correspond to the same public key.

ghost commented 5 years ago

Hey, can't quite pull up all the context to this at the moment, but my intuition in the past has been to make PeerIDs a subset of CIDs (so that the meaning of "CID" would change to "Cryptographic IDentifier"), and to do much of the same CIDv0/CIDv1 dance we did in content-land. It'd avoid duplication of concepts/protocols, and it'd be a migration that we already have some experience in.

It would also be super useful to do the same base32 thing that we're doing with content CIDs, so we can seamlessly use the peer CIDs in URLs.

I sense the discussion above is already going into that direction, and I can't add anything new to that at the moment, so I'll leave it at that :)

But why do they need need a human-readable form, since they are only ever transmitted as bytes over the network?

They get put into multiaddrs all the time (/p2p/<peerid> or legacy /ipfs/<peerid>), and any multiaddr protocol must have a human-readable representation. Addresses get logged in libp2p itself, and blessed libp2p tools have UIs as well (e.g. daemon, testing tools, etc.).

tomaka commented 5 years ago

They get put into multiaddrs all the time (/p2p/ or legacy /ipfs/), and any multiaddr protocol must have a human-readable representation.

That's just moving the problem. Multiaddresses are also only ever transmitted as binary, and thus don't need to have their human-readable version defined as part of the network protocol.

I'm extremely confused as to why it's a good idea to make machine A decide in which format a peer id should be displayed on machine B, and not just let machine B decide.

Addresses get logged in libp2p itself, and blessed libp2p tools have UIs as well (e.g. daemon, testing tools, etc.).

That's not related to the wire protocol, which is what this is all about.

whyrusleeping commented 5 years ago

Some notes from skimming this:

@tomaka

To me having a peer ID that is anything more than the hash of a public key would feel very over-engineered.

What about having the peerID actually be the public key? I think that its unarguable how useful that is. How would you go about representing that distinction over the wire?

tomaka commented 5 years ago

Kademlia hashes the binary encoded form of the peerID with sha256. The contents of the peer ID don't change this at all. The fact that peerIDs are currently the sha256 hash of the encoded public key has no effect on the fact that kademlia uses a sha256 hash. The keys here are just binary blobs.

Does that mean that you store both the peerID and the sha256 of the public key? I raised a few more questions in the comment above.

How would you go about representing that distinction over the wire?

I'm not sure I understand the question. We know whether we receive a peer ID or a public key based on the protocol. For example secio and identify send us a public key, while relay sends us peer IDs.

fragmentation in the human readable format for addresses smells bad to me. I think everyone agrees about how to write a human readable ipv4 address, Different systems don't render those differently (to the best of my knowledge). I'm not saying different systems can't choose to do that, they most certainly can, but I think that choosing to do so would cause more confusion to users.

I agree that everything should have the same human-readable representation, but I don't see how it is a good solution to let a remote node on the network choose what the representation of their key should be.

Stebalien commented 5 years ago

I'm not saying their human-readable form shouldn't be defined. I'm saying that the network protocol shouldn't be changed because of out-of-scope human-readable reasons.

I completely agree. That's why I listed three options:

  1. Option 1 solves the multibase issue. This is purely a display issue.
  2. Let's just forget about option 2. That was trying to partially switch to CIDs while punting the hard work till later. However, it provided little to none of the benefits of option 3 while introducing complexity not present in option 1.
  3. Option 3 solves the multibase issue and then "futureproofs" peer IDs: the codec in the CID gives us room to add entirely new key formats (not just new key algorithms using the same protobuf format). This isn't about display at this point.

Why bundle these questions together? Unfortunately, wire format changes tend to affect display and parsing.

  1. If we support multiple hash functions but punt on this, we can't adopt multibase unambiguously (because a base58 encoded multihash may conflict with a multibase prefix).
  2. If we go with option 1 and then decide that we'd like to switch to full CIDs somewhere down the road, we may run into cases where it's difficult to determine if something is a raw multihash or a CID. If we switch now, we know that the only raw multihashes are sha2-256 multihashes so this is pretty easy (we already have to do this in ipfs/ipld so it's nothing new).

I agree that everything should have the same human-readable representation, but I don't see how it is a good solution to let a remote node on the network choose what the representation of their key should be.

That was never the intention here. The key's creator needs to be able to choose the hash function used in the peer ID so they don't end up with multiple peer IDs (not for display, for code) however, they don't get to, e.g., decide on the base encoding the user decides to use.

tomaka commented 5 years ago

That was never the intention here. The key's creator needs to be able to choose the hash function used in the peer ID so they don't end up with multiple peer IDs (not for display, for code) however, they don't get to, e.g., decide on the base encoding the user decides to use.

I guess I misunderstood then. The CID specs describe a CID as having a multibase, which is what I was talking about all along: https://github.com/ipld/cid#how-does-it-work---protocol-description I guess you wouldn't have that multibase on the wire? (but then shouldn't it be called something else than a CID?)

tomaka commented 5 years ago

I would still love to get some clarifications on how Kademlia would work after allowing hashes other than SHA-256.

Stebalien commented 5 years ago

I guess you wouldn't have that multibase on the wire? (but then shouldn't it be called something else than a CID?)

Technically, it does. However, we don't actually do this anywhere (well, except for the CBOR format) because it's redundant (and I'm certainly not going to argue for it).

I would still love to get some clarifications on how Kademlia would work after allowing hashes other than SHA-256.

Kademlia uses sha-256 and only sha-256 for all keys. Given a peer ID ID = SOME_HASH(publicKey), we map this into kademlia's key-space with SHA256(ID).

Note: This does mean that SOME_HASH needs to be deterministic for a given key (otherwise the DHT position changes). That's why this proposal adds a HashFunction field to the public key datastructure.

whyrusleeping commented 5 years ago

@tomaka

We know whether we receive a peer ID or a public key based on the protocol. For example secio and identify send us a public key, while relay sends us peer IDs.

No, thats what i'm saying. In some cases, the peer ID is the public key. its not sent separately.

Also, on kademlia:

As @Stebalien says, no matter what kind of peer ID you have, kademlia will take the sha256 hash of it. Think of it like a normal hash table, where the peer ID is the 'key'. If i have a hash table, the hash table doesnt care if my key is a string of zeros, a blake2b hash, my name, or the first 32 byte of a picture of my cat. It's going to hash the key with its hash function, and place the value into the bucket (alongside the key itself). So whether or not a peerID is sha256, or blake2, or a raw public key, kademlia will apply a sha256 hash over the top of it in all cases.

tomaka commented 5 years ago

Kademlia uses sha-256 and only sha-256 for all keys. Given a peer ID ID = SOME_HASH(publicKey), we map this into kademlia's key-space with SHA256(ID).

I see, thanks.

tomaka commented 5 years ago

For what it's worth, my inner feeling loudly screams that using a CID is a very bad idea, but since this is purely a feeling I get from my general IT experience I'm going to stop arguing so that things can move forward.

tomaka commented 5 years ago

Another hopefully last small question I have is related to this:

I very much want to preserve public key inlining in the identity, as it is a key property for pusbub message signing to be efficient.

Since each node generates their own public key, and therefore decides which hashing algorithm is to be used, how would you guarantee that they're using the identity hash?

You can't modify the hashing algorithm in the public key protobuf, otherwise you would no longer have peer IDs unity.

And what about modifying the pubsub protocols instead, in order to transmit the protobuf of the public key instead of its hashed version (which to me would make more sense and be much less intrusive to libp2p as a whole)?

Stebalien commented 5 years ago

Since each node generates their own public key, and therefore decides which hashing algorithm is to be used, how would you guarantee that they're using the identity hash? You can't modify the hashing algorithm in the public key protobuf, otherwise you would no longer have peer IDs unity.

Nodes would have to choose the appropriate hash function when they generate their key (recording the choice in the key itself). Changing the hash function will, effectively, change the peer's identity.

And what about modifying the pubsub protocols instead, in order to transmit the protobuf of the public key instead of its hashed version (which to me would make more sense and be much less intrusive to libp2p as a whole)?

We currently include both but you're right, we could just include the key without the ID. Really, for signatures, we can pretty much always do this.

tomaka commented 5 years ago

Nodes would have to choose the appropriate hash function when they generate their key (recording the choice in the key itself). Changing the hash function will, effectively, change the peer's identity.

Again I'm not sure I understand the answer.

The main goal of this change is to be able to use the identity hashing for public keys and make signatures verification more straight forward. But nothing guarantees that nodes will choose identity and not whatever hash they want. If we enforce that nodes must always choose identity, then there is no point in making in flexible.

whyrusleeping commented 5 years ago

@tomaka identity hashing for keys only works when the key is small enough to embed in a peer ID. RSA keys, and other types of quantum resistant key algorithms are way too big to be embedded.

Stebalien commented 5 years ago

If we enforce that nodes must always choose identity, then there is no point in making in flexible.

We currently enforce this if the key is < 42 bytes. However, that's causing issues for open bazaar. The proposal here is to add an explicit "hash function" field to let the creator of the key choose.

A peer may choose not to use the identity hash function with their tiny key. That's up to them. If they don't, we'll have to do things the old way.

tomaka commented 5 years ago

However, that's causing issues for open bazaar.

I don't think I've seen any mention of what these issues exactly are. The only thing I see mentioned is that programmers were calling the wrong function and causing bugs, which to me is more an engineering issue than a specifications issue and not a valid reason to change the specs.

A peer may choose not to use the identity hash function with their tiny key. That's up to them. If they don't, we'll have to do things the old way.

Isn't that a concern? That a node's behaviour and performances depend on what potentially malicious remote node decides to do?

cpacia commented 5 years ago

Just a note for this thread. We don't currently have anything running in production that uses inline ed25519 keys. We use the sha256 hash of an ed25519 key. Inline keys are something we'd like to switch to as failure to find the public key is a common problem when trying to send messages to nodes that haven't been online in a while.

Stebalien commented 5 years ago

I don't think I've seen any mention of what these issues exactly are.

The issue was that the peer IDs changed under their feet.

Timeline:

  1. We added ed25519 support.
  2. OpenBazaar started using them.
  3. We started inlining keys <= 42 bytes (including ed25519 keys) into peer IDs.
  4. OpenBazaar tried upgrading libp2p and all their peer IDs changed (from sha256 hashes to "identity" hashes).

This proposal is to:

  1. Revert 3.
  2. Add an explicit flag in keys so nodes can opt-in (when they're first initialized) to inlining their key into their peer ID.

This also allows users to choose non-sha256/identity hash functions for their peer IDs.

Isn't that a concern? That a node's behaviour and performances depend on what potentially malicious remote node decides to do?

Not really. They could also just use a rsa8192 key.

vyzo commented 5 years ago

this shouldn't have been closed!

lidel commented 5 years ago

As a quick discussion checkpoint: what are the next steps/blockers?

(Personally, I am mainly interested in unblocking https://github.com/ipfs/go-ipfs/issues/5287)

Stebalien commented 5 years ago

Step 1 is that someone on both the JS team and the GO needs to own this. I've added it to the libp2p team week planning document.

On the go side, it may be a bit tricky due to the current go-libp2p-peer and go-libp2p-crypto dependency inversion. The key (from go-libp2p-crypto) will need to remember the hash function to generate the peer. However, all the logic for generating peer IDs lives in go-libp2p-peer. We'll probably have to move it to go-libp2p-crypto and then re-export it from go-libp2p-peer.

vyzo commented 5 years ago

I feel strongly about this, so I can own. I want thoz inlined keys!

tomaka commented 5 years ago

What's the opinion on my alternative suggestion, which is to modify floodsub/gossipsub to transmit the non-hashed public key instead?

vyzo commented 5 years ago

it already does that when the key is not inlined in the id.

tomaka commented 5 years ago

Oh I see: https://github.com/libp2p/go-libp2p-pubsub/commit/d2f7a664d2f748867eb0d1f7465419a4887c45cd#diff-9b6fbbdd6fa2676b0d87d60d96f4b556 I was not aware of this change, and it looks like the JS version isn't up-to-date: https://github.com/libp2p/js-libp2p-floodsub/blob/355ab300cd3b5765b46fd17f96eb81187c0ba7e6/src/message/rpc.proto.js.

Why is inlining in the PeerId necessary, then?

vyzo commented 5 years ago

it's not necessary, it is desirable!

tomaka commented 5 years ago

I'm still extremely confused. This change strongly appears to me like an XY problem, and I cannot figure out why this change is wanted at all. All along I thought that it was because there was no way of verifying the signature of a pubsub message, but now that you point out to me that this is not the reason, I can't see why.

vyzo commented 5 years ago

we would like to avoid the overhead of having to transmit the key.

tomaka commented 5 years ago

Why not transmit the key and remove the peer ID from the message instead? It can easily be recomputed from the key, and the overhead would be the same as inlining the key in the ID.

vyzo commented 5 years ago

that's a backwards incompatible change, we can't do that.

tomaka commented 5 years ago

You can have a new version of floodsub and support both the old and new versions for a while. This is the similar to having newer nodes sending you the inline key and old nodes sending you the hashed key.

Stebalien commented 5 years ago

IMO, the original motivations for embedding the key in the peer ID has now mostly been obsoleted by simply embedding the key in the record (which I believe is the "more correct" solution). We can make the peer ID optional and eventually stop including it in favor of simply deriving it from the key. However, we should still support alternative hash functions for peer IDs:

  1. If we ever need to encrypt to a peer without a session (e.g., in a high-latency packet protocol), we may need the key up-front. In that case, we can save a round-trip if we can extract it from the peer ID.
  2. We may want to eventually move away from sha256 for either security or performance reasons. That means we need to somehow support alternative hashing algorithms.