add certhash - Githubissues

marten-seemann commented 2 years ago

needed for WebTransport (and potentially WebRTC)

marten-seemann commented 2 years ago

Is "certificatehash" a more appropriate/common name?

That's certainly the case, but it's also longer. Not sure how this should factor into the decision. @aschmahmann @Stebalien any opinions?

lidel commented 2 years ago

WebTransport addrs will already be pretty long, and iiuc there will be more than one certificate hash to facilitate switching to a new hash over time. So.. maybe we could go.. shorter? :trollface:

Since we have:

/tls for Transport Layer Security

we could go with shorter (and generic)

/tlh or /tlch Transport Layer (Cert) Hash?
/wth What The coughs Web Transport Hash?

Stebalien commented 2 years ago

Eh, I don't think the length matters that much. In terms of name, certhash is fine. It's also called a digest or a fingerprint. IMO, fingerprint may make it clear that this is a hash of a certificate/key.

MarcoPolo commented 2 years ago

so .. I went googling because I hadn't heard of "certhash" and go-multihash is the top result for webtransport "certhash".

Seems like a good thing! Search results will point to this rather than give a generic description.

marten-seemann commented 2 years ago

Thanks for all the suggestions! I like fingerprint because that’s the term usually used for certificate hashes, but curiously, it’s not used by WebTransport. I also see the Marco’s SEO argument for certhash. I’ll just merge this with certhash, as we do need to make a decision now.

mxinden commented 2 years ago

Following up on a discussion in https://github.com/multiformats/rust-multiaddr/pull/59

In regards to the discussion on whether a multihash makes sense here, I would argue yes it does due to:

WebRTC supports different hashing algorithms. With multihash we can support the full width of offerings by WebRTC.
There might be the need at some point to change to a different hashing algorithm. Wrapping the hash blob in a multihash makes this easy in the future.

See also recent discussion on the specs pull request https://github.com/libp2p/specs/pull/412#discussion_r917582723

In regards to the discussion on whether a multibase makes sense here, I would argue yes it does, though I would argue less strongly:

Say we ever want to change the base (e.g. due to some weird requirements say of a browser URL), changing it later on is easy.
I find the overhead to be small. All implementations need a multibase implementation anyways, thus it is easy to integrate. I doubt this has any performance impact.
I am very much in favor of being consistent on the base across implementations for now to prevent the above mentioned re-encoding issue.

The reason for using a multibase is the same as for any multi-thing: "We're probably wrong today, but at least we keep our way open to be right tomorrow." If there's no need to decide for one blessed way of doing things, why would we pick one?

I think this is a sane mindset and a good reasoning for going with multihash and multibase. Though I think we should be consistent on the base across implementations for now.

marten-seemann commented 2 years ago

Though I think we should be consistent on the base across implementations for now.

I don't see any advantage to that (and maybe even a slight disadvantage, exercising the flexibility we have does have value). On the wire, it's all bytes anyway. And where we share strings, they'll get passed through a multibase decoder.

mriise commented 2 years ago

I think there really isn't much choice since (for the near to mid future) certhashes will be mostly used for browser comms and will probably just be base64url. Spec-wise it should still be defined multibase + multihash since we really dont want to code/spec ourselves into a corner.

tomaka commented 2 years ago

I'm going to detail my point of view in this comment. I'm explaining in depth what I'm thinking, because I've realized over time that there are some obvious cultural divergences between me and Protocol Labs, in the sense that things that I see as straight forward aren't necessarily for other people and vice versa.

Multiaddresses aren't actually composable

(this is just a preamble, and I think everybody will agree to the content of this section)

As we all know, a multiaddr is a list of protocols, such as /ip4/..., /ws, /quic, etc.

It is easy to think of these protocols are being wrapped inside of each other, for example an IP header followed with a TCP header followed with a WebSocket frame. However, this doesn't represent the truth.

A few examples:

When connecting to /ip4/1.2.3.4/tcp/80/ws, you leave the Host HTTP header empty. But when connecting to /dns/example.com/tcp/80/ws, you include Host: example.com in the HTTP headers.
Similarly, when using /dns/example.com/tcp/443/tls/ws, you must include example.com in the TLS handshake (see https://en.wikipedia.org/wiki/Server_Name_Indication).
/quic also uses TLS, but it isn't explicited in QUIC multiaddresses.
/webrtc should actually be /ip4/.../udp/.../dtls/sctp/0.

Additionally, the protocol stacks that multiaddresses describe are specific to libp2p. This is particularly obvious for /webrtc and /webtransport, as we have libp2p-specific extensions to these protocols, but it is also the case for example for /ws where text frames are forbidden.

To me, and probably to everyone, a multiaddr can be seen as having the same problem as the OSI model: it doesn't represent the truth, but it is close enough to be a useful mental model.

How to choose multiaddr representations

If a multiaddress doesn't represent the truth, then which criteria should we use to decide on the format of multiaddresses?

I believe, and I think most people will agree, that a good multiaddress is a multiaddress that is reasonably intuitive for a human being to understand when reading it. Similarly, a good multiaddress should be easy to write for human beings. When for example you want to connect through WebSocket to 1.2.3.4, it should be easy to write /ip4/1.2.3.4/tcp/80/ws without hesitating too much.

I personally don't see any other rational criteria for choosing multiaddresses formats than its readability/writability by human beings. In particular, given that multiaddresses aren't remotely close to the way protocols actually work, I don't think that a certain multiaddress format can be considered as more easy to implement than another.

Is /certhash intuitive?

The /certhash protocol in my opinion performs very poorly in this regard.

How intuitive something is is obviously subjective, but I don't see how /tls/certhash/<hash> is more intuitive than, say, /tls/<hash>.

If multiaddresses are supposed to convey an idea of stack of protocols, a bit like the OSI model does, then I fail to see what /certhash represents. As far as I can tell, /certhash is the first time in the multiaddr format that the similarity to a stack of protocols is violated.

Is using a multibase+multihash intuitive?

Let's put ourselves in the shoes of a devops that is configuring some kind of software that uses libp2p. This devops person obtains a certificate from a certificate provider, puts it on the machine, and configures the libp2p software to listen on WebRTC using this certificate. Do you think that this devops would be able to easily write down what the multiaddr of this newly-configured server is? They would need to turn the certificate into a multihash+multibase, a pretty complicated operation. Similarly, if a devops person sees /ip4/.../webrtc/certhash/<multibase+hash>, is this person able to verify if this address is indeed correct?

On the other hand, if the hash was the SHA256 hash, the devops could use the built-in UNIX utility sha256sum and put the output in the multiaddr. There is obviously some knowledge that the devops needs to have over multiaddresses, but the best we can do is make their life easier.

Additionally, using a multibase has the consequence that comparing multiaddresses is no longer as simple as comparing their text output. You now need to actually parse the multibase, and check whether their payloads compare equal. This is quite a big complication.

You might say that preferring a format over another because how well a devops is able to read/write it or how one can compare them is a pretty weak argument, but this is IMO literally the only reasonable argument. If the world was dominated by multiformats, maybe it wouldn't be an argument. But the world isn't dominated by multiformats and, let's face it, will never be.

One could also argue that multiaddress formats should be coherent in order to be intuitive, and I agree with that. However there is no precedent (apart from the very new /webtransport) in putting hashes in a multiaddress. So far, the only multiaddresses used in production as far as I know, are plain TCP, WebSocket, and QUIC. We only need be consistent with these three kinds of multiaddresses.

Counter-argument: extensibility

The argument in favor of using a multibase and multihash is "extensibility".

This PR proposes that we add /certhash/<multihash>, knowing that at the moment in practice the <multihash> will always use SHA256.

But there is a pretty obvious other way to extend multiaddresses: adding new protocols.

Rather than add /certhash/<multibase-prefix><multihash-prefix-for-sha256><sha256-hash>, we could alternatively add /certhash-sha256/<sha256-hash>. And if there's the need for SHA512 certificates in the future, instead of adding /certhash/<multibase-prefix><multihash-prefix-for-sha512><sha512-hash> we could add /certhash-sha512/<sha512-hash>.

I understand that using a multihash as opposed to multiple protocols has some small consequences on the code paths in the implementation. However, it would be a mistake to try implement /certhash in a generic way. As explained in the first paragraph, /certhash can only be understood in the context of another protocol such as /webrtc or /webprotocol, and in the context of /webrtc it can only contain a SHA256 hash at the moment. If the WebRTC specification allowed an alternative hash algorithm to be used, then support for this specific hash algorithm would need to be explicitly implemented. In this regard, there is no difference between a multihash and multiple different protocols.

Furthermore, I'd like to claim that leaving the list of hash algorithms extensible is a weakness in the specification. It is incomplete for an implementation to claim to support /webrtc/certhash. In order to be honest, an implementation has to mention that it supports ̀"/webrtc/certhash with SHA256 hashes". Supposing that SHA512 hashes became allowed, it would not be possible to pass a multiaddress containing a SHA512 certificate hash before the implementation has explicitly added support for it. Having a generic multihash as the payload of /certhash is a deliberate refusal to put into the specification something (the list of supported algorithms) that technically must be part of the specifications in order to guaranteed interoperability.

However, importantly, what I want to highlight in this section is not that I prefer /certhash-sha256 over /certhash, but how arbitrary this choice is. To me, there is no advantage of choosing one over another. The only reasonable criteria is how intuitive to write a multiaddress as a whole.

What I would propose

What I would propose for WebRTC is: /webrtc-direct/<sha256-hash>.

Something like /webrtc-direct{fingerprint=...} would IMO be even better, but I'd rather not go in the debate of having parameters.

It is reasonably simple to read and construct, and conveys the idea. It doesn't have any of the drawbacks expressed so far, or at least minimizes them.

Why /webrtc-direct instead of /webrtc? Because we know that we will add alternative WebRTC protocols in the future. I like the fact that /webrtc-direct conveys the idea that this is one opinionated way of connecting through WebRTC, rather than the WebRTC protocol.

I understand that there's a lot of arbitrary in choosing a protocol name, but the important points I want to highlight in this comment is that I think that /certhash is not a good multiaddr format, and that using a multibase+multihash is a misguided decision.

A side note about the vision

I think that what I describe in this comment is a relatively coherent way to design multiaddresses, as opposed to the current way of doing things.

Correct me if I'm wrong, but there really isn't any guiding document regarding multiaddresses. It seems that everybody just adds protocols randomly (choosing an arbitrary protocol number), and there's no argument in favor of specifying protocols in a certain way and not another.

When asked why use a multihash and a multibase, the answer is "well, you never know, for the future". Maybe if there was some kind of vision expressed somewhere we would know if there might be a need for the future. It seems that nobody knows where they're going, and the consequence is a reluctance to close any door. No, that's not the solution. The solution is to express a vision.

mxinden commented 1 year ago

Multiaddresses aren't actually composable

It is easy to think of these protocols are being wrapped inside of each other, for example an IP header followed with a TCP header followed with a WebSocket frame. However, this doesn't represent the truth.

To me, and probably to everyone, a multiaddr can be seen as having the same problem as the OSI model: it doesn't represent the truth, but it is close enough to be a useful mental model.

Agreed with the two points above. It is a good mental model though not consistent across all protocol combinations.

/webrtc should actually be /ip4/.../udp/.../dtls/sctp/0.

Or even /ip4/.../udp/1234/stun given that the remote will speak STUN to this endpoint as first.

How to choose multiaddr representations

I believe, and I think most people will agree, that a good multiaddress is a multiaddress that is reasonably intuitive for a human being to understand when reading it.

Similarly, a good multiaddress should be easy to write for human beings. When for example you want to connect through WebSocket to 1.2.3.4, it should be easy to write /ip4/1.2.3.4/tcp/80/ws without hesitating too much.

Agreed. multiaddresses should be easy to read and write.

I personally don't see any other rational criteria for choosing multiaddresses formats than its readability/writability by human beings.

For me there is also:

Multiaddress is a single data container / format to describe a wide variety of potentially stacked protocols.
At the machine level multiaddresses don't allow for ambiguity (e.g. good old "is this a TCP or UDP port"). Or in other words they don't require context in order to be interpreted correctly.
The multiaddress format can be extended in the future.

As an aside, I am not familiar with any alternatives. Pointers appreciated.

Is /certhash intuitive?

How intuitive something is is obviously subjective, but I don't see how /tls/certhash/ is more intuitive than, say, /tls/.

If I am not mistaken it is not about being intuitive, but about being able to specify multiple certificate hashes in a single multiaddress, allowing to renew an endpoints certificate without downtime. This is needed in WebTransport.

Quote from the WebTransport draft specification:

Certificates

Since most libp2p nodes don't possess a TLS certificate signed by a Certificate Authority, servers use a self-signed certificates. According to the w3c WebTransport certification, the validity of the certificate MUST be at most 14 days, and must not use an RSA key. Nodes then include the hash of one (or more) certificates in their multiaddr (see Addressing).

Servers need to take care to regularly renew their certificate. In the following, the RECOMMENDED logic for rolling certificates is described. At first boot of the node, it creates a self-signed certificate with a validity of 14 days, and publishes a multiaddr containing the hash of that certificate. After 10 days, the node prepares the next certificate, setting the NotBefore date of that certificate to the expiration date (or shortly before that) of the first certificate, and an expiration of 14 days after that. The node continues using the old certificate until the expiry date, but it already advertises a multiaddr containing both certificate hashes. This way, clients will be able to connect to the node, even if they cache the multiaddr for multiple days.

https://github.com/libp2p/specs/pull/404/

Does that make sense @tomaka? Happy for alternative suggestions on how to support the renewal process in WebTransport.

Is using a multibase+multihash intuitive?

This devops person obtains a certificate from a certificate provider, puts it on the machine, and configures the libp2p software to listen on WebRTC using this certificate. Do you think that this devops would be able to easily write down what the multiaddr of this newly-configured server is? They would need to turn the certificate into a multihash+multibase, a pretty complicated operation.

Agreed that this is tedious. I would expect the implementation to print the listening address including the /certhash on standard-out. This output can then be copied by the operator.

Similarly, if a devops person sees /ip4/.../webrtc/certhash/<multibase+hash>, is this person able to verify if this address is indeed correct?

One would need to write a small tool for this. Not ideal in my eyes, though I would argue worth the benefit of being able to easily migrate away from a decision we make today.

You might say that preferring a format over another because how well a devops is able to read/write it or how one can compare them is a pretty weak argument, but this is IMO literally the only reasonable argument.

For the sake of completeness, as listed above, I do see more benefits for the multiaddress format beyond being intuitive to read and write.

One could also argue that multiaddress formats should be coherent in order to be intuitive, and I agree with that. However there is no precedent (apart from the very new /webtransport) in putting hashes in a multiaddress.

If I am not mistaken the new peer ID format using a CID does involve both a multihash and a multibase.

https://github.com/libp2p/specs/blob/master/RFC/0001-text-peerid-cid.md

I have not been involved in this design, thus I might be missing something here.

Counter-argument: extensibility

It is incomplete for an implementation to claim to support /webrtc/certhash. In order to be honest, an implementation has to mention that it supports ̀"/webrtc/certhash with SHA256 hashes".

Correct. The same applies e.g. to the support of peer IDs with the optional support for ECDSA.

https://github.com/libp2p/specs/blob/master/peer-ids/peer-ids.md

Now this in itself is not an argument. (Two wrongs don’t make a right.) I would argue that this is not a big issue. In case we do decide to support an additional hash function, one would first need to roll out the reading and later on the writing side on a live network.

What I would propose

What I would propose for WebRTC is: /webrtc-direct/.

With the above mention of WebTransport needing multiple certificates in the same multiaddress, would you stick with your proposal @tomaka?

Why /webrtc-direct instead of /webrtc? Because we know that we will add alternative WebRTC protocols in the future. I like the fact that /webrtc-direct conveys the idea that this is one opinionated way of connecting through WebRTC, rather than the WebRTC protocol.

I don't have a strong opinion here. I moved the discussion to the WebRTC specification pull request. I hope you don’t mind.

https://github.com/libp2p/specs/pull/412/files#r926205076

Maybe if there was some kind of vision expressed somewhere we would know if there might be a need for the future.

I don't think the two are mutually exclusive. I.e. we can have both a vision and prepare for the fact that this vision will most likely be (partially) wrong in the future.

Happy to give feedback on a vision document, though I can not commit to actually writing one.

tomaka commented 1 year ago

Does that make sense @tomaka? Happy for alternative suggestions on how to support the renewal process in WebTransport.

The alternative suggestion that immediately comes to mind is: why not advertise two (or more) multiaddresses, one for each certificate? It would make it possible to attach to attach to each multiaddress the expiration data of the corresponding certificate. It would also avoid issues such as "how long is the maximum number of /certhashes in a row allowed?", which is also something that this specification should normally mention in order to be complete.

Agreed that this is tedious. I would expect the implementation to print the listening address including the /certhash on standard-out. This output can then be copied by the operator.

Asking the operator to run an implementation is also not great. It means that the operator cannot write a script that generates a certificate and obtains the corresponding multiaddress, which is a problem if you want to automate the deployment of nodes. You could add a tool that computes the multibase+multihash of the certificate, but again we're comparing this with using a well-known built-in Unix utility.

If I am not mistaken the new peer ID format using a CID does involve both a multihash and a multibase.

I have forgotten to include this, because to me a peer ID isn't part of a multiaddress, but this is an off-topic question. The peer ID is actually a good example of how tedious it is for node operators to determine the multiaddress of a node. We have first hand production experience in Parity, actual feedback, about the fact that it is very difficult to compute a peer ID ahead of time or outside of running a client implementation. Let's listen to that feedback and not repeat the same mistakes.

For the sake of completeness, as listed above, I do see more benefits for the multiaddress format beyond being intuitive to read and write.

None of the benefits that you list above are relevant to the discussion about whether /certhash or multibase+multihash are well designed, except for the extensibility.

All your arguments against what I'm saying all boil down to "we use a multibase+multihash because we want to make multiaddresses extensible", but at the same time you don't answer the points in my "Counter-argument: extensibility" section that explain that /certhash and multibase+multihash don't make multiaddresses more extensible.

I don't think the two are mutually exclusive. I.e. we can have both a vision and prepare for the fact that this vision will most likely be (partially) wrong in the future.

I don't think we have the same definition of "vision". To me, the entire point of a vision is to lay down the known knows and the known unknowns, in order to be able to know what is set is stone and what isn't. "Wrong" is neither a "known" nor an "unknown", it is a third category. A vision can be wrong, but a wrong vision should be seen as a systemic failure, and the potential of a wrong vision is not a reason to program defensively.

mxinden commented 1 year ago

Does that make sense @tomaka? Happy for alternative suggestions on how to support the renewal process in WebTransport.

The alternative suggestion that immediately comes to mind is: why not advertise two (or more) multiaddresses, one for each certificate? It would make it possible to attach to attach to each multiaddress the expiration data of the corresponding certificate. It would also avoid issues such as "how long is the maximum number of /certhashes in a row allowed?", which is also something that this specification should normally mention in order to be complete.

I will let @marten-seemann comment on this as the WebTransport expert.

For the sake of completeness, as listed above, I do see more benefits for the multiaddress format beyond being intuitive to read and write.

None of the benefits that you list above are relevant to the discussion about whether /certhash or multibase+multihash are well designed, except for the extensibility.

Correct. Extensibility is the benefit I see in favor of multibase+multihash.

All your arguments against what I'm saying all boil down to "we use a multibase+multihash because we want to make multiaddresses extensible", but at the same time you don't answer the points in my "Counter-argument: extensibility" section that explain that /certhash and multibase+multihash don't make multiaddresses more extensible.

Would you mind enumerating the points I did not address?

In regards to /certhash-sha256 over /certhash, I see the simplification for operators using /certhash-sha256, though still I would argue in favor of /certhash with multibase+multihash for the sake of consistency across the multiformats ecosystem.

In regards to eventually rolling out support for a hash algorithm beyond sha256, I agree that we have to touch the code either way, still I would argue that using multibase+multihash requires us to touch less code.

I don't think the two are mutually exclusive. I.e. we can have both a vision and prepare for the fact that this vision will most likely be (partially) wrong in the future.

I don't think we have the same definition of "vision". To me, the entire point of a vision is to lay down the known knows and the known unknowns, in order to be able to know what is set is stone and what isn't. "Wrong" is neither a "known" nor an "unknown", it is a third category. A vision can be wrong, but a wrong vision should be seen as a systemic failure, and the potential of a wrong vision is not a reason to program defensively.

Unfortunately I agree that we diverge here. That said, I am optimistic that we can find common ground.

marten-seemann commented 1 year ago

The alternative suggestion that immediately comes to mind is: why not advertise two (or more) multiaddresses, one for each certificate? It would make it possible to attach to attach to each multiaddress the expiration data of the corresponding certificate.

The serverCertificateHashes API allows you to pass multiple certificate hashes to the browser, and the browser will then accept the certificate if it matches any of those hashes.

If you had multiple WebTransport multiaddrs, you'd either

have to complete 2 QUIC handshakes, just to see which one succeeds. This is a non-starter.
encode the expiry date into the multiaddr, as you suggest. First of all, I'm not sure if that qualifies as an intuitive solution, but this is likely a matter of taste. Furthermore, it does require quite a bit more logic on the client side. I'm having a hard time wrapping my head around how a reasonable happy-eyeballs ranking of addresses would look like if we introduce this additional error condition, where the transport filters out outdated certificates.

It would also avoid issues such as "how long is the maximum number of /certhashes in a row allowed?", which is also something that this specification should normally mention in order to be complete.

That's actually a feature, not a bug. How many certhashes you include depends on how long you expect an address you once advertised to be cached by potential clients. With 2 certhashes, your address can be cached for (at least) 14 days. With 3 certhashes, you're at (at least) 28 days, etc. For most use cases, 14 days is probably plenty, but it's nice to have the option to extend that period.

tomaka commented 1 year ago

encode the expiry date into the multiaddr, as you suggest.

I don't suggest to encode the expiry date in the multiaddr. I mention that this gives the option to other protocols (similar to Kademlia) to provide an expiry date alongside with a multiaddr, as some kind of metadata. After all, all multiaddresses can become obsolete at one point or another, this is not specific to certificates. The only thing that certificates have in particular is that their expiration date/time is known in advance.

If you had multiple WebTransport multiaddrs, you'd either

Option 3: examine all the multiaddresses known for a peer before trying to connect to it, and pass to the serviceCertificateHashes API all the certificates that are found for this peer in its multitude of multiaddresses.

That's actually a feature, not a bug.

Sorry but I really don't understand this attitude.

The entire point of a specification is to guarantee interoperability. You can't have an implementation accept multiaddresses with a maximum of 3 certificates and other implementations accept multiaddresses with a maximum of 5 certificates, as otherwise a multiaddress with 4 certificates will work on some implementations and not others. The entire point of a specification is that this limit must be known.

marten-seemann commented 1 year ago

I don't suggest to encode the expiry date in the multiaddr. I mention that this gives the option to other protocols (similar to Kademlia) to provide an expiry date alongside with a multiaddr, as some kind of metadata. After all, all multiaddresses can become obsolete at one point or another, this is not specific to certificates. The only thing that certificates have in particular is that their expiration date/time is known in advance.

Attaching a TTL to particular addresses would require a major overhaul of how we advertise addresses, at all places where we advertise addresses (Kademlia, Gossipsub, Rendezvous, mDNS, just to name a few).

The entire point of a specification is to guarantee interoperability. You can't have an implementation accept multiaddresses with a maximum of 3 certificates and other implementations accept multiaddresses with a maximum of 5 certificates, as otherwise a multiaddress with 4 certificates will work on some implementations and not others. The entire point of a specification is that this limit must be known.

I don't see any reason to impose any limit at all. There's no security risk here, there's no way an attacker can exhaustively list all possible hashes. There's also no (additional) DoS risk here, an attacker can already send arbitrarily long multiaddrs.

I'd be happy to add this as a requirement to the spec, something along those lines: "Multiaddress parsers MUST NOT limit the number of certhashes that they accept in a multiaddr". A PR would be appreciated.

tomaka commented 1 year ago

Attaching a TTL to particular addresses would require a major overhaul of how we advertise addresses, at all places where we advertise addresses (Kademlia, Gossipsub, Rendezvous, mDNS, just to name a few).

I'm not suggesting that we overhaul everything. There is no expiration date attached anywhere at the moment, therefore there is no overhaul needed.

The context is comparing the solution of one multiaddress with multiple certificates (without any expiration date attached) with the solution of multiple multiaddresses with one certificate each (without any expiration date attached).

My point is that in the future if we were to attach expiration dates, it would make more sense to attach them to individual multiaddresses, because it's not just certificates that can expire but multiaddresses as a whole.

I don't see any reason to impose any limit at all.

Because computers don't have unlimited memory.

an attacker can already send arbitrarily long multiaddrs.

That seems like something that needs to be fixed.

mxinden commented 1 year ago

Adding my perspective on the above discussion whether to impose a limit on the number of /certhash components in a multiaddress:

I don't think we should impose a limit on the number of /certhash components in a single multiaddress. Instead I think we should impose limits at the higher level, i.e. impose a limit on the size of a single multiaddress or a limit on a set of multiaddresses at the level of the protocol exchanging these multiaddresses, e.g. in the identify protocol. Having a single limit at a higher level seems simpler to me compared to many limits at lower levels.

melekes commented 1 year ago

If the WebRTC specification allowed an alternative hash algorithm to be used, then support for this specific hash algorithm would need to be explicitly implemented.

what about https://datatracker.ietf.org/doc/html/rfc8122#section-5?

hash-func              =  "sha-1" / "sha-224" / "sha-256" /
                             "sha-384" / "sha-512" /
                             "md5" / "md2" / token
                             ; Additional hash functions can only come
                             ; from updates to [RFC 3279](https://datatracker.ietf.org/doc/html/rfc3279)

tomaka commented 1 year ago

what about https://datatracker.ietf.org/doc/html/rfc8122#section-5?

It then mentions that "md2" and "md5" must not be supported, and that "sha-256" is preferred. I personally don't think we should support "sha-1" as it's deprecated. Which comes back to my point: we should be explicitly specifying the hash algorithm or list of hash algorithms that can be used in multiaddresses. I'd go for supporting only sha-256. ...which (to repeat my point) means that using a multihash is unnecessary and adds complexity for no advantage.

melekes commented 1 year ago

Outsider's opinion: @tomaka have raised some very good points 👍 with regards to multiaddresses, certhash and multibase/multihash.

The fact that certhash has to support N hashes and many different formats seem to clash with webrtc requirements where we only need 1 hash and it can be "sha-256" (an option to upgrade is nice imho, although it may be safe to assume? sha-256 will be a default choice for years to come).

I'd also go for /webrtc-direct both because of @tomaka's point and potential new attack vectors found by @ckousik.

melekes commented 1 year ago

fyi: webrtc-rs lib only supports sha-256 and rejects any other hash functions right now.

mxinden commented 1 year ago

I think the recent comments above miss the option of sha-512, right?

https://github.com/multiformats/multiaddr/pull/130#issuecomment-1186540999 suggests the option of introducing /certhash-sh512 in case we ever need it.

I understand that using /certhash-sha256 and /certhash-sha512 makes reading a multiaddress easier. Still I am in favor of /certhash/<multibase><multihash> given its consistency with the multiformats stack and, in my eyes, the easier upgrade path in case we ever want to use other bases and hashes.

Which comes back to my point: we should be explicitly specifying the hash algorithm or list of hash algorithms that can be used in multiaddresses. I'd go for supporting only sha-256.

Instead of specifying the supported hash algorithms in the multiaddress specification, how about explicitly limiting the supported hash algorithms in the webrtc specification (https://github.com/libp2p/specs/pull/412)? E.g. something along the lines of:

libp2p WebRTC implementations MUST NOT use any hash algorithm other than sha-256. Support for other hash algorithms MAY be added in the future by amending this specification through the libp2p specification process.

In case we ever want to support sha512 we would update the WebRTC specification. Such update would need to consider the many implementations and would need to detail an upgrade path for existing networks.

I'd also go for /webrtc-direct both because of @tomaka's point and potential new attack vectors found by @ckousik.

@melekes can you expand on these attack vectors. I am blanking on this, sorry.

tomaka commented 1 year ago

I think the recent comments above miss the option of sha-512, right?

Just because the option exists doesn't mean that we should support it. The reason why we're talking so much about sha-256 is that it is the default algorithm that every implementation of WebRTC must support (see section 5.1 of RFC 8122).

in my eyes

I'm trying to give precise reasons in these comments, rather than just "in my eyes".

https://github.com/multiformats/multiaddr/pull/130#issuecomment-1186540999 suggests the option of introducing /certhash-sh512 in case we ever need it.

As I mention in the same comment, I'm not advocating for /certhash-sha256 and /certhash-sha512. I was explaining why a multibase+multihash do not provide any advantage over different multiaddr protocols.

What I'm advocating for is something like /webrtc-direct/<sha256-hash>, where webrtc-direct directly represents the WebRTC libp2p specification. There is no need to ever want to upgrade to sha512 unless there is suspicion that sha256 might be flawed, which is likely not happening tomorrow, and in which case we can introduce /webrtc-direct-v2 (similar to how there's ipv4 and ipv6).

Instead of specifying the supported hash algorithms in the multiaddress specification, how about explicitly limiting the supported hash algorithms in the webrtc specification (https://github.com/libp2p/specs/pull/412)? E.g. something along the lines of:

This is a good thing to do, but what I'm advocating for is making multiaddresses less confusing.

Such update would need to consider the many implementations and would need to detail an upgrade path for existing networks.

I don't know whether you're presenting this as a negative point, but to me this is a positive point. We want precise and stable protocols with as few corner cases as possible and that are upgraded carefully and infrequently.

melekes commented 1 year ago

@melekes can you expand on these attack vectors. I am blanking on this, sorry.

see https://github.com/libp2p/specs/pull/412#issuecomment-1168970508

There is no need to ever want to upgrade to sha512 unless there is suspicion that sha256 might be flawed, which is likely not happening tomorrow, and in which case we can introduce /webrtc-direct-v2 (similar to how there's ipv4 and ipv6).

good analogy 👍

mxinden commented 1 year ago

Such update would need to consider the many implementations and would need to detail an upgrade path for existing networks.

I don't know whether you're presenting this as a negative point, but to me this is a positive point. We want precise and stable protocols with as few corner cases as possible and that are upgraded carefully and infrequently.

As a positive point. Rephrased: I think it is a good thing to force such change, e.g. sha-512 support, to go through the libp2p specification process, thus keeping the many implementors from the many implementations in the loop and making us think about a good upgrade path.

mxinden commented 1 year ago

https://github.com/libp2p/specs/pull/412/commits/7009f943d5b9ee67cb7a19bbf8ca60549c1ad683 adds the above suggestion, namely to specify sha256 and base64url as the minimum hash algorithm and base encoding.

@melekes can you expand on these attack vectors. I am blanking on this, sorry.

see libp2p/specs#412 (comment)

I don't understand how the hash algorithm and base encoding relates to these attacks. Can you expand on that @melekes?

marten-seemann commented 1 year ago

https://github.com/libp2p/specs/commit/7009f943d5b9ee67cb7a19bbf8ca60549c1ad683 adds the above suggestion, namely to specify sha256 and base64url as the minimum hash algorithm and base encoding.

This seems like overspecifying things. Is there any precedent in our stack where we say "you need to support decoding this multihash and that base encoding?

mxinden commented 1 year ago

I am not aware of a case for multibase and multihash. Though related is the peer ID specification requiring ed25519 for all implementations.

Implementations MUST support Ed25519. Implementations SHOULD support RSA if they wish to interoperate with the mainline IPFS DHT and the default IPFS bootstrap nodes. Implementations MAY support Secp256k1 and ECDSA, but nodes using those keys may not be able to connect to all other nodes.

https://github.com/libp2p/specs/blob/master/peer-ids/peer-ids.md

tomaka commented 1 year ago

This seems like overspecifying things

What is the point of a specification if there isn't even any intersection between what every implementation must support? Why even have a specification at all and not just let every implementation define their own custom multiaddr format for WebTransport and WebRTC, then?

multiformats / multiaddr

add certhash #130

Multiaddresses aren't actually composable

How to choose multiaddr representations

Is /certhash intuitive?

Is using a multibase+multihash intuitive?

Counter-argument: extensibility

What I would propose

A side note about the vision

Multiaddresses aren't actually composable

How to choose multiaddr representations

Is /certhash intuitive?

Certificates

Is using a multibase+multihash intuitive?

Counter-argument: extensibility

What I would propose