multiformats / rust-multiaddr

multiaddr implementation in rust
https://crates.io/crates/multiaddr
Other
86 stars 45 forks source link

Define and use parsed address segment representation inside `Multiaddr` instead of `Vec<u8>` #25

Closed ntninja closed 2 years ago

ntninja commented 6 years ago

This PR rewrites most of src/protocol.rs to provide a companion type for each variant of protocol::Protocol that can store a fully deserialized representation of the given address segment's data.

Known caveats:

  1. The protocol::IPFSSegment type is pretty verbose because it has to replicate some functionality from the cid crate that is not exposed in a reusable way
  2. The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method
  3. The Protocol and Segment enumerations feel somewhat redundant, but I can't think of any good way of unifying them without starting to work around the type system.
  4. Needs tests for new features!

(4) will be fixed (1) I will propably fix at the source myself as well (2) can be fixed any time if you're willing to break backward-compatibility

Any other comments about the code, approach, etc.?

ntninja commented 6 years ago

The Multiaddr main struct also needs to carry a (useless) field with the serialized byte representation of the struct to ensure backwards-compatiblity with the .as_slice() method

I don't agree with this point. Cargo.toml versions and Cargo.lock files could be used to use the old version of the library. imo, api for backwards compatibility is almost always redundant thanks to cargo toolchain

But if one library uses the old version of multiaddr and one library uses a newer, incompatible version, won't that mean that a Multiaddr struct from the old library, won't be able to passed to the library that expects the newer version (linking errors)? Basically any library exposing a Multiaddr in a public interface will have to release a major version as well when switching to the new version of multiaddr, no?

ntninja commented 6 years ago

I just read the WikiPedia page on .onion addresses and realized that each address is 80 bits in length (not bytes!). I've updated the struct and dropped all the custom derive code that was required because of the array with more than 32 entries. (It should have seemed weird before: 80 bytes is longer than SHA512 with 64 bytes!)

dignifiedquire commented 6 years ago

Thank you @alexander255 for all the work here and thank you @debris for the review. I really like this direction but haven't had time to dig through it yet though. I will try and get to it in the next days.

ntninja commented 6 years ago

@dignifiedquire Thanks! I still plan on adding some tests (currently tracking down a bug found by this), before this should be committed through.

ntninja commented 6 years ago

@dignifiedquire: I fixed some stuff, exposed byte stream serializing and parsing in MultiAddr and added two more tests. From my point of view, I'm done here – please review.

Also I did all changes in a backwards compatible way, but if you are willing to break compatibility to get rid of the ugly left-overs caused by this I'd be willing to prepare a separate PR for that. (i.e.: This PR would result in version 0.2.1, and, after the other PR we would bump the version to 0.3.0. Also all methods removed in 0.3.0 would cause deprecation warnings for downstream developers using 0.2.1.)

ntninja commented 6 years ago

Another note: This PR implements a design pretty much opposite to what #11 is describing. Individual address segments should be able to be used as trait objects using the AddressSegment trait and doing so is indeed extensible by crate users, however there is currently no mechanism to use these address segments inside the standard MultiAddr type.

crackcomm commented 6 years ago

@alexander255 have you considered a Custom protocol or registering user-protocols?

Related: #15

ntninja commented 6 years ago

@crackcomm: I have considered adding a special address segment enum variant that can wrap any existing Box<AddressSegment> (even if they are provided by other libraries). Unfortunately however, such a thing is almost entirely useless without any way to store it as part of a MultiAddr object that can then be passed around. There is also no mechsim currently to register custom protocol parsers with the library (i.e. The situation on this is totally unchanged compared to how it was before.)

While all of these things are totally implementable, it basically boils down to a design decision: Do we want rust-multiaddr to implement a closed set of types as efficiantly as possible (this PR based on the previous code) or do want it to be an open system where library users bring their own types and we only define some core defaults (needs a redesign of most of the library code)?

@dignifiedquire: Thoughts on this?

dignifiedquire commented 6 years ago

While having user extension is a nice feature, currently for multiaddr the process is to propose a new address format in https://github.com/multiformats/multiaddr and after it lands there it is adopted by implementations. So I think it is fine to have it implemented in the perf oriented way that is done now.

So given that I would like to see concrete use cases for those custom handlers before changing the architecture of the module to enable it.

crackcomm commented 6 years ago

@dignifiedquire good, it is fine, for now.

daviddias commented 6 years ago

@tomaka mind taking the lead and reviewing this PR?

tomaka commented 6 years ago

I don't know if this PR is the appropriate place to discuss this, but I've gone a different way in our local fork: https://github.com/libp2p/rust-libp2p/tree/cace5bf577c53809aad514db40e72f7290e030d4/rust-multiaddr

In libp2p, a multiaddr is generally stored somewhere in memory, is often sent/received over the network, and is often hashed and compared with other multiaddresses, but it is not often decoded. You only need to decode a multiaddr when you want to open a new connection to it, which doesn't happen that often. Therefore I went with keeping a private Vec<u8> inside the Multiaddress struct, because it is both the most space/cache-local-efficient memory representation and the most efficient when it comes to comparing and hashing.

In addition to this, our local fork adds an AddrSegment enum which is very similar to the Segment enum of this PR. My reasoning, however, is that if you really want strong typing then you might as well just store a Vec<AddrSegment>. A Multiaddr can easily be decoded into an iterator of AddrSegment, and doesn't perform any additional verification compared to a Vec<AddrSegment>.

The only major downside, however, is that with a Vec<u8> the pop() operation needs to iterator over the whole array of bytes. With a Vec<AddrSegment> you can simply remove the last element. However in my opinion some benchmarking would needed, because reading the poped AddrSegment may require a memory fetch, which may counter-balance the small overhead.

dignifiedquire commented 2 years ago

closing this, as the impl stores the byte representation now, which is for many use cases quite good. happy to reconsider if there is a concrete demonstrable perf or correctness issue.