ipfs / js-ipfs

IPFS implementation in JavaScript
https://js.ipfs.tech

Discuss: Embrace binary representation of ipfs/libp2p/ipld primitive like CID, Multiaddr, PeerId #3730

Closed Gozala closed 1 year ago

Gozala commented 3 years ago

We've discussed this idea on various ephemeral channels before, but as @vasco-santos pointed out it would be a lot more useful to have this discussion in the issue.

Problem statement

We have a bunch of user-space primitives like CID, Multiaddr, and PeerId; they all come with predicates like isCID and are basically wrapped Uint8Arrays with a bunch of methods for common tasks.

This approach has the following properties

Alternatives

An alternative approach could instead embrace the binary representation of those primitives and provide APIs to work with them via corresponding libraries (multiformats, peer-id, etc.)

It is true that this shifts the dependency requirement from producer to consumer, and I have no data to suggest that it would be a net win. But here are some hypotheses:

Side note

It is probably best not to use a Uint8Array as-is, but instead to wrap it in something to avoid collisions with arbitrary bytes. I think the following would be a reasonable compromise:

interface CID {
  name: "CID"
  bytes: Uint8Array
}
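To make the shape above concrete, here is a minimal sketch of what a structural predicate over that tagged-bytes representation could look like. The helper names (tagCID, isCID) are illustrative, not part of any published multiformats API, and a consumer needing real guarantees would still validate the bytes with its own copy of a CID library:

```javascript
// Hypothetical helpers over the proposed { name: "CID", bytes } shape.
// Wrap raw bytes in the tagged representation
const tagCID = (bytes) => ({ name: "CID", bytes })

// Cheap structural check: the tag matches and the payload is real bytes.
// This does NOT prove the bytes are a well-formed CID.
const isCID = (value) =>
  value != null &&
  value.name === "CID" &&
  value.bytes instanceof Uint8Array
```

A consumer library would call isCID at its API boundary and only pay the cost of full parsing when it actually needs the decoded fields.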
rvagg commented 3 years ago

I've been thinking a lot about this as I use and pass CIDs around, ever since you seeded the idea. The two things I can see being potential problems are:

  1. The loss of very strong checking of the bytes at each point along a chain of libraries - having to parse the bytes or string each time they change hands is inefficient, but it also means we're sure that we have what we think we have at every exit and entry point. Being able to have a soft wrapping around bytes and just wave your hands to say "yeah, this is probably a CID", without strictly having to check until you use it, could be a path to more bugs β€” which is kind of typical when you start to substitute type checking for proper testing (runtime and unit testing).
  2. The loss of memoisation. Which I suppose could be dealt with by holding onto a parsed/memoised form internally while you're dealing with it, but there will be a cost. We have a pretty nice _baseCache for toString() operations in multiformats/cid such that repeated toString() operations won't be too expensive. But smaller libraries only passing bytes between them lose all of this at the edges.
rvagg commented 3 years ago

That shouldn't be read as an overall -1 though, I'm actually pretty +1 on experimenting with this @gozala, I'd love to see a more concrete proposal from you for how the mechanics of dealing with CID would be in a world like this.

Gozala commented 3 years ago
1. The loss of very strong checking of the bytes at each point along a chain of libraries - having to parse the bytes or string each time they change hands is inefficient, but it also means we're sure that we have what we think we have at every exit and entry point. Being able to have a soft wrapping around bytes and just wave your hands to say "yeah, this is probably a CID", without strictly having to check until you use it, could be a path to more bugs β€” which is kind of typical when you start to substitute type checking for proper testing (runtime and unit testing).

I am not sure I fully grasp this argument, or specifically how { type: "CID", bytes: Uint8Array } is more error-prone in comparison to the CID class instance we have right now?

Both still require trusting the caller to have handed you what they claimed to. And when interfacing with untrusted code it's probably best to do validation either way.

  1. The loss of memoisation. Which I suppose could be dealt with by holding onto a parsed/memoised form internally while you're dealing with it, but there will be a cost. We have a pretty nice _baseCache for toString() operations in multiformats/cid such that repeated toString() operations won't be too expensive. But smaller libraries only passing bytes between them lose all of this at the edges.

I do not believe memoization requires the use of custom classes; in fact, we could even get better memoization by pushing it further down the stack to where the actual computation is happening. E.g. here is how CID.prototype.toString could look:

const toString = (cid, base) =>
  cid.version === 0 ? toStringV0(cid, base) : toStringV1(cid, base)

const toStringV1 = (cid, base = base32.encoder) => baseEncode(cid.bytes, base)

const baseEncode = (bytes, base) => {
  // One bytes β†’ string WeakMap per base prefix, created lazily
  let cache = baseCache[base.prefix]
  if (cache == null) {
    cache = new WeakMap()
    baseCache[base.prefix] = cache
  }

  let value = cache.get(bytes)
  if (value == null) {
    value = base.encode(bytes)
    cache.set(bytes, value)
  }
  return value
}
const baseCache = Object.create(null)

P.S: You could generalize this approach to remove most of the boilerplate, which I did not do for illustration purposes.

P.P.S: WeakMap lookups are slower than regular property lookups, and in cases where the perf implications are relevant and restricted access isn't a concern, actual properties could be used instead, e.g. cid[`_toString/${base.prefix}`] = base.encode(cid.bytes)
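The property-based variant from the P.P.S could be sketched roughly like this. This is a hypothetical illustration, not an existing multiformats API; `base` is assumed to expose `{ prefix, encode }` like a multibase encoder:

```javascript
// Hypothetical sketch: memoize base encodings directly on the CID object
// as plain properties instead of via a shared WeakMap. Trades restricted
// access for faster repeat lookups.
const toStringMemoized = (cid, base) => {
  const key = `_toString/${base.prefix}`
  if (cid[key] == null) {
    // First call for this base: do the actual encoding work once
    cid[key] = base.encode(cid.bytes)
  }
  // Subsequent calls for the same base hit the cached property
  return cid[key]
}
```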

That shouldn't be read as an overall -1 though, I'm actually pretty +1 on experimenting with this @Gozala, I'd love to see a more concrete proposal from you for how the mechanics of dealing with CID would be in a world like this.

To be clear, I'm not entirely convinced this is going to be a clear win as opposed to just different tradeoffs. Yet I thought it was a good idea to discuss.

Gozala commented 3 years ago

To elaborate on my "different tradeoffs" comment: I think there are two different approaches to abstraction / modularity:

  1. Interfaces: Encode the interaction API.
     πŸ’š Enables an implementation-agnostic interface with components.
     πŸ’š Use-case-optimized implementations can be made.
     πŸ’” Introduces a dependency on the producer side.
  2. Functions & Data: Represent things as data and encode the interaction API as a library.
     πŸ’š Plain data is great πŸŽ‰ Easy to move around, easy to encode / decode.
     πŸ’š Producer does not need to depend on things.
     β€οΈβ€πŸ©Ή Consumer needs a library to work with it, but is free to choose whichever lib / version.
     πŸ’” Does not generalize as well: you can not just drop in a new use-case-optimized representation; you need to upgrade the interaction libraries and the type representation.

https://github.com/ipfs/js-ipfs/issues/3725 moves things in the no 1 direction, while the approach described here would push them in the no 2 direction. I think no 2 might make more sense for these primitives because

  1. Their representations do not need to change (often)
  2. Generalization is not a concern. One could conceive of lazy representations, but I think that would be a bad idea as these types are better verified at input.

P.S: Rust does something really cool here, getting the best of both by modeling everything as structs (no 2) and using traits (basically interfaces) for interacting with them.

rvagg commented 3 years ago

The loss of very strong checking of the bytes at each point along a chain of libraries ...

I am not sure I fully grasp this argument or specifically ...

My point is more general: I'm seeing a lot of code across our repos now seemingly relying on type annotations and build:types scripts to ensure correct arguments, rather than the more appropriate actual type checking in JS. This is the worst of both worlds for JavaScript consumers of our libraries. foo(/** @type {boolean} */ bar, /** @type {CID} */ baz) really should also have a typeof bar === 'boolean' check and a CID.asCID (or similar) for external-facing APIs, but we don't have much of that, and I think it's because we're slipping into the trap of being lulled by our type checker while actually having very little safety or proper error reporting.

In the case of moving to { name: 'cid', bytes: [..] }, yes, we'd be able to nicely assert with TS (direct or annotations), but I foresee more slippage toward "this object I've been passed is supposed to be a CID, so I'll assume it is; after all, the TS build would have errored if it might not be". At least now we have the library at hand, and CID.asCID() or CID.isCID() are trivial and give us an easy means of error checking.

I don't think I'm offering an argument about the direct utility of what you're suggesting, just that it's pushing us developers more into a false sense of assurance about what type of thing a foo might be, when best practice for JS APIs is to always check and provide useful error messages where the user gives the wrong thing. Not a blocker, just a strong concern about best practices.
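The boundary checking being advocated above could look something like the following sketch. The asCID helper here is a stand-in for something like multiformats' CID.asCID over the proposed tagged-bytes shape; foo and its error messages are hypothetical:

```javascript
// Hypothetical stand-in for a CID.asCID-style coercion over the proposed
// { name: "CID", bytes } shape: returns the value if CID-shaped, else null.
const asCID = (value) =>
  value != null && value.name === "CID" && value.bytes instanceof Uint8Array
    ? value
    : null

// An external-facing API that checks its arguments at runtime and fails
// with useful errors, instead of trusting type annotations alone.
const foo = (bar, baz) => {
  if (typeof bar !== "boolean") {
    throw new TypeError(`expected a boolean, got ${typeof bar}`)
  }
  const cid = asCID(baz)
  if (cid == null) {
    throw new TypeError("expected a CID-shaped value")
  }
  return { bar, cid }
}
```

The point is that the runtime check is cheap relative to the debugging time it saves a consumer who passed the wrong thing.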

achingbrain commented 3 years ago

Using types as a compile-time replacement for duck typing and meaningful error messages is definitely a concern - we have open issues where people are passing the wrong argument types in, regardless of what the typedefs, the tests, or the docs say, usually because we supported that argument type at one point or another. They get obscure error messages with stack traces from deep inside unfamiliar modules and can't figure out why.

As someone who is up to their eyes in a refactor from one CID implementation to another, the thought of having yet another iteration of the representation of a CID does not fill me with joy as integrating these changes is an enormous job. It also burns large amounts of developer goodwill as we continually break the APIs they are depending on.

Gozala commented 3 years ago

On type checks not catching errors at runtime.

On not wanting another big refactor to change representation

SgtPooki commented 1 year ago

js-ipfs is being deprecated in favor of Helia. See https://github.com/ipfs/js-ipfs/issues/4336 and read the migration guide.

Please feel free to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterward (see https://github.com/ipfs/js-ipfs/issues/4336).

I believe this is going to need a thorough peek from @achingbrain and @BigLep as we decide on enhancements and direction of Helia. Leaving assigned to Alex, but updating js-ipfs deprecation project labels.

achingbrain commented 1 year ago

Helia operates at the block level; modules that implement extra functionality are free to innovate with whatever representation of their arguments they prefer.

Please do publish new modules and experiment!