ipfs / js-ipfs

IPFS implementation in JavaScript
https://js.ipfs.tech

Discuss: Embrace binary representation of ipfs/libp2p/ipld primitive like CID, Multiaddr, PeerId #3730

Closed Gozala closed 1 year ago

Gozala commented 3 years ago

We've discussed this idea on various ephemeral channels before, but as @vasco-santos pointed out it would be a lot more useful to have this discussion in the issue.

Problem statement

We have a bunch of user-space primitives like CID, Multiaddr, and PeerId; they all come with predicates like isCID and are basically wrapped Uint8Arrays with a bunch of methods for common tasks.

This approach has the following properties

Alternatives

An alternative approach could instead embrace the binary representation of those primitives and provide APIs to work with them via corresponding libraries (multiformats, peer-id, etc.)

It is true that this shifts the dependency requirement from producer to consumer, and I have no data to suggest that it would be a net win. But here are some hypotheses:

Side note

It is probably best not to use a Uint8Array as-is, but instead to wrap it in something to avoid collisions with arbitrary bytes. I think the following would be a reasonable compromise:

interface CID {
  name: "CID"
  bytes: Uint8Array
}
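To make the shape above concrete, here is a minimal sketch of what a structural predicate over that tagged-bytes representation could look like. The helper names (tagCID, isCID) are illustrative, not part of any published multiformats API, and a consumer needing real guarantees would still validate the bytes with its own copy of a CID library:

```javascript
// Hypothetical helpers over the proposed { name: "CID", bytes } shape.
// Wrap raw bytes in the tagged representation
const tagCID = (bytes) => ({ name: "CID", bytes })

// Cheap structural check: the tag matches and the payload is real bytes.
// This does NOT prove the bytes are a well-formed CID.
const isCID = (value) =>
  value != null &&
  value.name === "CID" &&
  value.bytes instanceof Uint8Array
```

A consumer library would call isCID at its API boundary and only pay the cost of full parsing when it actually needs the decoded fields.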
rvagg commented 3 years ago

I've been thinking a lot about this as I use and pass CIDs around, ever since you seeded the idea. The two things I can see being potential problems are:

  1. The loss of very strong checking of the bytes at each point along a chain of libraries - having to parse the bytes or string each time they change hands is inefficient, but it also means we're sure that we have what we think we have at every exit and entry point. Being able to have a soft wrapping around bytes and just wave your hands to say "yeah, this is probably a CID", without strictly having to check until you use it, could be a path to more bugs β€” which is kind of typical when you start to substitute type checking for proper testing (runtime and unit testing).
  2. The loss of memoisation. Which I suppose could be dealt with by holding onto a parsed/memoised form internally while you're dealing with it, but there will be a cost. We have a pretty nice _baseCache for toString() operations in multiformats/cid such that repeated toString() operations won't be too expensive. But smaller libraries only passing bytes between them lose all of this at the edges.
rvagg commented 3 years ago

That shouldn't be read as an overall -1 though, I'm actually pretty +1 on experimenting with this @gozala, I'd love to see a more concrete proposal from you for how the mechanics of dealing with CID would be in a world like this.

Gozala commented 3 years ago
1. The loss of very strong checking of the bytes at each point along a chain of libraries - having to parse the bytes or string each time they change hands is inefficient, but it also means we're sure that we have what we think we have at every exit and entry point. Being able to have a soft wrapping around bytes and just wave your hands to say "yeah, this is probably a CID", without strictly having to check until you use it, could be a path to more bugs β€” which is kind of typical when you start to substitute type checking for proper testing (runtime and unit testing).

I am not sure I fully grasp this argument, or specifically how { type: "CID", bytes: Uint8Array } is more error-prone in comparison to the CID class instance we have right now?

Both still require trusting the caller to have handed you what they claimed to. And when interfacing with untrusted code it's probably best to do validation either way.

  1. The loss of memoisation. Which I suppose could be dealt with by holding onto a parsed/memoised form internally while you're dealing with it, but there will be a cost. We have a pretty nice _baseCache for toString() operations in multiformats/cid such that repeated toString() operations won't be too expensive. But smaller libraries only passing bytes between them lose all of this at the edges.

I do not believe memoization requires the use of custom classes; in fact, we could even get better memoization by pushing it further down the stack to where the actual computation is happening. E.g. here is how CID.prototype.toString could look:

const toString = (cid, base) =>
  cid.version === 0 ? toStringV0(cid, base) : toStringV1(cid, base)

const toStringV1 = (cid, base = base32.encoder) => baseEncode(cid.bytes, base)

const baseEncode = (bytes, base) => {
  // One bytes β†’ string WeakMap per base prefix, created lazily
  let cache = baseCache[base.prefix]
  if (cache == null) {
    cache = new WeakMap()
    baseCache[base.prefix] = cache
  }

  let value = cache.get(bytes)
  if (value == null) {
    value = base.encode(bytes)
    cache.set(bytes, value)
  }
  return value
}
const baseCache = Object.create(null)

P.S: You could generalize this approach to remove most of the boilerplate, which I did not do for illustration purposes.

P.P.S: WeakMap lookups are slower than regular property lookups, and in cases where the perf implications are relevant and restricted access isn't a concern, actual properties could be used instead, e.g. cid[`_toString/${base.prefix}`] = base.encode(cid.bytes)
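The property-based variant from the P.P.S could be sketched roughly like this. This is a hypothetical illustration, not an existing multiformats API; `base` is assumed to expose `{ prefix, encode }` like a multibase encoder:

```javascript
// Hypothetical sketch: memoize base encodings directly on the CID object
// as plain properties instead of via a shared WeakMap. Trades restricted
// access for faster repeat lookups.
const toStringMemoized = (cid, base) => {
  const key = `_toString/${base.prefix}`
  if (cid[key] == null) {
    // First call for this base: do the actual encoding work once
    cid[key] = base.encode(cid.bytes)
  }
  // Subsequent calls for the same base hit the cached property
  return cid[key]
}
```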

That shouldn't be read as an overall -1 though, I'm actually pretty +1 on experimenting with this @Gozala, I'd love to see a more concrete proposal from you for how the mechanics of dealing with CID would be in a world like this.

To be clear, I'm not entirely convinced this is going to be a clear win as opposed to just different tradeoffs. Yet I thought it was a good idea to discuss.

Gozala commented 3 years ago

To elaborate on my "different tradeoffs" comment: I think there are two different approaches to abstraction / modularity:

  1. Interfaces: Encode the interaction API.
     πŸ’š Enables an implementation-agnostic interface with components.
     πŸ’š Use-case-optimized implementations can be made.
     πŸ’” Introduces a dependency on the producer side.
  2. Functions & Data: Represent things as data and encode the interaction API as a library.
     πŸ’š Plain data is great πŸŽ‰ Easy to move around, easy to encode / decode.
     πŸ’š Producer does not need to depend on things.
     β€οΈβ€πŸ©Ή Consumer needs a library to work with it, but is free to choose whichever lib / version.
     πŸ’” Does not generalize as well: you can not just drop in a new use-case-optimized representation; you need to upgrade the interaction libraries and the type representation.

https://github.com/ipfs/js-ipfs/issues/3725 moves things in the no 1 direction, while the approach described here would push them in the no 2 direction. I think no 2 might make more sense for these primitives because

  1. Their representations do not need to change (often)
  2. Generalization is not a concern. One could conceive of lazy representations, but I think that would be a bad idea as these types are better verified at input.

P.S: Rust does something really cool here, getting the best of both by modeling everything as structs (no 2) and using traits (basically interfaces) for interacting with them.

rvagg commented 3 years ago

The loss of very strong checking of the bytes at each point along a chain of libraries ...

I am not sure I fully grasp this argument or specifically ...

My point is more general: I'm seeing a lot of code across our repos now seemingly relying on type annotations and build:types scripts to ensure correct arguments, rather than the more appropriate actual type checking in JS. This is the worst of both worlds for JavaScript consumers of our libraries. foo(/** @type {boolean} */ bar, /** @type {CID} */ baz) really should also have a typeof bar === 'boolean' check and a CID.asCID (or similar) for external-facing APIs, but we don't have much of that, and I think it's because we're slipping into the trap of being lulled by our type checker while actually having very little safety or proper error reporting.

In the case of moving to { name: 'cid', bytes: [..] }, yes, we'd be able to nicely assert with TS (direct or annotations), but I foresee more slippage toward "this object I've been passed is supposed to be a CID, so I'll assume it is; after all, the TS build would have errored if it might not be". At least now we have the library at hand, and CID.asCID() or CID.isCID() are trivial and give us an easy means of error checking.

I don't think I'm offering an argument about the direct utility of what you're suggesting, just that it's pushing us developers more into a false sense of assurance about what type of thing a foo might be, when best practice for JS APIs is to always check and provide useful error messages where the user gives the wrong thing. Not a blocker, just a strong concern about best practices.
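The boundary checking being advocated above could look something like the following sketch. The asCID helper here is a stand-in for something like multiformats' CID.asCID over the proposed tagged-bytes shape; foo and its error messages are hypothetical:

```javascript
// Hypothetical stand-in for a CID.asCID-style coercion over the proposed
// { name: "CID", bytes } shape: returns the value if CID-shaped, else null.
const asCID = (value) =>
  value != null && value.name === "CID" && value.bytes instanceof Uint8Array
    ? value
    : null

// An external-facing API that checks its arguments at runtime and fails
// with useful errors, instead of trusting type annotations alone.
const foo = (bar, baz) => {
  if (typeof bar !== "boolean") {
    throw new TypeError(`expected a boolean, got ${typeof bar}`)
  }
  const cid = asCID(baz)
  if (cid == null) {
    throw new TypeError("expected a CID-shaped value")
  }
  return { bar, cid }
}
```

The point is that the runtime check is cheap relative to the debugging time it saves a consumer who passed the wrong thing.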

achingbrain commented 3 years ago

Using types as a compile-time replacement for duck typing and meaningful error messages is definitely a concern - we have open issues where people are passing the wrong argument types in, regardless of what the typedefs, the tests, or the docs say, usually because we supported that argument type at one point or another. They get obscure error messages with stack traces from deep inside unfamiliar modules and can't figure out why.

As someone who is up to their eyes in a refactor from one CID implementation to another, the thought of having yet another iteration of the representation of a CID does not fill me with joy as integrating these changes is an enormous job. It also burns large amounts of developer goodwill as we continually break the APIs they are depending on.

Gozala commented 3 years ago

On type checks not catching errors at runtime.

On not wanting another big refactor to change representation

SgtPooki commented 1 year ago

js-ipfs is being deprecated in favor of Helia. See https://github.com/ipfs/js-ipfs/issues/4336 and read the migration guide.

Please feel free to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterward (see https://github.com/ipfs/js-ipfs/issues/4336).

I believe this is going to need a thorough peek from @achingbrain and @BigLep as we decide on enhancements and direction of Helia. Leaving assigned to Alex, but updating js-ipfs deprecation project labels.

achingbrain commented 1 year ago

Helia operates at the block level; modules that implement extra functionality are free to innovate with whatever representation of their arguments they prefer.

Please do publish new modules and experiment!