Closed Gozala closed 1 year ago
I've been thinking a lot about this as I use and pass CID
around ever since you seeded the idea. The two things I can see being a potential problem are:
_baseCache
for toString()
operations in multiformats/cid such that repeated toString()
operations won't be too expensive. But smaller libraries only passing bytes between them lose all of this at the edges.That shouldn't be read as an overall -1 though, I'm actually pretty +1 on experimenting with this @gozala, I'd love to see a more concrete proposal from you for how the mechanics of dealing with CID
would be in a world like this.
1. The loss of very strong checking of the bytes at each point along a chain of libraries - the fact of having to parse the bytes or string each time it changes hands, while, inefficient but also means we're sure that we have what we think we have at every exit and entry point. Being able to have a soft wrapping around bytes and just wave your hands to say that "yeah, this is probably a CID" but not strictly have to check that it is until you use it, could be a path to more bugs β which is kind of typical when you start to replace type checking for proper testing (runtime and unit testing).
I am not sure I fully grasp this argument or specifically how is { type: "CID", bytes: Uint8Array }
is more error prone in comparison to CID class instance we have right now ?
Both still require trusting the caller that they've handed you what they've claimed to. And when interfacing with untrusted code it's probably best to do validation either way.
- The loss of memoisation. Which is suppose could be dealt with by holding onto a parsed/memoised form internally when you're dealing with it, but there will be a cost. We have a pretty nice _baseCache for toString() operations in multiformats/cid such that repeated toString() operations won't be too expensive. But smaller libraries only passing bytes between them lose all of this at the edges.
I do not believe memorization requires use of custom classes, in fact we could even get better memorization by pushing it further down the stack where actual computation is happening, e.g. here is how CID.prototype.toString could look like:
const toString = (cid, base) =>
cid.version === 0 ? toStringV0(cid, base) : toStringV1(cid)
const toStringV1 = (cid, base=base32.encoder) => baseEncode(base, cid.bytes)
const baseEncode = (bytes, base) => {
const cache = baseCache[base.prefix]
if (cache) {
const value = cache.get(bytes)
if (value) {
return value
} else {
const value = base.encode(bytes)
cache.set(bytes, value)
return value
}
} else {
const value = base.encode(bytes)
const cache = new WeakMap()
cache.set(bytes, value)
baseCache[base.prefix] = cache
return value
}
}
const baseCache = Object.create(null)
P.S: You could generalize this approach to remove most boilerplate, which I did not do for illustration purposes.
P.P.S: Weak map lookups WERE slower than regular property lookups, and in cases where perf implication are relevant and restricted access isn't a concern actual properties could be utilized instead e.g. cid['_toString/${base.prefix}'] = base.encode(cid.base)
That shouldn't be read as an overall -1 though, I'm actually pretty +1 on experimenting with this @Gozala, I'd love to see a more concrete proposal from you for how the mechanics of dealing with CID would be in a world like this.
To be clear I'm not entirely convinced this is going to be clear win as opposed to just different tradeoffs. Yet I thought it was a good idea to discuss.
To elaborate on my different tradeoffs comment. I think there two different approaches to abstractions / modularity:
https://github.com/ipfs/js-ipfs/issues/3725 effort moves things in the no 1 direction, while approach described here would push it into no 2 direction. I think no 2 might make more sense for these primitives because
P.S: Rust does something really cool here by getting best of both by modeling everything as structs (no 2) and using traits (basically interfaces) for interacting with it.
The loss of very strong checking of the bytes at each point along a chain of libraries ...
I am not sure I fully grasp this argument or specifically ...
My point is more general that I'm seeing a lot of code across our repos now seemingly relying on type annotations and build:types
type scripts to ensure correct arguments rather than the more appropriate actual type checking in JS. This is the worst of both worlds for JavaScript consumers of our libraries. foo(/** @type {boolean} */ bar, /** @type {CID} */ baz)
really should also have typeof bar === 'boolean'
and a CID.asCID
(or similar) for external-facing APIs, but we don't have much of that and I think it's because we're slipping into the trap of being lulled by our type checker, but we actually have very little safety or proper error reporting.
In the case of moving to { name: 'cid', bytes: [..] }
, yes we'd be able to nicely assert with TS (direct or annotations), but I forsee more slippage toward "this object I've been passed is supposed to be a CID, so I'll assume it is, after all, TS build would have errored if it may not be". At least now we have the library at hand and CID.asCID()
or CID.isCID()
are trivial and give us an easy means to error checking.
I don't think I'm offering an argument about the direct utility of what you're suggesting, just that it's pushing us developers more into a false sense of assurance about what type of thing a foo
might be, when best practice for JS APIs is to always check and provide useful error messages where the user gives the wrong thing. Not a blocker, just a strong concern about best practices.
Using types as a compile-time replacement for duck typing and meaningful error messages is definitely a concern - we have issues opened where people are passing the wrong argument types in, regardless of what the typedefs, the tests or the docs say, usually because we supported that argument type at one point or another. They get obscure error messages with stack traces from deep inside unfamiliar modules and can't figure out why.
As someone who is up to their eyes in a refactor from one CID implementation to another, the thought of having yet another iteration of the representation of a CID does not fill me with joy as integrating these changes is an enormous job. It also burns large amounts of developer goodwill as we continually break the APIs they are depending on.
On type check not catching errors at runtime.
On not wanting another big refactor to change representation
js-ipfs is being deprecated in favor of Helia. You can https://github.com/ipfs/js-ipfs/issues/4336 and read the migration guide.
Please feel to reopen with any comments by 2023-06-02. We will do a final pass on reopened issues afterward (see https://github.com/ipfs/js-ipfs/issues/4336).
I believe this is going to need a thorough peek from @achingbrain and @BigLep as we decide on enhancements and direction of Helia. Leaving assigned to Alex, but updating js-ipfs deprecation project labels.
Helia operates at a block level, modules that implement extra functionality are free to innovate with whatever representation of their arguments they prefer.
Please do publish new modules and experiment!
We've discussed this idea on various ephemeral channels before, but as @vasco-santos pointed out it would be a lot more useful to have this discussion in the issue.
Problem statement
We have bunch of user space primitives like
CID
,Multiaddr
,PeerId
and they all come with predicates likeisCID
and are basically wrappedUint8Array
with bunch of methods for common tasks.This approach has following properties
Alternatives
An alternative approach could instead embrace binary representation of those primitives and provide an APIs to work with via corresponding libraries multiformats, peer-id, etc...
It is true that this shifts dependency requirement from a producer to a consumer and I have no data to suggest that it would be a net win. But here some hypothesis:
cid.toV1()
will not be a thing.toString()
would no longer be useful.Side note
It is probably best to not use Uint8Array as is, but instead wrap it into something to avoid collision with arbitrary bytes I think following would provide a reasonable compromise: