Gozala opened 5 years ago
hrm.... the only problem I see is that `get`, `tree` and `links` are all synchronous but reader creation is async. If we were to lazily create the reader instance and cache it on the instance, all those methods would have to be async. This is minimal overhead for `get`, but right now async generators are a bit slower than normal generators and would add a fair amount of event-system overhead on what is effectively a set of sync operations.

To be clear, I’m in agreement with you that moving these methods to the top level would be a nicer API and developer experience, I just want to figure out a good solution to this async vs sync issue.
> hrm.... the only problem I see is that `get`, `tree` and `links` are all synchronous but reader creation is async. If we were to lazily create the reader instance and cache it on the instance, all those methods would have to be async. This is minimal overhead for `get`, but right now async generators are a bit slower than normal generators and would add a fair amount of event-system overhead on what is effectively a set of sync operations.
I did not realize they were sync, but that is actually a problem IMO. What if `decode` / `encode` needs to be async, which would be the case whenever cryptography is involved?

It's not a problem in the existing examples because the whole block is decoded in one step (which is presumably why getting a reader is async & the rest can be sync), but if we consider #3, where you don't want to decode the whole thing, it seems this would not really work.
> To be clear, I’m in agreement with you that moving these methods to the top level would be a nicer API and developer experience, I just want to figure out a good solution to this async vs sync issue.
In the other threads and in the submitted type defs I proposed to make all ops return `Eventual<a> = a | Promise<a>`, which would allow implementations to return in the same tick when possible, or a promise otherwise. I would suggest that might be a reasonable way to deal with the async / sync dilemma, as things like cbor could just implement sync get / list / tree, and implementations that can't would use async APIs instead.
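A minimal sketch of what consuming such an `Eventual<a>` could look like (the `then` helper name here is my own invention, not a settled API): the consumer stays in the same tick for plain values and only defers for actual Promises.

```javascript
// Hypothetical helper for the Eventual<a> = a | Promise<a> idea:
// apply `f` to a result without forcing a microtask when it's synchronous.
const isPromise = (value) => value instanceof Promise

const then = (eventual, f) =>
  isPromise(eventual) ? eventual.then(f) : f(eventual)

// A codec returning synchronously is observed in the same tick...
let seen = null
then(42, (v) => { seen = v }) // `seen` is 42 immediately
// ...while a Promise-returning codec defers as usual.
then(Promise.resolve('later'), (v) => console.log(v))
```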
> I did not realize they were sync, but that is actually a problem IMO. What if `decode` / `encode` needs to be async, which would be the case whenever cryptography is involved.
This is a little complicated and not well documented, so let me walk through the full interface and you can tell me if this works for your use case or not.
`reader` is async, so if there is async work necessary to do reading it can happen there; it doesn’t need to happen in the `Reader()` instance itself. This means that you can still do async work for partial decodings and decryption, assuming that you only need a single endpoint per block to do so. If there is a case where you actually need to do async work in partial decoding that must happen on a per-property basis, then this won’t work and we’ll need to re-think it a bit.
> If there is a case where you actually need to do async work in partial decoding that must happen on a per-property basis, then this won’t work and we’ll need to re-think it a bit.
That is what I meant.
For example, Flatbuffers use an index to know the offsets of all the encoded data, and therefore you can read a deeply nested field value without decoding anything but the bytes for that value. Now let's say the actual decoding is an async operation; that would imply that each property decode is an async operation as well, and then you have a problem.

Now the Flatbuffer API (if I recall correctly) isn't actually async, so it should not necessarily be a problem. However, if you consider the case where properties might be encrypted, decode would need to decrypt and become async, and the current API design would be incompatible with such a use case. Which is why I'm suggesting not to bake that assumption into the API, or such use cases will become problematic.
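As an illustration of the offset-index idea (the layout and names here are invented for the sketch, not Flatbuffers' actual wire format): an index mapping field names to byte ranges lets you slice out a single field without decoding its siblings, and a per-field decode step, sync or async, could then be applied to just that slice.

```javascript
// Toy offset-indexed reader: `index` maps field name -> [start, length].
// Reading one field touches only that field's bytes.
const lazyReader = (bytes, index) => ({
  get (field) {
    const [start, length] = index[field]
    return bytes.subarray(start, start + length)
  }
})

// Five bytes of "encoded" data: field a occupies [0, 2), field b [2, 5).
const bytes = Uint8Array.from([1, 2, 3, 4, 5])
const reader = lazyReader(bytes, { a: [0, 2], b: [2, 3] })
```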
I think https://github.com/ipld/js-ipld-stack/issues/10#issuecomment-480086662 provides a nice solution, as you can obtain `BlockView` synchronously (which is similar to `Reader`) but do the actual decoding asynchronously when `.decode()` is called. Essentially, the `await` step happens later.
Pulling in @rvagg and @mcollina for a question: “At what scale do we need to worry about the performance impact of async generators vs regular generators?”
I’m tempted to just make all the interfaces async in order to open up more options in the future but I’m a little worried about the perf hit.
It’s significant in hot paths. It’s a promise per chunk and at least a microtick. Moreover if they are not implemented natively (old browsers) it’s 100+ lines of code each.
I would not make all functions async for similar reasons.
The `Eventual<a> = a | Promise<a>` is nice, but unfortunately AFAIK current JS engines will still wrap a return value in a Promise if you use it as if there is one. I think this goes equally for `await` and `for await` too, sadly. The semantics require crossing the next-tick boundary if you use `await`, just like whenever you touch a `Promise`, I believe.
For performance-critical cases I bet it wouldn't be hard to come up with an iteration interface that is conditional on whether an API returns a Promise or not. Eschew the much nicer `for await ()` for a clunkier API: `iterate(thing, onElement)`, where `iterate()` pulls `element = thing()` and, if it's synchronous, `onElement(element)` is called synchronously; if it's a `Promise` then it traverses the microtick boundary (at least), but that's hopefully indicating that there are inherent costs, like decryption, that need to be deferred.
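A rough sketch of that `iterate(thing, onElement)` shape (the `done` sentinel and exact signature are my guesses at what's being described, not a settled API): the loop stays synchronous until the producer actually hands back a Promise.

```javascript
// Sentinel the producer returns when exhausted (an assumption of this sketch).
const done = Symbol('done')

// Pull elements from `thing` (a zero-arg producer) and deliver each to
// `onElement`. Synchronous values are delivered in the same tick; only a
// Promise makes the loop cross the microtask boundary.
const iterate = (thing, onElement) => {
  for (;;) {
    const element = thing()
    if (element === done) return
    if (element instanceof Promise) {
      return element.then((value) => {
        onElement(value)
        return iterate(thing, onElement) // resume pulling after the await
      })
    }
    onElement(element) // fast path: no Promise allocated, no deferral
  }
}
```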
But that's assuming this even becomes a bottleneck! I bet that by the time the other bottlenecks are taken care of (CBOR decoding, block decryption, and of course the I/O costs of fetching blocks in the first place!) and this becomes the next most important bottleneck, JS engines will have made this problem go away (in a relative sense). Things like this, which was just merged into ECMA262, certainly help, and V8 has an implementation behind a flag waiting (maybe it's now unflagged?).
Also, in the inline-block case, couldn't a `Reader` encounter an inline block that needs to go through a `decode()` cycle, so `get()` (and friends) would need to be async anyway? Or is the assumption that inline blocks would be decoded in the initial `decode()` pass? Maybe that's too much cost, because you might want to `get()` something that's not in an inline block, so you could avoid the secondary decode cost altogether, which would be nice if that cost is high (e.g. decryption).
```
{
  "cheap": "thing",
  "secret-box": encrypted-inline-block
}
```
Where you just want to `get('/cheap')`. But being able to defer the costs of dealing with an inline block assumes pushing async all the way up to `get()` et al., doesn't it?
You can avoid that:

```js
const unbox = v => v instanceof Promise ? null : [v]
const main = async (block) => {
  const node = block.decode()
  const maybeData = unbox(node)
  // Only await when decode() actually returned a Promise
  const data = maybeData ? maybeData[0] : await node
  // ....
}
```

Not pretty, but it will do the trick.
@rvagg https://github.com/tc39/ecma262/pull/1250 is as much as we’ll get from the spec. A fact not very well known is that I did all the Node-backing analysis for that :/. Even with that change, `await` uses two microticks.

The other issue is that using `await` prevents V8 inlining and object escape analysis. It will always have a cost; I agree it will go down, but it will never disappear. When I did my perf analysis of IPFS, I found a “death by a thousand cuts” situation, mostly caused by the composability-over-performance principle. One of the most used patterns was using `setImmediate` to fake asynchronicity :/.
Given that the `reader` method is async, we’re already giving an entry point for async work. I think it’s reasonable to say “you can do partial reads, but you have to do any async work up-front so that the individual reads are sync” if that’s the performance profile we actually need from people implementing that interface. This is all at the block level, so requiring sync reads from an atomic entity like a block seems reasonable given the constraints we’re under.
This is not a comment in favor of an async vs sync API: I am getting a sense that encrypted blocks are treated somewhat as a secondary use case. I would like to invite everyone in the community to flip that around and think of non-encrypted blocks as the secondary use case; otherwise I fear we'll end up building an ecosystem where security & privacy are afterthoughts.

Now back to the sync / async arguments. I think choosing one would be a mistake, as it automatically makes the other choice a secondary use case. If you choose always-async, you end up paying in microtasks; if you choose sync, you end up making lazy decoding second class and probably paying in both memory & CPU.
This is why I am advocating to make all methods return `Eventual<a>`, as in `a | Promise<a>`, which allows the producer (codec) to optimize what matters on its end and allows the consumer to optimize what matters on its end. As in https://github.com/ipld/js-ipld-stack/issues/7#issuecomment-481107429, it would allow the consumer to avoid `await`-ing whenever the producer is able to fulfill a request immediately, and to `await` when it isn't. In non-performance-critical code (e.g. demos, learning material) that optimization could be ignored and plain `await` could be used everywhere. Better yet, both consumer and producer can start with a naive implementation (`async` methods) and optimize later on by avoiding promises when results can be returned in the same tick.
Does this penalty apply when I do `let x = await syncMethod()`?
> using setImmediate to fake asynchronicity
These probably came from the waterfall code in all the interfaces. My concern is that this may still be a problem after the move to async/await but will be a little harder to observe as such.
> I would like to invite everyone in the community to flip that around and think of non-encrypted blocks as the secondary use case, otherwise I fear we'll end up building an ecosystem where security & privacy are afterthoughts.
The question isn’t about which use cases are secondary vs primary, it’s about what layer of the stack those use cases belong in and whether those interfaces can be synchronous. Even if the codec format layer goes sync (there’s a separate thread about that and I’m not sure it will), the Block encode/decode methods will still need to be async, because `get-format` MUST be async: in the browser it needs to dynamically load the codec. Whatever perf penalty an async wrap imposes on this, it’s nothing compared to the eventual default bundle size we’ll have if we have to bundle every codec you might ever need ;) Note: we could improve the performance of this in the future by caching the module return and using the raw promises interface, making the encode/decode interfaces lose their async wrap when the encode/decode and the codec lookup happen synchronously.
So, this block interface will still have async methods for encode/decode/reader; the remaining question is whether we can keep the individual read methods in the Reader interface sync, and I think we can. At the Block layer interface it’s reasonable to expect that any necessary decryption for partial reads would be done up-front. The encryption use cases I’ve seen so far use multiple blocks (with inline blocks), which means that reading into an encrypted property is a multiblock operation and the block will need to be decrypted atomically, which means you can do the individual reads on the block synchronously.
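As a sketch of that contract (all names below are assumptions about the eventual API shape, not the real implementation): the one async `reader()` call front-loads any decode or decryption work, and everything on the returned reader reads synchronously.

```javascript
// Hypothetical Reader over an already-decoded node: get() is fully sync.
class Reader {
  constructor (node) { this.node = node }
  get (path) {
    // Walk '/a/b'-style paths over the decoded node: no awaiting needed.
    return path.split('/').filter(Boolean)
      .reduce((obj, key) => obj[key], this.node)
  }
}

// Hypothetical Block whose async reader() does all async work up front
// (decompression, decryption, ...), once per block.
class Block {
  constructor (bytes, decode) {
    this.bytes = bytes
    this.decode = decode // may be async
  }
  async reader () {
    const node = await this.decode(this.bytes)
    return new Reader(node)
  }
}
```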
> Does this penalty apply when I do `let x = await syncMethod()`?
That async function will be suspended on `await` regardless of what `syncMethod` returns. I know there have been some discussions on the SpiderMonkey team about doing a clever analysis to avoid suspending when it would not be perceivable, but I don't know exactly where they landed on that.
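This is easy to observe directly. In the standalone snippet below (not project code), the caller's next line runs before the code after `await`, even though `syncMethod` returns a plain number:

```javascript
// `await` suspends the async function even for non-Promise values:
// 'after-call' is recorded before 'after-await'.
const order = []
const syncMethod = () => 42 // returns a plain value, not a Promise

const run = async () => {
  order.push('before-await')
  const x = await syncMethod() // still yields to the microtask queue
  order.push('after-await')
  return x
}

const result = run()
order.push('after-call') // executes before the awaited continuation resumes
```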
So the way I see it (please correct me if I'm mistaken), the only thing that `reader()` gives you is the ability to do one async call to `reader()` and then be able to do `.get()`, `.tree()`, `.links()` without further awaiting. Implying that there is an expectation that one would need to do multiple of those, and that's why it matters.
However, I'm really not seeing how I could provide sync `get()`, `tree()`, `links()` without actually going and decoding the whole thing. Or, in cases where I can do that, I don't expect I'd need an async operation to create a reader.
It very much could be that the use cases I have in mind are "multiblock" (I don't actually know what that is), but from where I stand `reader()` forces me to use a more complicated interface without allowing me to optimize anything.
I am also doing some experimentation with this API & a forked version of it, as I think it would help me better understand my own and assumed constraints. Hopefully I'll have some more evidence-based feedback later this week.

I think it might be best to try to implement a Flatbuffer IPLD codec, as that would provide a specific example of an encoding where blocks could be traversed without decoding & each field could be decoded without having to decode anything else.

A quick look at the JS implementation of Flatbuffers https://github.com/google/flatbuffers/blob/master/js/flatbuffers.js and it looks like everything is synchronous.
> At the Block layer interface it’s reasonable to expect that any necessary decryption for partial reads would be done up-front. The encryption use cases I’ve seen so far use multiple blocks (with inline blocks), which means that reading into an encrypted property is a multiblock operation and the block will need to be decrypted atomically, which means you can do the individual reads on the block synchronously.
Ok, so let's say I want to express SSB with IPLD; here is the specific message format used there:

If I were to map it to an IPLD node, this is more or less what it would look like:
```ts
type PrivateMessage<message> = {
  nonce: string,
  key: BodyPublicKey,
  recepients: Concat<Encrypted<Concat<[RecepientCount, BodyPrivateKey]>, RecepientPublicKey>[]>,
  body: Encrypted<message, BodyPrivateKey>
}
```
Now, it is true that it would be reasonable to redesign the format so that `body` would be its own inline block, and `recepients` as well. But the problem is that would require convincing the SSB community to switch to a new format, & even then there would be no way to handle messages encoded in the prior format.
This is just one example that the proposed `reader` interface cannot handle well: I can't just do `.get('key')` without having to go and decrypt everything. I do however think that this is not the only case; many encrypted protocols use similar message formats, because they did not have multihash, CID, and multicodec as design constraints. I really want to be able to support arbitrary formats like these with IPLD, instead of rewriting all the existing software to match those constraints.
> A quick look at the JS implementation of Flatbuffers https://github.com/google/flatbuffers/blob/master/js/flatbuffers.js and it looks like everything is synchronous.
I'm aware of that, & I think I've mentioned in other threads that it is the combination of two constraints that makes it problematic:

Because then you can traverse (provide `links`, `tree`) and access individual fields (`get`) without having to do both.
@mikeal does https://github.com/ipld/js-ipld-stack/issues/7#issuecomment-481395037 clarify my argument at all? I think it's a better & simpler example to focus on.
@Gozala sort of, but I already had this use case in mind. The thing is, the Data Model doesn’t know what an “encrypted field” is. The way encryption will have to be layered in here is through schemas, somehow, and that’s a layer above this Block interface. The schema is what will tell a traverser that “properties are interpreted differently in this data structure” and the schema layer will have to do most things async because many schemas are built for multi-block data structures.
For instance, we’ll have a schema for a `hamt`, which is a multi-block Map data structure, and it will tell the traverser “This is a thing called `hamt-v1`, use an implementation of that data structure to look up keys in it.” The `Reader` interface is intended for use by the implementor of that data structure; the traverser will essentially hand it these block instances with this reader interface and will get back values looked up through multiple layers of links in the graph.
I expect encryption to work similarly. `Reader` is a block-level tool for the decryption program to use to get at data in the block, but it will be returning values through a different `async` interface, not through the `Reader` interface itself. In other words, an encrypted data structure implementor is a consumer of the `Reader` interface, not an implementor, and they’ll be implementing a different async interface we haven’t defined yet, because we’re still working on finalizing the Block level interface 😊
As far as the IPLD Data Model (layer graphic, spec) is concerned, an encrypted field is either:
Anything else is going to have to happen a layer up, which right now we think will be using IPLD Schemas (we’re still working on specifying this; the first test will probably be `hamt`).
I think I found a way to distill this into a simpler example. Imagine a message format that contains three fields `a`, `b`, `c` and is encoded as `<a><b><c>`. It is possible to decode `a` without any asynchronicity. However, to read `b` you first need `a`, which contains the secret key required to decode the contents of `b`; therefore decoding `b` is an async op. As for `c`, you need another key, from `b`, to decrypt it, so it's also async. And since it's encrypted with a different key, sometimes `b` is enough and decoding `c` would be a waste. At other times you may not even care about `b` or `c`, because `a` has metadata telling you it's not for you.
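To make the dependency chain concrete, here is a toy model of that `<a><b><c>` layout (every name is invented, and XOR over byte arrays stands in for real async decryption): reading `a` is sync, while `b` and `c` are unavoidably async, and each can be skipped when not needed.

```javascript
// Stand-in for a real asynchronous decryption primitive.
const decrypt = async (ciphertext, key) =>
  ciphertext.map((byte) => byte ^ key)

const makeReader = (encoded) => ({
  // `a` is plaintext metadata carrying the key for `b`: a sync read.
  a: () => encoded.a,
  // `b` needs a's key, so reading it is inherently async.
  b: () => decrypt(encoded.b, encoded.a.keyForB),
  // `c` needs a key recovered from decoded `b` (its first byte here),
  // so it is async too, and skippable when `b` alone is enough.
  async c () {
    const b = await this.b()
    return decrypt(encoded.c, b[0])
  }
})
```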
@mikeal thanks for explaining (I posted the above comment before yours got through). I have to admit I do not fully understand how the two layers would fit together, but I'm getting a sense that what you're telling me is that the `Block` for that example would have to be represented like:
```ts
type PrivateMessage<a> = { nonce: string, key: BodyPublicKey, secret: SecretBox<a> }
type SecretBox<a> = { recepients: Iterator<Buffer>, body(BodyPrivateKey): a }
```
So that `get('secret')` would give me some bytes, and the associated schema somehow will know to decode them as `SecretBox<a>`, and yet another schema will know that if I was able to decrypt a recipient (with my private key) then I can use that info to find the `body` offset and decrypt it with that key.
This makes me wonder if decode / encode or any of the reader methods even need to be async. I am starting to think they don't; in fact, all the existing IPLD resolvers could probably be implemented as such.

I am also somewhat puzzled about the schema / codec separation, mostly because I imagine them as interconnected. I mean, I expect the schema will guide traversal on how to interpret the block, but then sometimes (like in the discussed example) the information to guide this is aggregated from data extracted from other fields. It almost feels to me that the described use case doesn't even need a codec, and only needs a schema.

@mikeal where can I find the schema for hamt, or read up on multi-block Map data structures?
@Gozala today is a good day to ask that question :) Rod just wrote up a good proposal for how we talk about collections (multi-block data structures) https://github.com/ipld/specs/pull/110/files#diff-40c75a0f92193756e5dfa92bdeaf2165R103
There’s also a `hamt` spec, but it’s very specific to `cbor` right now because we haven’t yet landed the schema spec, which is what provides a lot of the language to talk about this in a way that isn’t codec specific: https://github.com/ipld/specs/pull/109
Also, @warpfork has been working on schemas, some examples are here https://github.com/ipld/go-ipld-prime/tree/master/typed/declaration and he gave a talk on it yesterday that we recorded (I invited you but I don’t think you were able to make it) https://protocol.zoom.us/recording/share/mq_shATA6zhzI-XYlSsUMTveKnMCgHlDuTiqL8xVwr2wIumekTziMw
Watched the presentation from @warpfork (which made me wonder if he's aware of https://typedefs.com/about/), which is really interesting, but I still don't understand how one would go about extracting bits from one field of the struct to use for decoding another bit. I suspect that the briefly mentioned Advanced Data Model might be a way to address this, but it's still unclear how one would do it. I'll follow the other pointers to see if I can get a better idea, but if someone is able to enlighten me with specific answers I'd really appreciate it.
The sync-reader and async-decoder model still seems to work if we're encoding encrypted components within CIDs. So a synchronous `reader.get()` that encounters a CID just returns `{ value, remaining }` as it does now; you take `value`, which is a CID, and it gets passed through the standard asynchronous fetch-by-CID-and-decode process, where you can handle decryption asynchronously if needed, and you then get back a synchronous reader on top of a decrypted block.
So maybe the focus to support seamless (or near seamless) encrypted blocks needs to be on the fetch-by-CID process.
```js
let block = await dag.get(encrypted) // where `encrypted` is a CID with inline encrypted content
```
That `get()` needs the smarts to understand that this is not an identifier of remote content but the content itself. Do we need to create a new abstraction around the fetch process to handle this, one that IPFS can plug in to but that could also be backed by any other kind of store?
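A small sketch of that split (the `{ value, remaining }` shape comes from the comment above; the `isCID` marker and `fetchAndDecode` hook are placeholders I invented): the reader stays sync and stops at links, while a separate async layer resolves them, which is where decryption could also live.

```javascript
// Placeholder link check; a real implementation would test for a CID instance.
const isCID = (value) => Boolean(value) && value.__cid === true

// Synchronous reader: walks a path and stops when it hits a link,
// returning the link plus the unresolved remainder of the path.
const makeReader = (node) => ({
  get (path) {
    const parts = path.split('/').filter(Boolean)
    let value = node
    for (let i = 0; i < parts.length; i++) {
      value = value[parts[i]]
      if (isCID(value)) {
        return { value, remaining: parts.slice(i + 1).join('/') }
      }
    }
    return { value, remaining: '' }
  }
})

// Async layer: resolves links via fetchAndDecode (async decryption of an
// inline encrypted block could happen here too), then keeps walking.
const traverse = async (reader, path, fetchAndDecode) => {
  let { value, remaining } = reader.get(path)
  while (isCID(value)) {
    const node = await fetchAndDecode(value)
    ;({ value, remaining } = makeReader(node).get(remaining || '/'))
  }
  return value
}
```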
I burned through the bandwidth I had available for this, so I'd like to summarize where I stand, as I don't know if I'll have a chance to contribute to the new API in the near future.
`Block` API with sync `Reader` API: I have an impression that it constrains the codec implementer and in some cases may lead to suboptimal (code or perf) results. To be clear, I don't think constraining an API is necessarily bad; it's just that I don't understand the value proposition / motivation here. Mostly the discussions were around not wasting ticks, but then again that could be addressed differently, e.g. https://github.com/ipld/js-ipld-stack/issues/7#issuecomment-481107429

Hope this feedback is helpful.
"Advanced Data Model / Schemas" could indeed be helpful for non-trivial decode operations. I've been working on how this might operate and potentially be somewhat transparent to the user, and might post some initial thoughts into ipld/specs. I'm mainly focused on block-spanning collections, but advanced within-block operations that require additional logic are a good use-case to keep in mind up front.
Regarding the async thing: the trade-off for performance is the very awkward Promise-checking code required for the fast path. I'm coming to the conclusion that it's worth it to make sure we get encryption as a first-class citizen in IPLD. If it's a bolt-on operation rather than something that can be done transparently, then it's always going to be awkward, like adding the `s` to `http`. It just means that we take care when building on top of optionally-async APIs in performance-important paths.
It doesn’t need to be awkward; it can be prettier:
```js
Task.spawn(function* () {
  const block = Block.decoder(bytes, "dag-cbor")
  // can resume right away if the yielded value isn't a promise, or await and resume then
  const cid = yield block.cid()
  // get can return right away if the field can be accessed sync (or is cached), otherwise will await
  const field = yield block.get('secret')
  // Can also do the equivalent of `Promise.all` / `Promise.race` without awaiting
  const [links, paths] = yield Task.all([block.links(), block.tree()])
})
```
The implementation of `spawn` can decide when to resume the generator (wait if it’s a promise, or resume right away if not).
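A minimal sketch of such a `spawn` (my own trampoline, assuming yielded values are either plain values or Promises, not a library implementation): it resumes the generator synchronously for plain values and only waits when a Promise is yielded.

```javascript
// Drive a generator, resuming in the same tick for non-Promise yields
// and crossing the microtask boundary only for actual Promises.
const spawn = (task) => new Promise((resolve, reject) => {
  const iterator = task()
  const step = (input) => {
    try {
      let result = iterator.next(input)
      while (!result.done && !(result.value instanceof Promise)) {
        result = iterator.next(result.value) // synchronous resume
      }
      if (result.done) return resolve(result.value)
      result.value.then(step, reject) // deferred resume
    } catch (err) {
      reject(err)
    }
  }
  step(undefined)
})
```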
I find it awkward having to have a `.reader()` method that returns a `Reader` instance, because the use case I have in mind would pretty much always require me to get a reader. Why not just incorporate `get`, `links`, `tree` into `Block`?