Gozala opened 5 years ago
hrm.... the only problem I see is that `get`, `tree` and `links` are all synchronous but reader creation is async. If we were to lazily create the reader instance and cache it on the instance, all those methods would have to be async. This is minimal overhead for `get`, but right now async generators are a bit slower than normal generators and would add a fair amount of event-system overhead on what is effectively a set of sync operations.

To be clear, I’m in agreement with you that moving these methods to the top level would be a nicer API and developer experience, I just want to figure out a good solution to this async vs sync issue.
> hrm.... the only problem I see is that `get`, `tree` and `links` are all synchronous but reader creation is async. If we were to lazily create the reader instance and cache it on the instance, all those methods would have to be async. This is minimal overhead for `get`, but right now async generators are a bit slower than normal generators and would add a fair amount of event-system overhead on what is effectively a set of sync operations.
I did not realize they were sync, but that is actually a problem IMO. What if `decode` / `encode` needs to be async, which would be the case whenever cryptography is involved?

It's not a problem in the existing examples because the whole block is decoded in one step (which is presumably why getting a reader is async & the rest can be sync), but if we consider #3, where you don't want to decode the whole thing, it seems this would not really work.
> To be clear, I’m in agreement with you that moving these methods to the top level would be a nicer API and developer experience, I just want to figure out a good solution to this async vs sync issue.
In the other threads and in the submitted type defs I proposed to make all ops return `Eventual<a> = a | Promise<a>`, which would allow implementations to return in the same tick when possible, or a promise otherwise. I would suggest that might be a reasonable way to deal with the async / sync dilemma, as things like cbor could just implement sync get / list / tree, and implementations that can't would use async APIs instead.
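A minimal sketch of what consuming such an `Eventual<a>` could look like (the `then` helper name here is my own invention, not a settled API): the consumer stays in the same tick for plain values and only defers for actual Promises.

```javascript
// Hypothetical helper for the Eventual<a> = a | Promise<a> idea:
// apply `f` to a result without forcing a microtask when it's synchronous.
const isPromise = (value) => value instanceof Promise

const then = (eventual, f) =>
  isPromise(eventual) ? eventual.then(f) : f(eventual)

// A codec returning synchronously is observed in the same tick...
let seen = null
then(42, (v) => { seen = v }) // `seen` is 42 immediately
// ...while a Promise-returning codec defers as usual.
then(Promise.resolve('later'), (v) => console.log(v))
```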
> I did not realize they were sync, but that is actually a problem IMO. What if `decode` / `encode` needs to be async, which would be the case whenever cryptography is involved.
This is a little complicated and not well documented, so let me walk through the full interface and you can tell me if this works for your use case or not.
`reader` is async, so if there is async work necessary to do reading it can happen there; it doesn’t need to happen in the `Reader()` instance itself. This means that you can still do async work for partial decodings and decryption, assuming that you only need a single endpoint per block to do so. If there is a case where you actually need to do async work in partial decoding that must happen on a per-property basis, then this won’t work and we’ll need to re-think it a bit.
> If there is a case where you actually need to do async work in partial decoding that must happen on a per-property basis, then this won’t work and we’ll need to re-think it a bit.
That is what I meant.
For example, Flatbuffers use an index to know the offsets of all the encoded data, and therefore you can read a deeply nested field value without decoding anything but the bytes for that value. Now let's say the actual decoding is an async operation; that would imply that each property decode is an async operation as well, and then you have a problem.

Now the Flatbuffer API (if I recall correctly) isn't actually async, so it should not necessarily be a problem. However, if you consider the case where properties might be encrypted, decode would need to decrypt and become async, and the current API design would be incompatible with such a use case. Which is why I'm suggesting not to bake that assumption into the API, or such use cases will become problematic.
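As an illustration of the offset-index idea (the layout and names here are invented for the sketch, not Flatbuffers' actual wire format): an index mapping field names to byte ranges lets you slice out a single field without decoding its siblings, and a per-field decode step, sync or async, could then be applied to just that slice.

```javascript
// Toy offset-indexed reader: `index` maps field name -> [start, length].
// Reading one field touches only that field's bytes.
const lazyReader = (bytes, index) => ({
  get (field) {
    const [start, length] = index[field]
    return bytes.subarray(start, start + length)
  }
})

// Five bytes of "encoded" data: field a occupies [0, 2), field b [2, 5).
const bytes = Uint8Array.from([1, 2, 3, 4, 5])
const reader = lazyReader(bytes, { a: [0, 2], b: [2, 3] })
```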
I think https://github.com/ipld/js-ipld-stack/issues/10#issuecomment-480086662 provides a nice solution, as you can obtain `BlockView` synchronously (which is similar to `Reader`) but do the actual decoding asynchronously when `.decode()` is called. Essentially, the `await` step happens later.
Pulling in @rvagg and @mcollina for a question: “At what scale do we need to worry about the performance impact of async generators vs regular generators?”
I’m tempted to just make all the interfaces async in order to open up more options in the future but I’m a little worried about the perf hit.
It’s significant in hot paths. It’s a promise per chunk and at least a microtick. Moreover if they are not implemented natively (old browsers) it’s 100+ lines of code each.
I would not make all functions async for similar reasons.
The `Eventual<a> = a | Promise<a>` is nice, but unfortunately AFAIK current JS engines will still wrap a return value in a Promise if you use it as if there is one. I think this goes equally for `await` and `for await` too, sadly. The semantics require crossing the next-tick boundary if you use `await`, just like whenever you touch a `Promise`, I believe.
For performance-critical cases I bet it wouldn't be hard to come up with an iteration interface that is conditional on whether an API returns a Promise or not. Eschew the much nicer `for await ()` for a clunkier API: `iterate(thing, onElement)`, where `iterate()` pulls `element = thing()` and, if it's synchronous, `onElement(element)` is called synchronously; if it's a `Promise` then it traverses the microtick boundary (at least), but that's hopefully indicating that there are inherent costs, like decryption, that need to be deferred.
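A rough sketch of that `iterate(thing, onElement)` shape (the `done` sentinel and exact signature are my guesses at what's being described, not a settled API): the loop stays synchronous until the producer actually hands back a Promise.

```javascript
// Sentinel the producer returns when exhausted (an assumption of this sketch).
const done = Symbol('done')

// Pull elements from `thing` (a zero-arg producer) and deliver each to
// `onElement`. Synchronous values are delivered in the same tick; only a
// Promise makes the loop cross the microtask boundary.
const iterate = (thing, onElement) => {
  for (;;) {
    const element = thing()
    if (element === done) return
    if (element instanceof Promise) {
      return element.then((value) => {
        onElement(value)
        return iterate(thing, onElement) // resume pulling after the await
      })
    }
    onElement(element) // fast path: no Promise allocated, no deferral
  }
}
```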
But that's assuming this even becomes a bottleneck! I bet that by the time the other bottlenecks are taken care of (CBOR decoding, block decryption, and of course the I/O costs of fetching blocks in the first place!) and this becomes the next most important bottleneck, JS engines will have made this problem go away (in a relative sense). Things like this, which was just merged into ECMA262, certainly help, and V8 has an implementation behind a flag waiting (maybe it's now unflagged?).
Also, in the inline-block case, couldn't a `Reader` encounter an inline block that needs to go through a `decode()` cycle, so `get()` (and friends) would need to be async anyway? Or is the assumption that inline blocks would be decoded in the initial `decode()` pass? Maybe that's too much cost, because you might want to `get()` something that's not in an inline block, so you could avoid the secondary decode cost altogether, which would be nice if that cost is high (e.g. decryption).
```
{
  "cheap": "thing",
  "secret-box": encrypted-inline-block
}
```
Where you just want to `get('/cheap')`. But being able to defer the costs of dealing with an inline block assumes pushing async all the way up to `get()` et al., doesn't it?
You can avoid that:

```js
const unbox = v => v instanceof Promise ? null : [v]
const main = async (block) => {
  const node = block.decode()
  const maybeData = unbox(node)
  // Only await when decode() actually returned a Promise
  const data = maybeData ? maybeData[0] : await node
  // ....
}
```

Not pretty, but it will do the trick.
@rvagg https://github.com/tc39/ecma262/pull/1250 is as much as we’ll get from the spec. A fact not very well known is that I did all the Node-backing analysis for that :/. Even with that change, `await` uses two microticks.

The other issue is that using `await` prevents V8 inlining and object escape analysis. It will always have a cost; I agree it will go down, but it will never disappear. When I did my perf analysis of IPFS, I found a “death by a thousand cuts” situation, mostly caused by the composability-over-performance principle. One of the most used patterns was using `setImmediate` to fake asynchronicity :/.
Given that the `reader` method is async, we’re already giving an entry point for async work. I think it’s reasonable to say “you can do partial reads, but you have to do any async work up-front so that the individual reads are sync” if that’s the performance profile we actually need from people implementing that interface. This is all at the block level, so requiring sync reads from an atomic entity like a block seems reasonable given the constraints we’re under.
This is not a comment in favor of an async vs sync API: I am getting a sense that encrypted blocks are treated somewhat as a secondary use case. I would like to invite everyone in the community to flip that around and think of non-encrypted blocks as the secondary use case; otherwise I fear we'll end up building an ecosystem where security & privacy are afterthoughts.

Now back to the sync / async arguments. I think choosing one would be a mistake, as it automatically makes the other choice a secondary use case. If you choose always-async, you end up paying in microtasks; if you choose sync, you end up making lazy decoding second class and probably paying in both memory & CPU.
This is why I am advocating to make all methods return `Eventual<a>`, as in `a | Promise<a>`, which allows the producer (codec) to optimize what matters on its end and allows the consumer to optimize what matters on its end. As in https://github.com/ipld/js-ipld-stack/issues/7#issuecomment-481107429, it would allow the consumer to avoid `await`-ing whenever the producer is able to fulfill a request immediately, and to `await` when it isn't. In non-performance-critical code (e.g. demos, learning material) that optimization could be ignored and plain `await` could be used everywhere. Better yet, both consumer and producer can start with a naive implementation (`async` methods) and optimize later on by avoiding promises when results can be returned in the same tick.
Does this penalty apply when I do `let x = await syncMethod()`?
> using setImmediate to fake asynchronicity
These probably came from the waterfall code in all the interfaces. My concern is that this may still be a problem after the move to async/await but will be a little harder to observe as such.
> I would like to invite everyone in the community to flip that around and think of non-encrypted blocks as the secondary use case, otherwise I fear we'll end up building an ecosystem where security & privacy are afterthoughts.
The question isn’t about which use cases are secondary vs primary, it’s about what layer of the stack those use cases belong in and whether those interfaces can be synchronous. Even if the codec format layer goes sync (there’s a separate thread about that and I’m not sure it will), the Block encode/decode methods will still need to be async, because `get-format` MUST be async: in the browser it needs to dynamically load the codec. Whatever perf penalty an async wrap imposes on this, it’s nothing compared to the eventual default bundle size we’ll have if we have to bundle every codec you might ever need ;) Note: we could improve the performance of this in the future by caching the module return and using the raw promises interface, making the encode/decode interfaces lose their async wrap when the encode/decode and the codec lookup happen synchronously.
So, this block interface will still have async methods for encode/decode/reader; the remaining question is whether we can keep the individual read methods in the Reader interface sync, and I think we can. At the Block layer interface it’s reasonable to expect that any necessary decryption for partial reads would be done up-front. The encryption use cases I’ve seen so far use multiple blocks (with inline blocks), which means that reading into an encrypted property is a multiblock operation and the block will need to be decrypted atomically, which means you can do the individual reads on the block synchronously.
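As a sketch of that contract (all names below are assumptions about the eventual API shape, not the real implementation): the one async `reader()` call front-loads any decode or decryption work, and everything on the returned reader reads synchronously.

```javascript
// Hypothetical Reader over an already-decoded node: get() is fully sync.
class Reader {
  constructor (node) { this.node = node }
  get (path) {
    // Walk '/a/b'-style paths over the decoded node: no awaiting needed.
    return path.split('/').filter(Boolean)
      .reduce((obj, key) => obj[key], this.node)
  }
}

// Hypothetical Block whose async reader() does all async work up front
// (decompression, decryption, ...), once per block.
class Block {
  constructor (bytes, decode) {
    this.bytes = bytes
    this.decode = decode // may be async
  }
  async reader () {
    const node = await this.decode(this.bytes)
    return new Reader(node)
  }
}
```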
> Does this penalty apply when I do `let x = await syncMethod()`?
That async function will be suspended on `await` regardless of what `syncMethod` returns. I know there have been some discussions on the SpiderMonkey team about doing a clever analysis to avoid suspending when it would not be perceivable, but I don't know exactly where they landed on that.
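This is easy to observe directly. In the standalone snippet below (not project code), the caller's next line runs before the code after `await`, even though `syncMethod` returns a plain number:

```javascript
// `await` suspends the async function even for non-Promise values:
// 'after-call' is recorded before 'after-await'.
const order = []
const syncMethod = () => 42 // returns a plain value, not a Promise

const run = async () => {
  order.push('before-await')
  const x = await syncMethod() // still yields to the microtask queue
  order.push('after-await')
  return x
}

const result = run()
order.push('after-call') // executes before the awaited continuation resumes
```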
So the way I see it (please correct me if I'm mistaken), the only thing that `reader()` gives you is the ability to do one async call to `reader()` and then be able to do `.get()`, `.tree()`, `.links()` without further awaiting. Implying that there is an expectation that one would need to do multiple of those, and that's why it matters.
However, I'm really not seeing how I could provide sync `get()`, `tree()`, `links()` without actually going and decoding the whole thing. Or, in cases where I can do that, I don't expect I'd need an async operation to create a reader.
It very much could be that the use cases I have in mind are "multiblock" (I don't actually know what that is), but from where I stand `reader()` forces me to use a more complicated interface without allowing me to optimize anything.
I am also doing some experimentation with this API & a forked version of it, as I think it would help me better understand my own and assumed constraints. Hopefully I'll have some more evidence-based feedback later this week.

I think it might be best to try to implement a Flatbuffer IPLD codec, as that would provide a specific example of an encoding where blocks could be traversed without decoding & each field could be decoded without having to decode anything else.

A quick look at the JS implementation of Flatbuffers https://github.com/google/flatbuffers/blob/master/js/flatbuffers.js and it looks like everything is synchronous.
> At the Block layer interface it’s reasonable to expect that any necessary decryption for partial reads would be done up-front. The encryption use cases I’ve seen so far use multiple blocks (with inline blocks), which means that reading into an encrypted property is a multiblock operation and the block will need to be decrypted atomically, which means you can do the individual reads on the block synchronously.
Ok, so let's say I want to express SSB with IPLD; here is the specific message format used there:

If I were to map it to an IPLD node, this is more or less what it would look like:
```ts
type PrivateMessage<message> = {
  nonce: string,
  key: BodyPublicKey,
  recepients: Concat<Encrypted<Concat<[RecepientCount, BodyPrivateKey]>, RecepientPublicKey>[]>,
  body: Encrypted<message, BodyPrivateKey>
}
```
Now, it is true that it would be reasonable to redesign the format so that `body` would be its own inline block, and `recepients` as well. But the problem is that would require convincing the SSB community to switch to a new format, & even then there would be no way to handle messages encoded in the prior format.
This is just one example that the proposed `reader` interface cannot handle well: I can't just do `.get('key')` without having to go and decrypt everything. I do however think that this is not the only case; many encrypted protocols use similar message formats, because they did not have multihash, CID, and multicodec as design constraints. I really want to be able to support arbitrary formats like these with IPLD, instead of rewriting all the existing software to match those constraints.
> A quick look at the JS implementation of Flatbuffers https://github.com/google/flatbuffers/blob/master/js/flatbuffers.js and it looks like everything is synchronous.
I'm aware of that, & I think I've mentioned in other threads that it is the combination of two constraints that makes it problematic:

Because then you can traverse (provide `links`, `tree`) and access individual fields (`get`) without having to do both.
@mikeal does https://github.com/ipld/js-ipld-stack/issues/7#issuecomment-481395037 clarify my argument at all? I think it's a better & simpler example to focus on.
@Gozala sort of, but I already had this use case in mind. The thing is, the Data Model doesn’t know what an “encrypted field” is. The way encryption will have to be layered in here is through schemas, somehow, and that’s a layer above this Block interface. The schema is what will tell a traverser that “properties are interpreted differently in this data structure” and the schema layer will have to do most things async because many schemas are built for multi-block data structures.
For instance, we’ll have a schema for a `hamt`, which is a multi-block Map data structure, and it will tell the traverser “This is a thing called `hamt-v1`, use an implementation of that data structure to look up keys in it.” The `Reader` interface is intended for use by the implementor of that data structure; the traverser will essentially hand it these block instances with this reader interface and will get back values looked up through multiple layers of links in the graph.
I expect encryption to work similarly. `Reader` is a block-level tool for the decryption program to use to get at data in the block, but it will be returning values through a different `async` interface, not through the `Reader` interface itself. In other words, an encrypted data structure implementor is a consumer of the `Reader` interface, not an implementor, and they’ll be implementing a different async interface we haven’t defined yet, because we’re still working on finalizing the Block level interface 😊
As far as the IPLD Data Model (layer graphic, spec) is concerned, an encrypted field is either:
Anything else is going to have to happen a layer up, which right now we think will be using IPLD Schemas (we’re still working on specifying this; the first test will probably be `hamt`).
I think I found a way to distill this into a simpler example. Imagine a message format that contains three fields `a`, `b`, `c` and is encoded as `<a><b><c>`. It is possible to decode `a` without any asynchronicity. However, to read `b` you first need `a`, which contains the secret key required to decode the contents of `b`; therefore decoding `b` is an async op. As for `c`, you need another key, from `b`, to decrypt it, so it's also async. And since it's encrypted with a different key, sometimes `b` is enough and decoding `c` would be a waste. At other times you may not even care about `b` or `c`, because `a` has metadata telling you it's not for you.
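To make the dependency chain concrete, here is a toy model of that `<a><b><c>` layout (every name is invented, and XOR over byte arrays stands in for real async decryption): reading `a` is sync, while `b` and `c` are unavoidably async, and each can be skipped when not needed.

```javascript
// Stand-in for a real asynchronous decryption primitive.
const decrypt = async (ciphertext, key) =>
  ciphertext.map((byte) => byte ^ key)

const makeReader = (encoded) => ({
  // `a` is plaintext metadata carrying the key for `b`: a sync read.
  a: () => encoded.a,
  // `b` needs a's key, so reading it is inherently async.
  b: () => decrypt(encoded.b, encoded.a.keyForB),
  // `c` needs a key recovered from decoded `b` (its first byte here),
  // so it is async too, and skippable when `b` alone is enough.
  async c () {
    const b = await this.b()
    return decrypt(encoded.c, b[0])
  }
})
```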
@mikeal thanks for explaining (I posted the above comment before yours got through). I have to admit I do not fully understand how the two layers would fit together, but I'm getting a sense that what you're telling me is that the `Block` for that example would have to be represented like:
```ts
type PrivateMessage<a> = { nonce: string, key: BodyPublicKey, secret: SecretBox<a> }
type SecretBox<a> = { recepients: Iterator<Buffer>, body(BodyPrivateKey): a }
```
So that `get('secret')` would give me some bytes, and the associated schema somehow will know to decode them as `SecretBox<a>`, and yet another schema will know that if I was able to decrypt a recipient (with my private key) then I can use that info to find the `body` offset and decrypt it with that key.
This makes me wonder if decode / encode or any of the reader methods even need to be async. I am starting to think they don't; in fact, all the existing IPLD resolvers could probably be implemented as such.

I am also somewhat puzzled about the schema / codec separation, mostly because I imagine them as interconnected. I mean, I expect the schema will guide traversal on how to interpret the block, but then sometimes (like in the discussed example) the information to guide this is aggregated from data extracted from other fields. It almost feels to me that the described use case doesn't even need a codec, and only needs a schema.

@mikeal where can I find the schema for hamt, or read up on multi-block Map data structures?
@Gozala today is a good day to ask that question :) Rod just wrote up a good proposal for how we talk about collections (multi-block data structures) https://github.com/ipld/specs/pull/110/files#diff-40c75a0f92193756e5dfa92bdeaf2165R103
There’s also a `hamt` spec, but it’s very specific to `cbor` right now because we haven’t yet landed the schema spec, which is what provides a lot of the language to talk about this in a way that isn’t codec specific: https://github.com/ipld/specs/pull/109
Also, @warpfork has been working on schemas, some examples are here https://github.com/ipld/go-ipld-prime/tree/master/typed/declaration and he gave a talk on it yesterday that we recorded (I invited you but I don’t think you were able to make it) https://protocol.zoom.us/recording/share/mq_shATA6zhzI-XYlSsUMTveKnMCgHlDuTiqL8xVwr2wIumekTziMw
Watched the presentation from @warpfork (which made me wonder if he's aware of https://typedefs.com/about/), which is really interesting, but I still don't understand how one would go about extracting bits from one field of the struct to use for decoding another bit. I suspect that the briefly mentioned Advanced Data Model might be a way to address this, but it's still unclear how one would do it. I'll follow the other pointers to see if I can get a better idea, but if someone is able to enlighten me with specific answers I'd really appreciate it.
The sync-reader and async-decoder model still seems to work if we're encoding encrypted components within CIDs. So a synchronous `reader.get()` that encounters a CID just returns `{ value, remaining }` as it does now; you take `value`, which is a CID, and it gets passed through the standard asynchronous fetch-by-CID-and-decode process, where you can handle decryption asynchronously if needed, and you then get back a synchronous reader on top of a decrypted block.
So maybe the focus to support seamless (or near seamless) encrypted blocks needs to be on the fetch-by-CID process.
```js
let block = await dag.get(encrypted) // where `encrypted` is a CID with inline encrypted content
```
That `get()` needs the smarts to understand that this is not an identifier of remote content but the content itself. Do we need to create a new abstraction around the fetch process to handle this, one that IPFS can plug in to but that could also be backed by any other kind of store?
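A small sketch of that split (the `{ value, remaining }` shape comes from the comment above; the `isCID` marker and `fetchAndDecode` hook are placeholders I invented): the reader stays sync and stops at links, while a separate async layer resolves them, which is where decryption could also live.

```javascript
// Placeholder link check; a real implementation would test for a CID instance.
const isCID = (value) => Boolean(value) && value.__cid === true

// Synchronous reader: walks a path and stops when it hits a link,
// returning the link plus the unresolved remainder of the path.
const makeReader = (node) => ({
  get (path) {
    const parts = path.split('/').filter(Boolean)
    let value = node
    for (let i = 0; i < parts.length; i++) {
      value = value[parts[i]]
      if (isCID(value)) {
        return { value, remaining: parts.slice(i + 1).join('/') }
      }
    }
    return { value, remaining: '' }
  }
})

// Async layer: resolves links via fetchAndDecode (async decryption of an
// inline encrypted block could happen here too), then keeps walking.
const traverse = async (reader, path, fetchAndDecode) => {
  let { value, remaining } = reader.get(path)
  while (isCID(value)) {
    const node = await fetchAndDecode(value)
    ;({ value, remaining } = makeReader(node).get(remaining || '/'))
  }
  return value
}
```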
I burned through the bandwidth I had available for this, so I'd like to summarize where I stand, as I don't know if I'll have a chance to contribute to the new API in the near future.
`Block` API with sync `Reader` API: I have an impression that it constrains the codec implementer and in some cases may lead to suboptimal (code or perf) results. To be clear, I don't think constraining an API is necessarily bad; it's just that I don't understand the value proposition / motivation here. Mostly the discussions were around not wasting ticks, but then again that could be addressed differently, e.g. https://github.com/ipld/js-ipld-stack/issues/7#issuecomment-481107429

Hope this feedback is helpful.
"Advanced Data Model / Schemas" could indeed be helpful for non-trivial decode operations. I've been working on how this might operate and potentially be somewhat transparent to the user, and might post some initial thoughts into ipld/specs. I'm mainly focused on block-spanning collections, but advanced within-block operations that require additional logic are a good use-case to keep in mind up front.
Regarding the async thing: the trade-off for performance is the very awkward Promise-checking code required for the fast path. I'm coming to the conclusion that it's worth it to make sure we get encryption as a first-class citizen in IPLD. If it's a bolt-on operation rather than something that can be done transparently, then it's always going to be awkward, like adding the `s` to `http`. It just means that we take care when building on top of optionally-async APIs in performance-important paths.
It doesn’t need to be awkward; it can be prettier:
```js
Task.spawn(function* () {
  const block = Block.decoder(bytes, "dag-cbor")
  // can resume right away if the yielded value isn't a promise, or await and resume then
  const cid = yield block.cid()
  // get can return right away if the field can be accessed sync (or is cached), otherwise will await
  const field = yield block.get('secret')
  // Can also do the equivalent of `Promise.all` / `Promise.race` without awaiting
  const [links, paths] = yield Task.all([block.links(), block.tree()])
})
```
The implementation of `spawn` can decide when to resume the generator (wait if it’s a promise, or resume right away if not).
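A minimal sketch of such a `spawn` (my own trampoline, assuming yielded values are either plain values or Promises, not a library implementation): it resumes the generator synchronously for plain values and only waits when a Promise is yielded.

```javascript
// Drive a generator, resuming in the same tick for non-Promise yields
// and crossing the microtask boundary only for actual Promises.
const spawn = (task) => new Promise((resolve, reject) => {
  const iterator = task()
  const step = (input) => {
    try {
      let result = iterator.next(input)
      while (!result.done && !(result.value instanceof Promise)) {
        result = iterator.next(result.value) // synchronous resume
      }
      if (result.done) return resolve(result.value)
      result.value.then(step, reject) // deferred resume
    } catch (err) {
      reject(err)
    }
  }
  step(undefined)
})
```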
I find it awkward having to have a `.reader()` method that returns a `Reader` instance, because the use case I have in mind would pretty much always require me to get a reader. Why not just incorporate `get`, `links`, `tree` into `Block`?