Closed Gozala closed 1 year ago
Eric just redid all the Go storage interfaces in an effort to minimise all the cruft that's built up for the various layers: https://pkg.go.dev/github.com/ipld/go-ipld-prime/storage#pkg-types
Aside from having adapters for all the existing storage layers, the main aim here is to do feature detection and minimal interfaces. The basic `Storage` is just a `Has()`. `WritableStorage` adds a `Put()`, `ReadableStorage` adds a `Get()`, and then there's streaming and batch versions that are intended to only be used via feature detection, not hard-wiring interfaces to require them unless strictly necessary.
It'd be awesome if we could try and do something similar in JS because we have a similar (though slightly less dire) sprawl of interfaces for storage and they're mostly far too extensive for most uses.
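For illustration, here is a minimal sketch of what a similarly capability-split set of interfaces could look like in TypeScript. All names here are hypothetical, not from any existing package; it just mirrors the Go approach of tiny base interfaces plus optional capabilities discovered by feature detection.

```typescript
// Hypothetical minimal storage interfaces, split by capability.
interface Storage {
  has(key: string): Promise<boolean>
}

interface ReadableStorage extends Storage {
  get(key: string): Promise<Uint8Array | undefined>
}

interface WritableStorage extends Storage {
  put(key: string, value: Uint8Array): Promise<void>
}

// Optional batch capability, used via feature detection rather than
// required by the base interface.
interface BatchWritableStorage extends WritableStorage {
  putMany(entries: Array<[string, Uint8Array]>): Promise<void>
}

// Feature detection: use putMany when present, otherwise fall back
// to sequential puts.
async function putAll(
  store: WritableStorage | BatchWritableStorage,
  entries: Array<[string, Uint8Array]>
): Promise<void> {
  if ('putMany' in store) {
    await store.putMany(entries)
  } else {
    for (const [key, value] of entries) await store.put(key, value)
  }
}

// A toy in-memory store implementing only the minimal interfaces.
function memoryStore(): ReadableStorage & WritableStorage {
  const m = new Map<string, Uint8Array>()
  return {
    has: async (key) => m.has(key),
    get: async (key) => m.get(key),
    put: async (key, value) => { m.set(key, value) }
  }
}
```

A consumer that only imports would depend on `WritableStorage` alone, and batch support stays an optional optimization rather than a required method.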
> Maybe we could refactor the API such that instead of writing blocks and emitting unixfs entries it would emit entries that have their own block iterators, so they could be written into the blockstore as needed.
Would the generator that consumes the main iterator have to consume the block iterators for the main iterator to continue? I remember using an interface like this in a `tar` library and there were some tradeoffs that had to be made w.r.t. flow control.
> It'd be awesome if we could try and do something similar in JS because we have a similar (though slightly less dire) sprawl of interfaces for storage and they're mostly far too extensive for most uses.
I don’t see why we can’t copy this for future block store abstractions, but I do want to avoid using block store abstractions at all when we can help it. I’d much prefer for encoder libraries like this to offer block generators.
Searching through the code I noticed there's at least one place where we use the read interface. I don't have enough context yet to tell if the `get` is really needed or if it's an accidental artifact.
> Eric just redid all the Go storage interfaces in an effort to minimise all the cruft that's built up for the various layers: https://pkg.go.dev/github.com/ipld/go-ipld-prime/storage#pkg-types
> Aside from having adapters for all the existing storage layers, the main aim here is to do feature detection and minimal interfaces. The basic `Storage` is just a `Has()`. `WritableStorage` adds a `Put()`, `ReadableStorage` adds a `Get()`, and then there's streaming and batch versions that are intended to only be used via feature detection, not hard-wiring interfaces to require them unless strictly necessary.
I recall splitting the store interface like that myself when defining it in TS (but I guess it was undone later on): https://github.com/ipfs/js-ipfs/blame/34e14927f7b569d827426e8c269ce77e2a2ceba6/packages/ipfs-core-types/src/store.ts#L21-L61
Generally I find splitting interfaces by the capabilities they require to be a good rule of thumb that leads to better separation of concerns.
I have put a bit more thought into this, and I think it will speak to the question above. Here are the specific wants as I see them:
This is, as far as I can tell, not the case now: the flushing logic will not emit a node until its blocks are written into the blockstore. This probably makes sense for IPFS; for a CAR writer, probably not so much.
The current API fails to meet some of those goals because it produces output that emits FS entries but no output for the blocks. Instead it stores blocks as a side effect, and does so in a way where writing a block holds up FS entry creation. Furthermore, because it parallelizes things across files, it becomes really difficult to coordinate memory use (or so it seems).
I think what we want here, instead of storing blocks as a side effect, is to make them part of the output. That way another actor could flush them into a blockstore or CAR file and have the flexibility to do it in batches, sequentially, etc. If blocks aren't consumed from the output, that would halt the importer; ideally the API would provide some form of queuing strategy so things can move along until the queue is full, then halt until there is more space.
To accomplish that, I think we could make the importer return two separate outputs, `{entries, blocks}` (so that blocks and entries could be consumed concurrently; specifically, an entry could be consumed before all the blocks from the previous one have been).
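A rough sketch of that two-output shape (hypothetical names throughout; a real implementation would chunk, hash, and apply backpressure between the two outputs rather than buffering everything up front):

```typescript
// Hypothetical two-output importer: blocks are part of the output
// rather than a side effect written into a blockstore.
interface Block { cid: string; bytes: Uint8Array }
interface Entry { path: string; cid: string }

interface ImportOutput {
  entries: AsyncIterable<Entry>
  blocks: AsyncIterable<Block>
}

// Toy importer over in-memory "files". The `cid` here is a placeholder
// string standing in for real hashing/encoding.
function importAll(
  files: Array<{ path: string; bytes: Uint8Array }>
): ImportOutput {
  const blocks: Block[] = []
  const entries: Entry[] = []
  for (const file of files) {
    const cid = `cid-${file.path}` // placeholder, not a real CID
    blocks.push({ cid, bytes: file.bytes })
    entries.push({ path: file.path, cid })
  }
  async function* iterate<T>(items: T[]): AsyncIterable<T> { yield* items }
  return { entries: iterate(entries), blocks: iterate(blocks) }
}

// The consumer decides how to flush blocks: into a blockstore, a CAR
// file, in batches, etc.
async function collect<T>(source: AsyncIterable<T>): Promise<T[]> {
  const out: T[] = []
  for await (const item of source) out.push(item)
  return out
}
```

The key point is that the block consumer and the entry consumer are independent actors; nothing here writes into a store on the importer's behalf.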
Here are a couple of options I can see, with pros and cons:

1. Take a `WritableStreamWriter` to write into. `WritableStream`s still aren't widely available and would require polyfills.
2. Return `ReadableStream`s. This has pros and cons: on one hand it complicates the API and makes it the importer's concern. On the other hand, the user could specify a `highWaterMark` in byte length, and the importer would be in a better position to implement a queuing strategy for the `ReadableStream`. Although an argument could be made that the same `highWaterMark` could be specified to implement one, at that point we'd be recreating `ReadableStream`s, and it's not as trivial as it may seem.
This was what I was suggesting here:
> It may be a good idea to untangle dag assembly from importing. E.g. maybe we could refactor the API such that instead of writing blocks and emitting unixfs entries it would emit entries that have their own block iterators, so they could be written into the blockstore as needed.
However, after thinking it through more via the outlined wants, it is clear that, as proposed, it would not be a good API. Specifically, an FS entry would either need to be eager, producing blocks and holding them in memory until GC-ed (which doesn't meet one of our wants), or it would have to be lazy, in which case obtaining the node for that entry would be awkward, as it would require consuming the blocks first. Furthermore, it entangles the block consumer with the entry consumer, which does not seem ideal either.
This should answer @mikeal's question (quoted below); I no longer think it is a good approach.
> Would the generator that consumes the main iterator have to consume the block iterators for the main iterator to continue?
An argument could be made that providing a `BlockWriter` API isn't that different from providing a `WritableStreamWriter`; better yet, it would not require polyfills, etc. Yet, like @mikeal, I also tend to be biased towards an API that gives you blocks, as opposed to one that writes them for you.
As a side note, it may be a good idea to align the `BlockWriter` API with the `WritableStreamWriter` API, which would imply renaming `put` to `write`, which in turn would make a `WritableStreamWriter<Block>` a valid `BlockWriter`.
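Sketched in TypeScript (the `Block` and `BlockWriter` shapes here are illustrative, not the actual package definitions):

```typescript
// With `write` instead of `put`, any object with a matching `write`
// method satisfies BlockWriter structurally -- including the writer
// returned by `WritableStream<Block>.getWriter()`.
interface Block { cid: string; bytes: Uint8Array }

interface BlockWriter {
  write(block: Block): Promise<void>
}

// A trivial BlockWriter backed by an array.
function arrayWriter(sink: Block[]): BlockWriter {
  return {
    write: async (block) => { sink.push(block) }
  }
}

// A hypothetical importer-ish function depends only on the minimal
// writer capability, not on a full blockstore.
async function writeBlocks(writer: BlockWriter, blocks: Block[]): Promise<void> {
  for (const block of blocks) await writer.write(block)
}
```

Because TypeScript types are structural, no explicit adapter is needed between a stream writer and a `BlockWriter` once the method name matches.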
Closing this as the importer now only requires the `put` method and the exporter only `get`.
https://github.com/ipld/js-unixfs also now exists which is closer to using web streams.
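To illustrate the closed-out state, here is a hedged sketch of the minimal duck-typed stores that would now satisfy each side. The signatures are illustrative, not copied from `interface-blockstore`, and the string `CID` type is a stand-in for a real CID object:

```typescript
// Stand-in CID type for illustration only.
type CID = string

// All the importer needs is `put`; all the exporter needs is `get`.
interface PutStore { put(cid: CID, bytes: Uint8Array): Promise<void> }
interface GetStore { get(cid: CID): Promise<Uint8Array> }

// A minimal in-memory store satisfying both capabilities.
function minimalStore(): PutStore & GetStore {
  const m = new Map<CID, Uint8Array>()
  return {
    put: async (cid, bytes) => { m.set(cid, bytes) },
    get: async (cid) => {
      const bytes = m.get(cid)
      if (bytes === undefined) throw new Error(`block not found: ${cid}`)
      return bytes
    }
  }
}
```

This is exactly the capability-split outcome argued for above: a caller can hand the importer a bare `{ put }` and the exporter a bare `{ get }` without implementing the full blockstore surface.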
Context
I have been working on https://github.com/nftstorage/nft.storage/issues/837 and ran into complications with utilizing the importer, because it requires a blockstore implementation:
https://github.com/ipfs/js-ipfs-unixfs/blob/07f244ad23ebfd11e5310ab83aee19d8cd006dfa/packages/ipfs-unixfs-importer/src/index.js#L24-L29
which unfortunately isn't a simple API to supply, given that it has a large number of methods:
https://github.com/ipfs/js-ipfs-interfaces/blob/17a18d9af34a39ea7b066d523893c3254439f50b/packages/interface-blockstore/src/index.ts#L29-L31 https://github.com/ipfs/js-ipfs-interfaces/blob/17a18d9af34a39ea7b066d523893c3254439f50b/packages/interface-store/src/index.ts#L23-L175
I also suspect that the importer does not need all of those methods to do its job. Given its name, I would expect it probably needs a subset of the write API.
Proposal
Option 1
I would like to propose loosening the requirements on the importer, so that something like `BlockWriter` or possibly `CarEncoder` could be used instead.

Option 2
It seems that the importer does two tasks: assembling the dag and writing its blocks into the blockstore.
It may be a good idea to untangle dag assembly from importing. E.g. maybe we could refactor the API such that instead of writing blocks and emitting unixfs entries it would emit entries that have their own block iterators, so they could be written into the blockstore as needed.
I realize it's kind of the case already, given that the dag builder passes things to the tree builder, which then flushes them into the blockstore:
https://github.com/ipfs/js-ipfs-unixfs/blob/07f244ad23ebfd11e5310ab83aee19d8cd006dfa/packages/ipfs-unixfs-importer/src/tree-builder.js#L83-L116