ipfs / js-ipfs-unixfs

JavaScript implementation of IPFS' unixfs (a Unix FileSystem representation on top of a MerkleDAG)

Reduce number of dependencies needed for importer #186

Closed · Gozala closed this 1 year ago

Gozala commented 2 years ago

Context

I have been working on https://github.com/nftstorage/nft.storage/issues/837 and ran into complications using the importer, because it requires a blockstore implementation:

https://github.com/ipfs/js-ipfs-unixfs/blob/07f244ad23ebfd11e5310ab83aee19d8cd006dfa/packages/ipfs-unixfs-importer/src/index.js#L24-L29

Which unfortunately isn't a simple API to supply, given that it has a large number of methods:

export interface Blockstore extends Store<CID, Uint8Array> {}

export interface Store<Key, Value> {
  // lifecycle
  open: () => Promise<void>
  close: () => Promise<void>

  // single-item reads and writes
  put: (key: Key, val: Value, options?: Options) => Promise<void>
  get: (key: Key, options?: Options) => Promise<Value>
  has: (key: Key, options?: Options) => Promise<boolean>
  delete: (key: Key, options?: Options) => Promise<void>

  // streaming variants
  putMany: (
    source: AwaitIterable<Pair<Key, Value>>,
    options?: Options
  ) => AsyncIterable<Pair<Key, Value>>
  getMany: (
    source: AwaitIterable<Key>,
    options?: Options
  ) => AsyncIterable<Value>
  deleteMany: (
    source: AwaitIterable<Key>,
    options?: Options
  ) => AsyncIterable<Key>

  // batching and querying
  batch: () => Batch<Key, Value>
  query: (query: Query<Key, Value>, options?: Options) => AsyncIterable<Pair<Key, Value>>
  queryKeys: (query: KeyQuery<Key>, options?: Options) => AsyncIterable<Key>
}

https://github.com/ipfs/js-ipfs-interfaces/blob/17a18d9af34a39ea7b066d523893c3254439f50b/packages/interface-blockstore/src/index.ts#L29-L31
https://github.com/ipfs/js-ipfs-interfaces/blob/17a18d9af34a39ea7b066d523893c3254439f50b/packages/interface-store/src/index.ts#L23-L175

I also suspect that the importer does not need all of those methods to do its job. Given its name, I would expect it probably only needs a subset of the write API, as sketched below.
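To make that concrete, a write-only subset could be as small as the following. This is a hypothetical sketch reusing the helper types (Options, AwaitIterable, Pair) from the interface above, not an existing export:

export interface WritableStore<Key, Value> {
  put: (key: Key, val: Value, options?: Options) => Promise<void>
  // optional streaming variant, used only if the implementation provides it
  putMany?: (
    source: AwaitIterable<Pair<Key, Value>>,
    options?: Options
  ) => AsyncIterable<Pair<Key, Value>>
}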

Proposal

Option 1

I would like to propose loosening the requirements on the importer, so that something like a BlockWriter or possibly a CarEncoder could be used instead.
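As a rough sketch of what that could enable (the BlockWriter interface below is hypothetical, not an existing export), a CAR encoder such as @ipld/car's CarWriter could then be passed in via a thin adapter, since it already exposes a put({ cid, bytes }) method:

import type { CID } from 'multiformats/cid'
import { CarWriter } from '@ipld/car'

// Hypothetical minimal interface the importer could require
// instead of a full Blockstore
interface BlockWriter {
  put: (cid: CID, bytes: Uint8Array) => Promise<void>
}

// Adapting a CarWriter to that shape is a one-liner
function carBlockWriter (writer: CarWriter): BlockWriter {
  return {
    put: (cid, bytes) => writer.put({ cid, bytes })
  }
}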

Option 2

It seems that the importer performs two tasks:

  1. Importing blocks into the blockstore
  2. Spitting the created unixfs entries out

It may be a good idea to untangle DAG assembly from importing. E.g. maybe we could refactor the API such that, instead of writing blocks and emitting unixfs entries, it would emit entries that carry their own block iterators, so they could be written into the blockstore as needed (see the sketch below).
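A sketch of that shape, where all names (ImportEntry, drain) are hypothetical and exist only to make the idea concrete:

import type { CID } from 'multiformats/cid'

interface ImportEntry {
  path: string
  // blocks making up this entry's DAG; the root block comes last,
  // mirroring the order in which the tree builder flushes today
  blocks: AsyncIterable<{ cid: CID, bytes: Uint8Array }>
}

// the consumer, not the importer, decides where the blocks go
async function drain (
  entries: AsyncIterable<ImportEntry>,
  blockstore: { put: (cid: CID, bytes: Uint8Array) => Promise<unknown> }
): Promise<void> {
  for await (const entry of entries) {
    for await (const { cid, bytes } of entry.blocks) {
      await blockstore.put(cid, bytes)
    }
  }
}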

I realize this is kind of the case already, given that the dag builder passes things to the tree builder, which then flushes them into the blockstore:

https://github.com/ipfs/js-ipfs-unixfs/blob/07f244ad23ebfd11e5310ab83aee19d8cd006dfa/packages/ipfs-unixfs-importer/src/tree-builder.js#L83-L116

rvagg commented 2 years ago

Eric just redid all the Go storage interfaces in an effort to minimise all the cruft that's built up for the various layers: https://pkg.go.dev/github.com/ipld/go-ipld-prime/storage#pkg-types

Aside from having adapters for all the existing storage layers, the main aim here is to do feature detection and minimal interfaces. The basic Storage is just a Has(). WritableStorage adds a Put(), ReadableStorage adds a Get(), and then there's streaming and batch versions that are intended to only be used via feature detection, not hard-wiring interfaces to require them unless strictly necessary.
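Translated loosely into TypeScript (a sketch of the same layering, not an existing package; keys shown as CIDs for the blockstore case, whereas go-ipld-prime uses strings), that could look like:

import type { CID } from 'multiformats/cid'

// Minimal base: existence checks only
interface Storage {
  has: (key: CID) => Promise<boolean>
}

interface ReadableStorage extends Storage {
  get: (key: CID) => Promise<Uint8Array>
}

interface WritableStorage extends Storage {
  put: (key: CID, bytes: Uint8Array) => Promise<void>
}

// Streaming/batch variants are discovered via feature detection
// rather than hard-wired into function signatures
interface StreamingWritableStorage extends WritableStorage {
  putMany: (source: AsyncIterable<{ cid: CID, bytes: Uint8Array }>) => Promise<void>
}

function supportsPutMany (s: WritableStorage): s is StreamingWritableStorage {
  return typeof (s as StreamingWritableStorage).putMany === 'function'
}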

It'd be awesome if we could try and do something similar in JS because we have a similar (though slightly less dire) sprawl of interfaces for storage and they're mostly far too extensive for most uses.

mikeal commented 2 years ago

maybe we could refactor the API such that, instead of writing blocks and emitting unixfs entries, it would emit entries that carry their own block iterators, so they could be written into the blockstore as needed.

would the generator that consumes the main iterator have to consume the block iterators for the main iterator to continue?

i remember using an interface like this in a tar library and there were some tradeoffs that had to be made wrt flow control.

It'd be awesome if we could try and do something similar in JS because we have a similar (though slightly less dire) sprawl of interfaces for storage and they're mostly far too extensive for most uses.

I don’t see why we can’t copy this for future block store abstractions, but I do want to avoid using block store abstractions at all when we can help it. I’d much prefer for encoder libraries like this to offer block generators.

Gozala commented 2 years ago

Searching through the code, I noticed there's at least one place where we use the read interface:

https://github.com/ipfs/js-ipfs-unixfs/blob/4842293678bc66d4d97adf88da36435d70403088/packages/ipfs-unixfs-importer/src/dag-builder/file/index.js#L84

I don't have enough context yet to tell if the get is really needed or if it's an accidental artifact.

Gozala commented 2 years ago

Eric just redid all the Go storage interfaces in an effort to minimise all the cruft that's built up for the various layers: https://pkg.go.dev/github.com/ipld/go-ipld-prime/storage#pkg-types

Aside from having adapters for all the existing storage layers, the main aim here is to do feature detection and minimal interfaces. The basic Storage is just a Has(). WritableStorage adds a Put(), ReadableStorage adds a Get(), and then there's streaming and batch versions that are intended to only be used via feature detection, not hard-wiring interfaces to require them unless strictly necessary.

I recall splitting the store interface like that myself when defining it in TS (but I guess it was undone later on): https://github.com/ipfs/js-ipfs/blame/34e14927f7b569d827426e8c269ce77e2a2ceba6/packages/ipfs-core-types/src/store.ts#L21-L61

Generally I find splitting interfaces by the capabilities required to be a good rule of thumb that leads to better separation of concerns.

Gozala commented 2 years ago

I have put a bit more thought into this, and I think this will speak to the question above. Here are the specific wants as I see them:

  1. Want the importer to spit out blocks as fast as it can.
  2. Want the importer to apply back-pressure when blocks cannot be consumed fast enough.
  3. Want the importer to be decoupled from the block consumer (blockstore, CAR writer, maybe more things in the future).
  4. Want the importer to assemble the DAG without having to wait on block consumption.

    As far as I can tell, this is not the case now: the flushing logic will not emit a node until its blocks are written into the blockstore. This probably makes sense for IPFS; for a CAR writer, probably not so much.

  5. Want congestion in block consumption not to result in excessive memory usage. (This may seem to contradict 4, but not exactly: what I mean is that we need a configurable buffer, so that things keep moving along as long as the buffer isn't overflowing; see the sketch after this list.)
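To make want 5 concrete, WHATWG streams already provide this kind of bounded buffering via queuing strategies. Purely as an illustration:

import type { CID } from 'multiformats/cid'

interface Block { cid: CID, bytes: Uint8Array }

// A bounded block queue: once more than `highWaterMark` blocks are
// buffered, writer.write() resolves only as the consumer drains,
// which is exactly the back-pressure described above.
const { readable, writable } = new TransformStream<Block, Block>(
  {},
  new CountQueuingStrategy({ highWaterMark: 16 })
)

The importer would write blocks into the writable side as it produces them, while the consumer reads from the readable side at its own pace.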

The current API fails to meet some of those goals because its output emits FS entries and produces no output for the blocks. Instead it stores blocks as a side effect, and does so in a way that writing a block holds up FS entry creation. Furthermore, because it parallelizes things across files, it becomes really difficult to coordinate memory use (or so it seems).

I think what we want here is, instead of storing blocks as a side effect, to make them part of the output. That way another actor could flush them into a blockstore or a CAR file, with the flexibility to do it in batches, sequentially, etc. If blocks aren't consumed from the output, that would halt the importer; ideally the API would provide some form of queuing strategy so things can move along until the queue is full, then halt until there is more space.

To accomplish that, I think we could make the importer return two separate outputs, {entries, blocks}, so that blocks and entries can be consumed concurrently (specifically, an entry could be consumed before all the blocks from the previous one have been).
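For illustration, consuming the two outputs concurrently might look like the following (createImporter and its output types are hypothetical, hence the declare placeholders):

import type { CID } from 'multiformats/cid'

interface Block { cid: CID, bytes: Uint8Array }

// hypothetical API returning two linked async iterables
declare function createImporter (source: AsyncIterable<unknown>): {
  entries: AsyncIterable<{ path: string }>
  blocks: AsyncIterable<Block>
}

declare const source: AsyncIterable<unknown>
declare const blockstore: { put: (cid: CID, bytes: Uint8Array) => Promise<unknown> }

const { entries, blocks } = createImporter(source)

// consume both outputs concurrently so neither side stalls the other
await Promise.all([
  (async () => {
    for await (const { cid, bytes } of blocks) {
      await blockstore.put(cid, bytes) // or append to a CAR file
    }
  })(),
  (async () => {
    for await (const entry of entries) {
      console.log('imported', entry.path)
    }
  })()
])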

Here are a couple of options I can see, with pros and cons:

Take a WritableStreamWriter to write into

Return a pair of ReadableStreams

Stick to async iterables

Custom API

This was what I was suggesting here:

It may be a good idea to untangle DAG assembly from importing. E.g. maybe we could refactor the API such that, instead of writing blocks and emitting unixfs entries, it would emit entries that carry their own block iterators, so they could be written into the blockstore as needed.

However, after thinking it through against the outlined wants, it is clear that, as proposed, it would not be a good API. Specifically, an FS entry would either need to be eager, producing blocks and holding them in memory until GC-ed (which fails one of our wants), or it would have to be lazy, in which case obtaining the node for that entry would be awkward, as it would require consuming the blocks first. Furthermore, it entangles the block consumer with the entry consumer, which does not seem ideal either.

This should answer @mikeal's question (quoted below), and I no longer think it is a good approach.

would the generator that consumes the main iterator have to consume the block iterators for the main iterator to continue?

Writer API

An argument could be made that providing a BlockWriter API isn't that different from providing a WritableStreamWriter; better yet, it would not require polyfills etc. Yet, like @mikeal, I also tend to be biased towards an API that gives you blocks as opposed to writing them for you.

As a side note, it may be a good idea to align the BlockWriter with the WritableStreamWriter API, which would imply renaming put to write; that in turn would make WritableStreamWriter<Block> a valid BlockWriter, as sketched below.
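A sketch of that alignment (BlockWriter is hypothetical; the platform type is actually called WritableStreamDefaultWriter):

import type { CID } from 'multiformats/cid'

interface Block { cid: CID, bytes: Uint8Array }

// If BlockWriter used `write` instead of `put`...
interface BlockWriter {
  write: (block: Block) => Promise<void>
}

declare const blockstore: { put: (cid: CID, bytes: Uint8Array) => Promise<unknown> }

// ...then the writer of any WritableStream<Block> satisfies it
// structurally, with no adapter needed
const stream = new WritableStream<Block>({
  async write ({ cid, bytes }) {
    await blockstore.put(cid, bytes)
  }
})
const writer: BlockWriter = stream.getWriter()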

achingbrain commented 1 year ago

Closing this, as the importer now only requires the put method and the exporter only get.

https://github.com/ipld/js-unixfs also now exists, which is closer to using web streams.
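Under that final API, supplying a minimal in-memory put should be enough to run the importer. A sketch based on the comment above; check the current ipfs-unixfs-importer docs for the exact Blockstore type it expects:

import { importer } from 'ipfs-unixfs-importer'
import type { CID } from 'multiformats/cid'

// Minimal blockstore: per the closing comment, the importer only calls `put`
const blocks = new Map<string, Uint8Array>()
const blockstore = {
  async put (cid: CID, bytes: Uint8Array): Promise<CID> {
    blocks.set(cid.toString(), bytes)
    return cid
  }
}

for await (const entry of importer(
  [{ path: 'hello.txt', content: new TextEncoder().encode('hello') }],
  blockstore
)) {
  console.log(entry.path, entry.cid.toString())
}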