Proposal: Multiblock encoder interface

Gozala commented 2 years ago

We're running into more and more cases where the BlockEncoder interface just does not fit the bill:

With IPNFT, which is geared towards NFTs, we've discovered that NFT metadata can easily exceed 1 MiB in size, which would hinder our ability to serve such blocks through gateways, etc.
With the new UnixFS code we basically want to pass a file and get back a set of blocks with a root.
Now with UCANs we want to pass an auth chain and produce a block per link in the chain.

I am sure I'm forgetting some, and we are likely to encounter more use cases where we want to turn some input into a DAG represented by many blocks. This is why I would like to propose adopting the following interfaces:

export interface SyncDAGEncoder<Code extends number = number, T extends unknown = unknown> {
  encoder(data: T): IterableIterator<{ code: Code, bytes: Uint8Array }>
}

export interface AsyncDAGEncoder<Code extends number = number, T extends unknown = unknown> {
  encoder(data: T): AsyncIterableIterator<{ code: Code, bytes: Uint8Array }>
}

export type DAGEncoder<Code extends number = number, T extends unknown = unknown> =
  | SyncDAGEncoder<Code, T>
  | AsyncDAGEncoder<Code, T>

The last block would be the DAG root block (which is natural due to hash linking: a parent can only be encoded once the hashes of its children are known).

Such interfaces would cover all of the above use cases. Additionally, we could make all our block codecs implement these interfaces too, making them compatible.
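To make the proposed shape concrete, here is a minimal sketch of an encoder that splits a byte array into fixed-size leaf blocks and emits a root block last. It is only an illustration under stated assumptions: the `chunked` helper, the JSON layout of the root, and the use of hex SHA-256 digests in place of real CIDs are all hypothetical and not part of js-multiformats.

```typescript
import { createHash } from "node:crypto"

// Same shape as the proposed SyncDAGEncoder interface.
interface SyncDAGEncoder<Code extends number = number, T = unknown> {
  encoder(data: T): IterableIterator<{ code: Code, bytes: Uint8Array }>
}

const RAW = 0x55 // multicodec code for raw bytes
const JSON_CODE = 0x0200 // multicodec code for json

// Hypothetical helper: builds an encoder that chunks its input.
const chunked = (chunkSize: number): SyncDAGEncoder<number, Uint8Array> => ({
  *encoder(data) {
    // Assumption: hex sha256 digests stand in for real CIDs.
    const links: string[] = []
    for (let offset = 0; offset < data.byteLength; offset += chunkSize) {
      const bytes = data.subarray(offset, offset + chunkSize)
      links.push(createHash("sha256").update(bytes).digest("hex"))
      yield { code: RAW, bytes }
    }
    // The root can only be encoded after every leaf has been hashed,
    // which is why it comes out last.
    const root = new TextEncoder().encode(JSON.stringify({ links }))
    yield { code: JSON_CODE, bytes: root }
  }
})

const input = Uint8Array.from([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
const blocks = Array.from(chunked(4).encoder(input))
console.log(blocks.length) // 4
```

Note that the encoder has to pick a hash function internally in order to build the root, even though the emitted blocks only carry `code` and `bytes`.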

Gozala commented 2 years ago

I'm realizing now that the API proposed above is not great and pretty much will never be sync, because CIDs need to be computed in order to build a DAG. I think it would make more sense to represent DAGs with clearly denoted block boundaries; however, it would be difficult to generalize this, and maybe it would be best not to. Maybe the encoder interface could instead be expanded to allow recognizing what needs to be linked, e.g.

interface DAGIterator<T extends unknown = unknown> {
  iterate<U>(data: T): IterableIterator<{ encoder: BlockEncoder<number, U>, data: U }>
}

Such a thing could be used to:

Pass in the value that needs encoding
Iterate over the parts that need to be broken out
After all parts are encoded, continue with encoding the actual value, substituting all the parts with the corresponding links
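The three steps above could be sketched roughly as follows. This is a hedged illustration, not a proposed implementation: `BlockEncoder` here is a local type modeled on the js-multiformats interface, while `Doc`, `iterate`, `encodeDAG`, the `algorithm:hex` link format, and the `raw` and `json` encoders are hypothetical stand-ins. For simplicity the method-level generic `U` is fixed to `Uint8Array`.

```typescript
import { createHash } from "node:crypto"

// Local type modeled on BlockEncoder from js-multiformats.
interface BlockEncoder<Code extends number, T> {
  name: string
  code: Code
  encode(data: T): Uint8Array
}

// Hypothetical minimal encoders, for illustration only.
const raw: BlockEncoder<number, Uint8Array> = {
  name: "raw", code: 0x55, encode: bytes => bytes
}
const json: BlockEncoder<number, unknown> = {
  name: "json", code: 0x0200,
  encode: value => new TextEncoder().encode(JSON.stringify(value))
}

// Hypothetical value whose attachments should become separate blocks.
interface Doc { title: string, attachments: Uint8Array[] }

// Step 2: name the parts that need to be broken out, with their encoders.
function* iterate(doc: Doc): IterableIterator<{ encoder: BlockEncoder<number, Uint8Array>, data: Uint8Array }> {
  for (const data of doc.attachments) yield { encoder: raw, data }
}

// Steps 1 and 3: the caller picks the hash algorithm, so the links in the
// root and the identifiers of the child blocks always agree.
function encodeDAG(doc: Doc, algorithm: string) {
  const blocks: { code: number, bytes: Uint8Array }[] = []
  const links: string[] = []
  for (const { encoder, data } of iterate(doc)) {
    const bytes = encoder.encode(data)
    // Assumption: "algorithm:hex" strings stand in for real CIDs.
    links.push(`${algorithm}:${createHash(algorithm).update(bytes).digest("hex")}`)
    blocks.push({ code: encoder.code, bytes })
  }
  // Encode the actual value with its parts replaced by links; root goes last.
  const root = json.encode({ title: doc.title, attachments: links })
  blocks.push({ code: json.code, bytes: root })
  return blocks
}

const encoded = encodeDAG(
  { title: "demo", attachments: [Uint8Array.from([1, 2]), Uint8Array.from([3])] },
  "sha256"
)
console.log(encoded.length) // 3
```

The point of the design is that hashing moves out of the codec and into the driver, which is the piece that knows which hash function the resulting DAG should use.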

rvagg commented 2 years ago

@Gozala I'm not quite following you on the last comment there; is it the order that's a problem? I get that that sync API is a problem, but beyond that why are you wanting to have an iterator of encoders? I don't quite see what problem that's solving.

Also what is U in your iterate() generic?

Also 2 .. it's kind of amusing to see you here, and in dag-ucan, essentially having to re-invent the whole ADL concept after we went through the dramas of disagreements in the IPLD team re their utility. I really think it'd be worth taking another look at whether there's a path to doing something sensible in JS on this front, and perhaps what you're getting at here is part of that (in Go, the write-side of ADLs are the least mature part, there's some messy mechanics and plumbing). I started tinkering with a new JS stack to try and better encompass these ideas a while back but it's been another one of those projects that get lost in the too-many-more-important-things-to-do rush.

Gozala commented 2 years ago

> @Gozala I'm not quite following you on the last comment there; is it the order that's a problem? I get that that sync API is a problem, but beyond that why are you wanting to have an iterator of encoders? I don't quite see what problem that's solving.

The problem is that SyncDAGEncoder / AsyncDAGEncoder only emitted { code, bytes }, and a thing consuming it may generate different CIDs (due to a different hashing algorithm) than the ones that the parent node will use to reference its children.
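A small sketch of that mismatch, using hex digests from Node's `crypto` module as a stand-in for CIDs (the `identify` helper and the choice of algorithms are assumptions made for illustration):

```typescript
import { createHash } from "node:crypto"

// Hypothetical helper: derives an identifier for a block from its bytes.
const identify = (algorithm: string, bytes: Uint8Array): string =>
  createHash(algorithm).update(bytes).digest("hex")

// A child block as an encoder would emit it: no CID, just code and bytes.
const child = { code: 0x55, bytes: new TextEncoder().encode("child block") }

// The encoder hashed the child with sha256 when it built the parent's link...
const linkInParent = identify("sha256", child.bytes)

// ...but a consumer that only sees { code, bytes } may pick sha512 instead.
const idFromConsumer = identify("sha512", child.bytes)

// The parent's link no longer resolves to the block the consumer stored.
console.log(linkInParent === idFromConsumer) // false
```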

> Also what is U in your iterate() generic?

Yeah, it is a generic parameter: it basically says that the type of the data field is the same as the data type of the encoder, as per

https://github.com/multiformats/js-multiformats/blob/9bcd7fef62888d7cefe8e4f5e929d4e3c9dadda9/src/codecs/interface.ts#L4-L8

> Also 2 .. it's kind of amusing to see you here, and in dag-ucan, essentially having to re-invent the whole ADL concept after we went through the dramas of disagreements in the IPLD team re their utility. I really think it'd be worth taking another look at whether there's a path to doing something sensible in JS on this front, and perhaps what you're getting at here is part of that (in Go, the write-side of ADLs are the least mature part, there's some messy mechanics and plumbing). I started tinkering with a new JS stack to try and better encompass these ideas a while back but it's been another one of those projects that get lost in the too-many-more-important-things-to-do rush.

Happy to amuse :P More seriously, we really need a way to represent things that span multiple blocks that can be packed into CAR(s) in a generic way. It does sound like ADLs, but then again they don't seem to have a very concrete definition.

multiformats / js-multiformats

Proposal: Multiblock encoder interface #175