API review - Githubissues

achingbrain commented 5 years ago

Proposal

To change the return type and codify the behaviour of this module.

Current API

const exporter = require('ipfs-unixfs-exporter')
const pull = require('pull-stream')
const through = require('pull-stream/throughs/through')
const onEnd = require('pull-stream/sinks/on-end')

pull(
  exporter('QmFoo.../bar/baz.txt', ipld)
  through(file => {
    console.info(file)

  // {
  //   depth: 2,
  //   name: 'baz.txt',
  //   path: 'QmFoo.../bar/baz.txt',
  //   size: ...
  //   hash: Buffer
  //   content: <Pull stream>
  //   type: 'file'
  // }

  }),
  onEnd(...)
)

Observations

If the user wants any extra information about the node, they must turn the hash into a CID then pass it to an IPLD resolver, but we already had the CID and DAGNodes during the graph traversal so it's just unnecessary work
Using external libraries dictates how the user should structure their code and their choice of dependencies. They came here for the files, just give them the files

Proposal

Emit entries with object versions of CIDs and nodes
Use core JS types, e.g. async iterators

const exporter = require('ipfs-unixfs-exporter')

for await (const entry of exporter('QmFoo.../bar/baz.txt', ipld)) {
  console.info(entry)

  // {
  //   depth: 2,
  //   name: 'baz.txt',
  //   path: 'QmFoo.../bar/baz.txt',
  //   cid: CID
  //   node: DAGNode (cid.codec === 'dag-pb') ||  Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw')
  //   content: Function // returns an async iterable of file or directory contents
  //   type: 'file' // or 'directory'
  // }

  for await (const content of entry.content()) {
    // content is either a buffer (when entry.type === 'file') or an entry (when entry.type === 'directory')
  }
}

Options

{
  offset: Number,  // If a file is at the end of the path `.content()` will start at this offset
  length: Number, // If a file is at the end of the path `.content()` will only return this many bytes
  fullPath: Boolean, // Emit entries for every path component
  maxDepth: Boolean // Only descend this number of components into the path
}

These would remain the same.

Sharding

Sharded directories are treated like normal directories. That is, they have a type of directory.

Behaviour

When requesting a path: /ipfs/Qmfoo/bar/baz, if baz resolves to a directory, the contents of the directory will be emitted, otherwise baz itself will be emitted.

If maxDepth is specified and matches the zero-indexed number of path components (excluding /ipfs), baz will be emitted even if it is a directory.

If fullPath is specified, each path component encountered will be emitted (excluding /ipfs).

Examples

When `baz` is a directory

exporter('/ipfs/Qmfoo/bar/baz')

// emits the files contained in `baz`

exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true
})

// emits `Qmfoo`, `bar`, `baz`, then the files contained in `baz`

exporter('/ipfs/Qmfoo/bar/baz', {
  maxDepth: 2
})

// emits`baz`

exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true,
  maxDepth: 2
})

// emits `Qmfoo`, `bar`, `baz`

When `baz` is a file

exporter('/ipfs/Qmfoo/bar/baz')

// emits `baz`

exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true
})

// emits `Qmfoo`, `bar`, `baz`

exporter('/ipfs/Qmfoo/bar/baz', {
  maxDepth: 2
})

// emits`baz`

exporter('/ipfs/Qmfoo/bar/baz', {
  fullPath: true,
  maxDepth: 2
})

// emits `Qmfoo`, `bar`, `baz`

achingbrain commented 5 years ago

cc @ipfs/javascript-team

alanshaw commented 5 years ago

When entry is a directory, does content() get the files in the directory?

achingbrain commented 5 years ago

I haven't come across a situation where I needed that, but I can see how it'd be useful, will add that in.

achingbrain commented 5 years ago

offset and length in that case might mean slices of files?

Maybe they should be passed into the content() function?

E.g.

File:

const exporter = require('ipfs-unixfs-exporter')

for await (const entry of exporter('QmFoo.../bar/baz.txt', ipld)) {
  for await (const content of entry.content({
    offset: 0,
    length: 10
  })) {
    // content is a buffer with the first 10 bytes of `baz.txt`
  }
}

Dir:

const exporter = require('ipfs-unixfs-exporter')

for await (const entry of exporter('QmFoo.../bar', ipld)) {
  for await (const entry of entry.content({
    offset: 0,
    length: 10
  })) {
    // entry will be one of (up to) the first 10 files in the directory 
  }
}

vmx commented 5 years ago

When baz is a directory

What happens if the maxDepth is greater than the depth?

exporter('/ipfs/Qmfoo/bar/baz', {
  maxDepth: 3
})

Will it "emit baz" or "the files contained in baz"?

If it "emit baz"s, then you could use maxDepth: Infinity to always "emit baz".

lidel commented 5 years ago

Sharded directories are treated like normal directories. That is, they have a type of directory.

Just to ensure I got this right: the idea is to hide HAMT structures and always expose flattened one, indistinguishable from regular directory, right? ("They came here for the files, just give them the files")

achingbrain commented 5 years ago

What happens if the maxDepth is greater than the depth?

The intention behind maxDepth is to allow halting the graph traversal early. My thinking is if maxDepth is greater than the possible depth, it'll just do what it would have done if you hadn't specified maxDepth, which is the same as the current behaviour. From your example it would emit the files contained in baz.

The idea is to hide HAMT structures and always expose flattened one, indistinguishable from regular directory, right?

Yes, I just meant that the user wouldn't have to do anything different to extract files from a sharded directory than from a normal one, similar to what it does now, see dir-hamt-sharded.js and dir-flat.js.

If people think it's useful we can include a sharded flag or similar, alternatively you can do ipfs.files.stat(path) and it'll come back with a type of hamt-sharded-directory (works for both MFS and IPFS paths).

achingbrain commented 5 years ago

Actually, you also have the node property, which if it's a dag-pb node, the data from which can be passed through UnixFS.unmarshal(buf) and it'll tell you if it's a sharded directory or not.

I guess we could include the unmarshaled UnixFS entry too, since we've unmarshaled it during our traversal already?

achingbrain commented 5 years ago

I guess actually if we include the unmarshaled UnixFS entry we can omit some of the other fields like type and size 🤔

alanshaw commented 5 years ago

Is maxDepth necessary? You can just break out of the loop after the nth iteration right?

achingbrain commented 5 years ago

Good point. Thinking a bit more about this:

FSEntry

{
  depth: 2,
  name: 'baz.txt',
  path: 'QmFoo.../bar/baz.txt',
  cid: CID
  node: DAGNode (cid.codec === 'dag-pb') ||  Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw')
  content: Function // returns an async iterable of file or directory contents
  entry: UnixFS // or null if cid.codec === 'raw' https://github.com/ipfs/js-ipfs-unixfs
}

API

Export a node

const exporter = require('ipfs-unixfs-exporter')
const result = await exporter('/ipfs/Qmfoo/bar', ipld)

console.info(`Exported a ${result.entry.type}`)

for await (const content of result.content({ offset, length })) {
  // if `entry` was a file, `content` is a buffer, if it was a directory, `content` is an FSEntry
}

Export all nodes on the path to a node

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {
  // ...
}

🤖 Future tech zone

Wildcards?

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter('/ipfs/Qmfoo/bar/**/*.js', ipld)) {
  // ...
}

Search?

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter('/ipfs/Qmfoo/bar', {
  query: 'some search string'
}, ipld)) {
  // ... can be combined with globs, above
}

vmx commented 5 years ago

I like symmetric/consistent APIs. Having

const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}

but on the other hand

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

seems strange. Without thinking about how to implement it, I'd rather do either

const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path() {}

or

for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

or something similar. I guess you got the point.

achingbrain commented 5 years ago

I think the second option is ok:

for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

This will be expensive as you'd have to do the traversal twice, once to return the initial result and once to return the path, or you store it in memory as you go which doesn't sound great either:

const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path() {}

I think you'd end up doing both:


const result = await exporter('/ipfs/Qmfoo/bar', ipld) // allows inspection of `result` properties
for await (const content of result.content({ offset, length })) {}

// shortcut for the above, just give me the content!
for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}

// also this, i want to know how i got to '/ipfs/Qmfoo/bar'
for await (const fsEntry of result.path('/ipfs/Qmfoo/bar', ipld) {}

vmx commented 5 years ago

This will be expensive as you'd have to do the traversal twice, once to return the initial result and once to return the path, or you store it in memory as you go which doesn't sound great either:

That's not what I meant. This shouldn't be read as one program, but as two cases. I want either the content or the path.

// also this, i want to know how i got to '/ipfs/Qmfoo/bar'
for await (const fsEntry of result.path('/ipfs/Qmfoo/bar', ipld) {}

Wouldn't this be just result.path()?

alanshaw commented 5 years ago

FSEntry

{
  depth: 2,
  name: 'baz.txt',
  path: 'QmFoo.../bar/baz.txt',
  cid: CID
  node: DAGNode (cid.codec === 'dag-pb') ||  Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw')
  content: Function // returns an async iterable of file or directory contents
  entry: UnixFS // or null if cid.codec === 'raw' https://github.com/ipfs/js-ipfs-unixfs
}

API

Export a node

const exporter = require('ipfs-unixfs-exporter')
const result = await exporter('/ipfs/Qmfoo/bar', ipld)

console.info(`Exported a ${result.entry.type}`)

for await (const content of result.content({ offset, length })) {
  // if `entry` was a file, `content` is a buffer, if it was a directory, `content` is an FSEntry
}

Export all nodes on the path to a node

const exporter = require('ipfs-unixfs-exporter')

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {
  // ...
}

I really like all of this.

Would the exporter.content function provide any significant performance benefit (or any other benefit) other than saving 1 line of code? If not I'd probably drop it.

My only worry is your future tech section, where exporter() is returning an iterable not a promise - how would we handle that in the future? Right now you'll always get one entry, so it makes sense to return a promise not an iterable.

alanshaw commented 5 years ago

@vmx

Wouldn't this be just result.path()?

Yeah I think so I'm assuming that's a copy paste error. However Alex's point on it being expensive still stands.

achingbrain commented 5 years ago

This shouldn't be read as one program, but as two cases Wouldn't this be just result.path()?

Yes, I also meant that as two cases and not one program. I think we're on the same page here 😄

vmx commented 5 years ago

I don't understand the expensiveness problem of

for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path() {}

Couldn't the exporter be lazy and just initialise without doing any actual work?

achingbrain commented 5 years ago

Calling .content() on a FSEntry is cheap as we've already resolved the FSEntry's CID during the initial traversal and use that to export the content.

Calling .path() on an FSEntry will mean either making the traversal again or caching it in memory just in case the user calls .path().

I think if we put .path() on FSEntry it encourages the user to call it in a loop which may result in surprise when their program does not perform as expected.

For that reason I think we should stick .path() on the exporter root only:

for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}

They can obviously still call it in a loop, but it's not a natural thing to do as it doesn't get given to you by an iterable.

achingbrain commented 5 years ago

Would the exporter.content function provide any significant performance benefit?

No, it's just syntactic sugar.

My only worry is your future tech section, where exporter() is returning an iterable not a promise - how would we handle that in the future?

I was thinking to look at the string and return a promise that resolved to an iterable if a wildcard character was seen or a search query was passed.

We might end up having them as proper methods on the exporter root instead. The other solution would be making it be an iterator all the time which is kind of what we have now (in that it's always a pull stream) which doesn't feel very ergonomic.

Anyway, it's all future tech, we can solve that problem on a separate issue.

vmx commented 5 years ago

@achingbrain thanks for making things more explicit. At https://github.com/ipfs/js-ipfs-unixfs-exporter/issues/7#issuecomment-443226427 I meant that exporter('/ipfs/Qmfoo/bar', ipld) wouldn't return an FSEntry, but some "Exporter object". You would then call either content() or path() on it.

achingbrain commented 5 years ago

This has been implemented.

ipfs-inactive / js-ipfs-unixfs-exporter

API review #7

Proposal

Current API

Observations

Proposal

Options

Sharding

Behaviour

Examples

When `baz` is a directory

When `baz` is a file

FSEntry

API

Export a node

Export all nodes on the path to a node

🤖 Future tech zone

Wildcards?

Search?

FSEntry

API

Export a node

Export all nodes on the path to a node

ipfs-inactive / js-ipfs-unixfs-exporter

API review #7

Proposal

Current API

Observations

Proposal

Options

Sharding

Behaviour

Examples

When baz is a directory

When baz is a file

FSEntry

API

Export a node

Export all nodes on the path to a node

🤖 Future tech zone

Wildcards?

Search?

FSEntry

API

Export a node

Export all nodes on the path to a node

When `baz` is a directory

When `baz` is a file