Closed achingbrain closed 5 years ago
cc @ipfs/javascript-team
When entry is a directory, does content() get the files in the directory?
I haven't come across a situation where I needed that, but I can see how it'd be useful, will add that in.
offset and length in that case might mean slices of files?
Maybe they should be passed into the content() function?
E.g.
File:
const exporter = require('ipfs-unixfs-exporter')
for await (const entry of exporter('QmFoo.../bar/baz.txt', ipld)) {
for await (const content of entry.content({
offset: 0,
length: 10
})) {
// content is a buffer with the first 10 bytes of `baz.txt`
}
}
Dir:
const exporter = require('ipfs-unixfs-exporter')
for await (const entry of exporter('QmFoo.../bar', ipld)) {
for await (const child of entry.content({
offset: 0,
length: 10
})) {
// child will be one of (up to) the first 10 files in the directory
}
}
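To make the two meanings of offset and length concrete, here is a minimal sketch that does not touch the real exporter. `sliceBytes`, `sliceEntries` and `collect` are hypothetical helpers, and the in-memory arrays are assumed stand-ins for a file's chunks and a directory's links:

```javascript
// Hypothetical helpers, not part of ipfs-unixfs-exporter.

// For a file: yield only the bytes in [offset, offset + length),
// even when the file is stored as several chunks.
async function * sliceBytes (chunks, { offset = 0, length = Infinity } = {}) {
  let position = 0 // absolute position of the start of the current chunk
  const end = offset + length

  for await (const chunk of chunks) {
    const chunkStart = position
    const chunkEnd = position + chunk.length
    position = chunkEnd

    if (chunkEnd <= offset) continue // entirely before the requested range
    if (chunkStart >= end) break // entirely after the requested range

    yield chunk.slice(
      Math.max(offset - chunkStart, 0),
      Math.min(end - chunkStart, chunk.length)
    )
  }
}

// For a directory: the same options select a window of entries instead.
async function * sliceEntries (entries, { offset = 0, length = Infinity } = {}) {
  let index = 0

  for await (const entry of entries) {
    if (index >= offset + length) break
    if (index >= offset) yield entry
    index++
  }
}

// Drain any (async) iterable into an array.
async function collect (iterable) {
  const items = []
  for await (const item of iterable) items.push(item)
  return items
}
```

Both helpers take the same options object, which is the appeal of passing offset and length to content(): the caller does not need to know up front whether the entry is a file or a directory.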
When baz is a directory
What happens if the maxDepth is greater than the depth?
exporter('/ipfs/Qmfoo/bar/baz', {
maxDepth: 3
})
Will it emit baz, or the files contained in baz?
If it emits baz, then you could use maxDepth: Infinity to always emit baz.
Sharded directories are treated like normal directories. That is, they have a type of directory.
Just to ensure I got this right: the idea is to hide HAMT structures and always expose a flattened one, indistinguishable from a regular directory, right? ("They came here for the files, just give them the files")
What happens if the maxDepth is greater than the depth?
The intention behind maxDepth is to allow halting the graph traversal early. My thinking is if maxDepth is greater than the possible depth, it'll just do what it would have done if you hadn't specified maxDepth, which is the same as the current behaviour. From your example it would emit the files contained in baz.
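A small sketch of that reasoning, using a toy in-memory tree rather than the exporter. The `tree`, `walk` and `names` identifiers are hypothetical, and depth is counted from the root of the toy tree, which is an assumption and not necessarily how the module counts it:

```javascript
// Hypothetical in-memory tree standing in for /ipfs/Qmfoo/bar/baz.
const tree = {
  name: 'baz',
  children: [
    { name: 'a.txt' },
    { name: 'sub', children: [{ name: 'b.txt' }] }
  ]
}

// Depth-first walk that stops descending once `maxDepth` is reached.
function * walk (node, { maxDepth = Infinity } = {}, depth = 0) {
  yield { name: node.name, depth }

  if (!node.children || depth >= maxDepth) return

  for (const child of node.children) {
    yield * walk(child, { maxDepth }, depth + 1)
  }
}

// Convenience wrapper: the names emitted for a given set of options.
const names = (options) => [...walk(tree, options)].map(entry => entry.name)
```

Here `names({ maxDepth: 10 })` and `names()` emit the same entries because the tree runs out before the limit is reached, while `names({ maxDepth: 1 })` stops before descending into `sub`.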
The idea is to hide HAMT structures and always expose flattened one, indistinguishable from regular directory, right?
Yes, I just meant that the user wouldn't have to do anything different to extract files from a sharded directory than from a normal one, similar to what it does now, see dir-hamt-sharded.js and dir-flat.js.
If people think it's useful we can include a sharded flag or similar, alternatively you can do ipfs.files.stat(path) and it'll come back with a type of hamt-sharded-directory (works for both MFS and IPFS paths).
Actually, you also have the node property. If it's a dag-pb node, its data can be passed through UnixFS.unmarshal(buf), which will tell you whether it's a sharded directory or not.
I guess we could include the unmarshaled UnixFS entry too, since we've unmarshaled it during our traversal already?
I guess actually if we include the unmarshaled UnixFS entry we can omit some of the other fields like type and size.
Is maxDepth necessary? You can just break out of the loop after the nth iteration, right?
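For what it's worth, breaking out of a `for await` loop does stop an async generator from doing further work, because the loop calls the iterator's `return()` on exit. A minimal sketch with a hypothetical `traverse` generator standing in for the graph traversal:

```javascript
// Hypothetical traversal: counts how many nodes were actually visited.
let visited = 0
let cleanedUp = false

async function * traverse (total) {
  try {
    for (let i = 0; i < total; i++) {
      visited++
      yield { depth: i }
    }
  } finally {
    // Runs when the consumer breaks out of the loop early.
    cleanedUp = true
  }
}

// Consume only the first n entries, then stop iterating.
async function takeFirst (n) {
  const entries = []
  for await (const entry of traverse(1000)) {
    entries.push(entry)
    if (entries.length === n) break // no further nodes are requested
  }
  return entries
}
```

After `takeFirst(3)`, only three of the 1000 nodes have been visited and the generator's `finally` block has run, so the consumer can halt the traversal without a dedicated option.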
Good point. Thinking a bit more about this:
{
depth: 2,
name: 'baz.txt',
path: 'QmFoo.../bar/baz.txt',
cid: CID,
node: DAGNode (cid.codec === 'dag-pb') || Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw'),
content: Function, // returns an async iterable of file or directory contents
entry: UnixFS // or null if cid.codec === 'raw' https://github.com/ipfs/js-ipfs-unixfs
}
const exporter = require('ipfs-unixfs-exporter')
const result = await exporter('/ipfs/Qmfoo/bar', ipld)
console.info(`Exported a ${result.entry.type}`)
for await (const content of result.content({ offset, length })) {
// if `entry` was a file, `content` is a buffer, if it was a directory, `content` is an FSEntry
}
const exporter = require('ipfs-unixfs-exporter')
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {
// ...
}
const exporter = require('ipfs-unixfs-exporter')
for await (const fsEntry of exporter('/ipfs/Qmfoo/bar/**/*.js', ipld)) {
// ...
}
const exporter = require('ipfs-unixfs-exporter')
for await (const fsEntry of exporter('/ipfs/Qmfoo/bar', {
query: 'some search string'
}, ipld)) {
// ... can be combined with globs, above
}
I like symmetric/consistent APIs. Having
const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}
but on the other hand
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}
seems strange. Without thinking about how to implement it, I'd rather do either
const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path()) {}
or
for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}
or something similar. I guess you get the point.
I think the second option is ok:
for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}
This will be expensive as you'd have to do the traversal twice, once to return the initial result and once to return the path, or you store it in memory as you go which doesn't sound great either:
const result = await exporter('/ipfs/Qmfoo/bar', ipld)
for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path()) {}
I think you'd end up doing both:
const result = await exporter('/ipfs/Qmfoo/bar', ipld) // allows inspection of `result` properties
for await (const content of result.content({ offset, length })) {}
// shortcut for the above, just give me the content!
for await (const content of exporter.content('/ipfs/Qmfoo/bar', ipld, { offset, length })) {}
// also this, i want to know how i got to '/ipfs/Qmfoo/bar'
for await (const fsEntry of result.path('/ipfs/Qmfoo/bar', ipld)) {}
This will be expensive as you'd have to do the traversal twice, once to return the initial result and once to return the path, or you store it in memory as you go which doesn't sound great either:
That's not what I meant. This shouldn't be read as one program, but as two cases. I want either the content or the path.
// also this, i want to know how i got to '/ipfs/Qmfoo/bar'
for await (const fsEntry of result.path('/ipfs/Qmfoo/bar', ipld)) {}
Wouldn't this be just result.path()?
FSEntry
{
  depth: 2,
  name: 'baz.txt',
  path: 'QmFoo.../bar/baz.txt',
  cid: CID,
  node: DAGNode (cid.codec === 'dag-pb') || Object (cid.codec === 'dag-cbor') || Buffer (cid.codec === 'raw'),
  content: Function, // returns an async iterable of file or directory contents
  entry: UnixFS // or null if cid.codec === 'raw' https://github.com/ipfs/js-ipfs-unixfs
}
API
Export a node
const exporter = require('ipfs-unixfs-exporter')
const result = await exporter('/ipfs/Qmfoo/bar', ipld)
console.info(`Exported a ${result.entry.type}`)
for await (const content of result.content({ offset, length })) {
  // if `entry` was a file, `content` is a buffer, if it was a directory, `content` is an FSEntry
}
Export all nodes on the path to a node
const exporter = require('ipfs-unixfs-exporter')
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {
  // ...
}
I really like all of this.
Would the exporter.content function provide any significant performance benefit (or any other benefit) other than saving 1 line of code? If not I'd probably drop it.
My only worry is your future tech section, where exporter() is returning an iterable not a promise - how would we handle that in the future? Right now you'll always get one entry, so it makes sense to return a promise not an iterable.
@vmx
Wouldn't this be just result.path()?
Yeah, I think so. I'm assuming that's a copy-paste error. However, Alex's point on it being expensive still stands.
This shouldn't be read as one program, but as two cases
Wouldn't this be just result.path()?
Yes, I also meant that as two cases and not one program. I think we're on the same page here.
I don't understand the expensiveness problem of
for await (const content of result.content({ offset, length })) {}
for await (const fsEntry of result.path()) {}
Couldn't the exporter be lazy and just initialise without doing any actual work?
Calling .content() on an FSEntry is cheap as we've already resolved the FSEntry's CID during the initial traversal and use that to export the content.
Calling .path() on an FSEntry will mean either making the traversal again or caching it in memory just in case the user calls .path().
I think if we put .path() on FSEntry it encourages the user to call it in a loop which may result in surprise when their program does not perform as expected.
For that reason I think we should stick .path() on the exporter root only:
for await (const fsEntry of exporter.path('/ipfs/Qmfoo/bar', ipld)) {}
They can obviously still call it in a loop, but it's not a natural thing to do as it doesn't get given to you by an iterable.
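To illustrate why the path traversal has a cost of its own, here is a rough sketch of resolving each path component in turn. Everything in it is an assumption for illustration: `nodes` is an in-memory stand-in for the DAG, `getNode` stands in for fetching a block, and `resolvePath` is a hypothetical name, not the module's implementation.

```javascript
// Hypothetical in-memory stand-in for a DAG: each node lists its named links.
const nodes = {
  Qmfoo: { links: { bar: 'Qmbar' } },
  Qmbar: { links: { 'baz.txt': 'Qmbaz' } },
  Qmbaz: { links: {} }
}

let lookups = 0

// Stand-in for fetching a node by CID, which is the expensive part.
async function getNode (cid) {
  lookups++
  return nodes[cid]
}

// Yields one entry per path component, resolving each link in turn.
async function * resolvePath (path) {
  const [rootCid, ...names] = path.replace(/^\/ipfs\//, '').split('/')
  let cid = rootCid
  let currentPath = rootCid

  yield { depth: 0, name: rootCid, path: currentPath, cid }

  for (let i = 0; i < names.length; i++) {
    const node = await getNode(cid)
    cid = node.links[names[i]]
    if (!cid) throw new Error(`No link named ${names[i]} in ${currentPath}`)
    currentPath = `${currentPath}/${names[i]}`
    yield { depth: i + 1, name: names[i], path: currentPath, cid }
  }
}
```

Each full iteration of `resolvePath` performs one lookup per path component after the root, so calling it repeatedly (for example once per entry in a loop) repeats those lookups every time unless the results are cached.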
Would the exporter.content function provide any significant performance benefit?
No, it's just syntactic sugar.
My only worry is your future tech section, where exporter() is returning an iterable not a promise - how would we handle that in the future?
I was thinking to look at the string and return a promise that resolved to an iterable if a wildcard character was seen or a search query was passed.
We might end up having them as proper methods on the exporter root instead. The other solution would be making it be an iterator all the time which is kind of what we have now (in that it's always a pull stream) which doesn't feel very ergonomic.
Anyway, it's all future tech, we can solve that problem on a separate issue.
@achingbrain thanks for making things more explicit. At https://github.com/ipfs/js-ipfs-unixfs-exporter/issues/7#issuecomment-443226427 I meant that exporter('/ipfs/Qmfoo/bar', ipld) wouldn't return an FSEntry, but some "Exporter object". You would then call either content() or path() on it.
This has been implemented.
Proposal
To change the return type and codify the behaviour of this module.
Current API
Observations
Proposal
Options
These would remain the same.
Sharding
Sharded directories are treated like normal directories. That is, they have a type of directory.
Behaviour
When requesting a path: /ipfs/Qmfoo/bar/baz, if baz resolves to a directory, the contents of the directory will be emitted, otherwise baz itself will be emitted.
If maxDepth is specified and matches the zero-indexed number of path components (excluding /ipfs), baz will be emitted even if it is a directory.
If fullPath is specified, each path component encountered will be emitted (excluding /ipfs).
Examples
When baz is a directory
When baz is a file