unixfs spec missing

JustinDrake commented 7 years ago

The unixfs spec is completely missing: https://github.com/ipfs/specs/tree/master/unixfs

Can I find rough notes somewhere specifying unixfs?

daviddias commented 7 years ago

We are missing that spec. However, we have documentation for several pieces of Unixfs, namely:

The Data Structure - https://github.com/ipfs/js-ipfs-unixfs#usage
The Engine that handles multiple graph builders - https://github.com/ipfs/js-ipfs-unixfs-engine

Let me know if that helps :)

JustinDrake commented 7 years ago

Thanks @diasdavid.

Below are some specific questions I have:

The filesize of a link can easily be faked, right? What is the point of links having a filesize if that filesize cannot be trusted?
Can a file with a super long file name be sharded? (I understand there's sharding for large directories and large files.)
What are metadata links?
It seems there are two ways of declaring a block "raw". The first is with the unixfs raw type. The second is with the raw multicodec as part of the CIDv1. Which takes precedence? How are conflicts resolved?
In the case of a link, what does the data field hold?
Is DagCBOR implemented yet? Is there an IPFS flag to make use of it by default?
unixfs requires names to be unique. What happens if they are not?

daviddias commented 7 years ago

Excellent questions! Thank you, @JustinDrake :)

The filesize of a link can easily be faked, right? What is the point of links having a filesize if that filesize cannot be trusted?

Absolutely. It's a convenience for things like stats; if it is application-critical, it should be verified externally.

Can a file with a super long file name be sharded? (I understand there's sharding for large directories and large files.)

Not part of the spec.

What are metadata links?

Not currently used. Designed for things like permissions.

It seems there are two ways of declaring a block "raw". The first is with the unixfs raw type. The second is with the raw multicodec as part of the CIDv1. Which takes precedence? How are conflicts resolved?

Both exist in different realms. The Unixfs raw type is a unixfs protobuf with type raw, serialized and inserted into a dag-pb protobuf.

The IPLD raw type is really just any array of bytes.
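
To make the two "realms" concrete, here is a minimal Python sketch that hand-encodes both kinds of block for the same payload. It assumes the usual protobuf field numbers (unixfs `Type` = 1 and `Data` = 2, dag-pb `Data` = 1) and leaves out optional fields such as `filesize` that real implementations also write, so the bytes are illustrative rather than identical to what an IPFS node produces. The helper names are made up for this example.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def varint_field(field_number: int, value: int) -> bytes:
    """Encode a varint field (wire type 0)."""
    return encode_varint(field_number << 3) + encode_varint(value)


def bytes_field(field_number: int, payload: bytes) -> bytes:
    """Encode a length-delimited field (wire type 2)."""
    tag = (field_number << 3) | 2
    return encode_varint(tag) + encode_varint(len(payload)) + payload


UNIXFS_TYPE_RAW = 0  # value of Raw in the unixfs DataType enum


def unixfs_raw_block(payload: bytes) -> bytes:
    """unixfs message (Type=Raw, Data=payload) wrapped in a dag-pb node."""
    unixfs_msg = varint_field(1, UNIXFS_TYPE_RAW) + bytes_field(2, payload)
    return bytes_field(1, unixfs_msg)  # dag-pb node with Data only, no links


def ipld_raw_block(payload: bytes) -> bytes:
    """An IPLD raw block is the payload itself, with no framing."""
    return payload


payload = b"hello"
print(unixfs_raw_block(payload).hex())  # protobuf framing around the payload
print(ipld_raw_block(payload).hex())    # just the payload bytes
```

The first block only makes sense to a reader that knows it is dag-pb with a unixfs message inside; the second carries no structure at all, which is why the two notions of "raw" do not compete with each other.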

In the case of a link, what does the data field hold?

What do you mean by the case of a 'link'? A good way to understand the data structure is to add some files and directories and explore them using the ipfs object or ipfs dag API.

Is DagCBOR implemented yet? Is there an IPFS flag to make use of it by default?

ipld-dag-cbor is implemented: https://github.com/ipld/js-ipld-dag-cbor

Unixfs uses ipld-dag-pb. There is currently no plan to move it to ipld-dag-cbor.

unixfs requires names to be unique. What happens if they are not?

Per directory level. If there is a folder with 2 files using the same name, then that is an error. To make that happen you would have to manipulate the graphs directly.

JustinDrake commented 7 years ago

Thanks 👍 Some follow-up questions:

I don't understand the data field. Let's take an example with zdj7WYjg5Gek1VmesaAFnT7nzi15xhAYMt1yxBxDyQSNgG1gy. The dag/get endpoint returns CAE= for the data. The object/get endpoint returns \u0008\u0001. Why are the returned data fields different? What is the significance of CAE= and \u0008\u0001?
What happens if an IPLD CIDv1 of type raw points to an array of bytes which also happens to be a serialised unixfs dag-pb protobuf? Is that interpreted as a byte array, or a dag node? Because in the CIDv0 case (where the IPLD type is not specified) there is an inherent ambiguity here, right?
The unixfs protobuf has no versioning. Is unixfs meant to be upgradable?

Stebalien commented 7 years ago

CAE= is "\u0008\u0001" base64 encoded (otherwise known as [0x8, 0x1]). However, I'm not sure why we're using two different encoding schemes (IMO, both should return "\u0008\u0001" but there's probably a reason?).
In CIDv1, if the CID says it's a byte array, it will be interpreted as a byte array. In CIDv0, nodes are always interpreted as dag nodes. There are no "raw" CIDv0 nodes; there are just CIDv0 nodes with only a data field.
Protobufs generally don't use explicit versioning (as far as I'm aware). Instead, you just add more optional fields that only mean something to newer versions of the software. If you need to introduce a backwards-incompatible change, you can add a new datatype: that's how sharded directories (HAMTShard) were introduced.
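
For the first point, the two bytes can be decoded by hand. The sketch below base64-decodes `CAE=` and reads the result as a protobuf field, assuming the unixfs message starts with its `Type` field (field number 1, varint) and that the enum values are Raw=0, Directory=1, File=2, Metadata=3, Symlink=4, HAMTShard=5. The function names are made up for this example.

```python
import base64

# unixfs DataType enum values (assumed as listed in the lead-in)
UNIXFS_TYPES = {
    0: "Raw",
    1: "Directory",
    2: "File",
    3: "Metadata",
    4: "Symlink",
    5: "HAMTShard",
}


def decode_varint(buf: bytes, pos: int):
    """Decode a protobuf varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def unixfs_type(data: bytes) -> str:
    """Read the leading Type field of a serialized unixfs message."""
    tag, pos = decode_varint(data, 0)
    field_number, wire_type = tag >> 3, tag & 0x7
    if field_number != 1 or wire_type != 0:
        raise ValueError("message does not start with a varint Type field")
    value, _ = decode_varint(data, pos)
    return UNIXFS_TYPES.get(value, f"unknown({value})")


raw = base64.b64decode("CAE=")
print(list(raw))         # the same bytes that object/get shows as \u0008\u0001
print(unixfs_type(raw))  # the unixfs type those two bytes declare
```

Read this way, 0x08 is the tag for field 1 with wire type 0, and 0x01 is the enum value, so the data field of that node declares a unixfs directory and carries nothing else.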

JustinDrake commented 7 years ago

@Stebalien Cheers :) My unixfs understanding is starting to crystalise 👍

I understand there's HAMT sharding for large directories, and chunking for large files. What about very large files? For example, a 1TB file will need a million 1MB chunks, and so a million links. Those million links cannot fit in a single DAG node, so is there sharding there also? Or maybe the chunking happens in several steps, where first the 1TB file is broken into a thousand 1GB chunks, which are then broken down into 1MB chunks?
If I add a folder using unixfs with add -r --cid-version 1 can I confirm that the CIDs can only be of type DagProtobuf (0x70) or Raw (0x55)?
If not for unixfs, what is the intended immediate use case for DagCBOR?

Stebalien commented 7 years ago

(we = protocol labs, not necessarily me)

You can chunk files recursively however you want; IPFS even has different chunking strategies for different use-cases.
Currently, yes. As a matter of fact, go-ipfs assumes this. However, this generally shouldn't be assumed. The CID tells you how to interpret the raw data into an IPLD object, not how you should interpret the IPLD object.
This is too long to fit in a bullet point...

So, DagCBOR is the canonical IPLD format that can encode every IPLD object (arbitrary {"foo": "bar", "baz": Qm...}). DagProtobuf can only encode IPLD objects of the form { "data": bytes..., "links": [ {"name": ..., "size": ..., "link": Qm... } ] }. So, the real question is: why DagProtobuf?

The answer is that IPFS (and these DagProtobuf objects) came before IPLD. However, while building IPFS, we realized that DagProtobuf was hard to work with. To structure data, you have to serialize it and embed the structured data in the data field as an array of bytes. Worse, this structured data can't actually link to other objects because links must go in the links section (so the data section needs to reference links in the links section to actually link to other objects). So, we made IPLD to make storing structured data easier. Now, why did we keep DagProtobuf? The answer is simply backwards compatibility.
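
A small Python sketch of that awkwardness, under loud assumptions: the class and function names, the `link_index` convention, and the placeholder CID string are all invented for illustration and are not part of any IPFS library. Only the node shape (data plus a list of name/size/link entries) comes from the description above; the `{"/": cid}` form is the link notation used in IPLD's JSON representation.

```python
import json
from dataclasses import dataclass, field

PLACEHOLDER_CID = "QmExampleAuthorCid"  # not a real CID, illustration only


@dataclass
class PBLink:
    name: str
    size: int
    link: str  # CID of the target, kept as a string here


@dataclass
class PBNode:
    data: bytes = b""
    links: list = field(default_factory=list)


def to_dag_pb_shape(title: str, author_cid: str) -> PBNode:
    """Structured data must be serialized into `data`; the link lives in
    `links`, so `data` can only point at it indirectly (here, by index)."""
    links = [PBLink(name="author", size=0, link=author_cid)]
    record = {"title": title, "author": {"link_index": 0}}
    return PBNode(data=json.dumps(record).encode(), links=links)


def to_free_form(title: str, author_cid: str) -> dict:
    """With a format that can encode arbitrary objects, the link sits
    directly where it is used."""
    return {"title": title, "author": {"/": author_cid}}


node = to_dag_pb_shape("notes", PLACEHOLDER_CID)
print(node.data)                                # opaque bytes to the DAG layer
print(node.links[0].link)                       # the only real link
print(to_free_form("notes", PLACEHOLDER_CID))   # link inline with the data
```

In the first form a generic tool sees one opaque byte string and one link with no idea how they relate; in the second the relationship is visible in the object itself.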

So, what is the use-case for DagCBOR: building other applications (and, potentially, extending IPFS). It's what we would have used to build IPFS if we could start over.
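
Going back to the first point above about chunking files recursively, a rough Python sketch of the arithmetic for the 1TB example: it assumes a balanced tree, the 1MB chunk size from the question, and a fan-out of 174 links per node. All three are assumptions for illustration; actual chunk sizes, layouts, and link limits depend on the implementation and the options used.

```python
def dag_levels(file_size: int, chunk_size: int, max_links: int) -> list:
    """Return the node count at each level of a balanced file DAG,
    from the leaf chunks up to the single root."""
    def ceil_div(a: int, b: int) -> int:
        return -(-a // b)

    counts = [ceil_div(file_size, chunk_size)]  # leaf chunks
    while counts[-1] > 1:
        # each parent node can hold at most max_links children
        counts.append(ceil_div(counts[-1], max_links))
    return counts


levels = dag_levels(file_size=10**12, chunk_size=10**6, max_links=174)
for depth, count in enumerate(levels):
    print(f"level {depth}: {count} node(s)")
```

Under these assumptions the million leaf chunks are covered by three layers of intermediate nodes, so no single node ever needs a million links.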

lidel commented 1 year ago

Continued in https://github.com/ipfs/specs/issues/316

ipfs/specs issue #162: unixfs spec missing