UnixFS protocol buffer message definition

schomatis commented 6 years ago

As said in the UnixFS specification there is no clear specification document and one needs to examine reference implementations (such as the one on this repo) to understand the format.

The protocol buffer message definition has no comments and ambiguous names like Data, which is actually used twice (as the name of the message and also one of its types) leading to very confusing code lines such as this one:

https://github.com/ipfs/go-ipfs/blob/1e9e2f453c435272433a96ebfba5c4319cc08534/unixfs/unixfs.go#L190

I don't think we can change any of the identifier names in the .proto file without breaking the API but at least we should document it as much as possible to avoid future confusions.

Also I would like clarify exactly what is the connection to the FSNode structure (@Stebalien help),

https://github.com/ipfs/go-ipfs/blob/1e9e2f453c435272433a96ebfba5c4319cc08534/unixfs/unixfs.go#L142-L154

which seems to be the counterpart of the protocol-buffer generated structure,

https://github.com/ipfs/go-ipfs/blob/1e9e2f453c435272433a96ebfba5c4319cc08534/unixfs/pb/unixfs.pb.go#L71-L79

but I'm failing to understand where do the responsibilities of one end and the other one start.

schomatis commented 6 years ago

One confusing relationship between those structures is how are sizes handled.

Data.Filesize: Size of the data that the structure is holding plus the file sizes of its children stored at Data.subtotal,

https://github.com/ipfs/go-ipfs/blob/1e9e2f453c435272433a96ebfba5c4319cc08534/unixfs/unixfs.go#L188

FSNode.subtotal: Sum of child nodes Data.Filesize that keeps being updated as more children are added,

https://github.com/ipfs/go-ipfs/blob/1e9e2f453c435272433a96ebfba5c4319cc08534/unixfs/unixfs.go#L173-L176

FSNode.blocksizes: Despite its comment that it has the total data size for each child, it contains (as showed in the code above for subtotal) the Data.Filesize of each of the child nodes (that is, not just the length of their Data.Data field nor the entire size of the Data structure).
Data.Blocksizes: It just mirrors FSNode.blocksizes,

https://github.com/ipfs/go-ipfs/blob/1e9e2f453c435272433a96ebfba5c4319cc08534/unixfs/unixfs.go#L189

I think the FSNode structure could just contain the Data structure (generated by protocol buffers) and manipulate its fields directly instead of mirroring many of them inside itself (which makes the code harder to follow because there's another level of indirection).

Something important to note is that the FSNode.subtotal (and the FSNode.blocksizes) is updated with the size of its child nodes at the moment the child is added,

https://github.com/ipfs/go-ipfs/blob/1e9e2f453c435272433a96ebfba5c4319cc08534/importer/helpers/helpers.go#L102-L103

(see AddBlockSize() above). This means that the child needs to be fully populated (its file size up to date) before it is added to the parent. I'm guessing this has already been taken into account as both importers (balanced and trickled) do a recursive (DFS) build of the DAG but I haven't seen that design decision documented anywhere and that might create problems when another layout is added (or the existing ones modified) that does some kind of BFS build.

schomatis commented 6 years ago

I think the FSNode structure could just contain the Data structure (generated by protocol buffers) and manipulate its fields directly instead of mirroring many of them inside itself (which makes the code harder to follow because there's another level of indirection).

I'm submitting a PoC PR (https://github.com/ipfs/go-ipfs/pull/5098) to research how (and whether) to integrate these two structures to avoid duplicating the UnixFS format fields. FSNode should be adding more functionality to the pb.Data message format, not more (unnecessary) fields.

schomatis commented 6 years ago

I don't think we can change any of the identifier names in the .proto file without breaking the API but at least we should document it as much as possible to avoid future confusions.

From what I'm reading in the PB documentation only the numbers and types are what matter in the .proto file for the encoder, so in addition to adding some comments the original names could be revisited. I'm especially interested in eliminating the Data duality and renaming Blocksizes to better reflect that it is actually a list of the children's Filesizes. @Stebalien WDYT?

Stebalien commented 6 years ago

Data, Data, Data

So, the difference here is "proto node" versus "unixfs node". "protobuf nodes" have a Data and Links field. where the Data field contains all the binary data and the Links field contains all the links to different protonode objects. In UnixFS, we stick a unixfs protobuf node inside the data field. This unixfs protobuf node also has a data field for file data.

Yes, this is confusing. It's one of the motivations behind unixfs v2 effort. Once we land unixfs v2, I'd like to reinterpret all "unixfs protobuf in protonode" objects as unixfs 2 objects (via a transform that happens on deserialization). That is, define a new IPLD format for this and hide all the details from the user. However, it's now one of the many "we want it but need to make time to work on it" things.

Also I would like clarify exactly what is the connection to the FSNode structure (@Stebalien help),

Answered in the #5098 discussion.

This means that the child needs to be fully populated (its file size up to date) before it is added to the parent. I'm guessing this has already been taken into account as both importers (balanced and trickled) do a recursive (DFS) build of the DAG but I haven't seen that design decision documented anywhere and that might create problems when another layout is added (or the existing ones modified) that does some kind of BFS build.

This is just a general requirement of DAGs. Regardless of what we do, we'll always have to update the parent anyways. Therefore, we may as well memoize information like "size".

Having accurate size information in every block is also important for seeking.

From what I'm reading in the PB documentation only the numbers and types are what matter in the .proto file for the encoder, so in addition to adding some comments the original names could be revisited.

Correct.

I'm especially interested in eliminating the Data duality and renaming Blocksizes to better reflect that it is actually a list of the children's Filesizes. @Stebalien WDYT?

So, we could rename data but I can't think of a better name. It really is the "data". I guess we could call it InlineData or FileData? As for blocksizes, it's a list of the filesizes of the "blocks" (children, as you put it).

However, if you can think of better names, we can consider it (we just have to be careful as a lot of code depends on these field names, including JS code).

schomatis commented 6 years ago

So, we could rename data but I can't think of a better name. It really is the "data".

Yes, but of the DAG layer, from that side you're storing generic data, but from the UnixFS side that is a file (or file part), the generic data UnixFS is storing is the field Data, which from an upper layer perspective that could even be something else, e.g., a DB record, but I wouldn't name the structure that represents that record as Data just because the UnixFS layer sees it as such.

As for blocksizes, it's a list of the filesizes of the "blocks" (children, as you put it).

That's something very important that causes a lot of confusion in the code (it always made me to trip up on the balanced builder): what is a block? Normally a call a block just DAG's lower layer information objects, that's it, but I'm seeing that sometimes the block term is used for leaf DAG nodes, and even if that is so, a child node is not necessarily a block, i.e., a leaf node, it could be another internal node (and that one is definitely not a block, at least I haven't seen any references as such). From what I gather working on FSNode, Blocksizes stores child node's Filesizes, it should reflect that, that it's just aggregating Filesize values, changing "file" for "block" makes it seems that we're no longer talking about the same kind of sizes, that it's the size of a different attribute.

However, if you can think of better names, we can consider it (we just have to be careful as a lot of code depends on these field names, including JS code).

Yes, that's what stopped me from doing it in the first place, I'll prepare a PR then for you to review and maybe someone from the JS camp could provide some feedback.

ipfs / kubo

UnixFS protocol buffer message definition #5074