Closed · kevina closed this 1 year ago
+1 to both file size only (not "physical" size) and +1 to dirent as a count.
I've always been completely bewildered by `ls` reporting dirs as 4k (the "physical" size, which I've precisely never cared about). Can't imagine there's much use for physical size reports on files either; the only context I can imagine is generating a report on the overall physical size use of an IPFS repo, and that would need to report non-unixfs objects as well, making a special inclusion of that in unixfs objects redundant at best.
I would still include both the physical size and the file size. I would not include directory sizes.
I'm trying to understand the use case for knowing the size of all the nodes and not just the data.
I wouldn't want to trust this kind of information for managing quotas since it's not a guarantee.
For space usage it's also not entirely accurate. There's no guarantee that because I have one of these blocks that I've succeeded in also storing the rest of the graph.
Having the content size of each file, and the cumulative size of all the files in each directory, is enough to show download progress.
If we adopt the `file-data` format we could even get away with not including the `size` attribute, since you can easily figure this out by looking at the `data` array.
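A minimal sketch of that point (the node layout and field names here are hypothetical, loosely modeled on the `data` array described above): the file size falls out of summing the chunk sizes, so a separate `size` attribute would be redundant.

```python
# Hypothetical unixfs-v2-style file node: the "data" array lists the
# file's chunks, each carrying its own content size in bytes.
file_node = {
    "type": "file",
    "data": [
        {"cid": "bafy...chunk1", "size": 262144},
        {"cid": "bafy...chunk2", "size": 262144},
        {"cid": "bafy...chunk3", "size": 1024},
    ],
}

def file_size(node):
    """Derive the file size by summing the chunk sizes in `data`,
    rather than storing a separate `size` attribute on the node."""
    return sum(entry["size"] for entry in node["data"])

print(file_size(file_node))  # 525312
```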
Can the directory size as the sum of all the directory entry sizes be included as well?
In v1 we can't calculate directory sizes without traversing all children of the node, as it may be a HAMT shard, so that is out of the question. But we can't create the directory unless we know which files are in it, so we do have the directory size at creation time. It seems weird to throw that information away.
I would not include directory sizes.
@Stebalien could you expand on why not?
@achingbrain that’s actually how it works now :) https://github.com/ipfs/unixfs-v2/blob/master/SPEC.md#ipld-dir
The `size` of a directory is the sum of all the `size` properties in `data`, so that includes the size of files and sub-directories.
However, this is the cumulative size of file “data” and not the size of the blocks. We got rid of that information because it doesn’t really work well in this new model where the block boundaries are transparent.
Also, as @warpfork reminded me today, we need to call out in the spec that while implementations of unixfsv2 MUST encode this accurately, readers of this data should consider the property advisory, since there is no way to guarantee it is accurate without parsing the entire graph.
Hooray!
V1 `DAGLink` sizes have been similarly untrustworthy since forever.
closing for archival
The old unixfs has two sizes: the file size, and the total size of the protocol-wrapped objects (the physical size). The same sizes were used for directory entries, except perhaps not for sharded directories (see #7).
The question is: are both sizes still useful to include? Based on some discussion on #2 I think maybe we should simplify things and just have the file size. Is the physical size even used anywhere?
In addition, the file size isn't really useful for directories. A better size to include would be a count of the number of entries. This count would also allow seeking in sharded directories (see #6).
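To illustrate how an entry count could enable seeking (a sketch only; the shard layout and `count` field are hypothetical, not from the spec): if each child shard link records how many entries sit beneath it, you can locate the Nth entry by skipping whole shards without reading their contents.

```python
# Hypothetical sharded directory: each child shard link carries a
# `count` of the entries beneath it.
shards = [
    {"cid": "bafy...s0", "count": 10},
    {"cid": "bafy...s1", "count": 25},
    {"cid": "bafy...s2", "count": 7},
]

def seek(shards, n):
    """Return (shard index, offset within that shard) for entry n,
    using only the per-shard counts -- no shard contents are read."""
    for i, shard in enumerate(shards):
        if n < shard["count"]:
            return i, n
        n -= shard["count"]
    raise IndexError("entry index out of range")

print(seek(shards, 12))  # (1, 2): entry 12 is the 3rd entry of shard 1
```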
Thoughts?