Adding additional file metadata to UnixFSv1

mikeal commented 5 years ago

Current UnixFSv1 importers do not encode most of the standard file metadata from most file systems.

This has been a particular challenge for package managers since they already rely on some of this metadata.

The goal of this issue is to surface all the necessary discussion points in order to drive a new PR against the unixfs spec.

Potential metadata

- Permissions
  - Executable bit
  - Ownership (user and group)
- Filename in file object
- mtime
- ctime
- atime

Additional considerations

For time stamps (mtime, ctime, atime) we need to decide if we’re going to use high precision times or not. Most systems expect a 32-bit integer (low precision) while other use cases may need a 64-bit integer (high precision).
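To make the precision trade-off concrete, here is a minimal Python sketch. It assumes timestamps count seconds since the Unix epoch. The helper names `from_epoch` and `split_ns` are hypothetical and do not come from any UnixFS implementation.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1   # largest signed 32-bit value
UINT32_MAX = 2**32 - 1  # largest unsigned 32-bit value


def from_epoch(seconds):
    """Convert whole seconds since the Unix epoch to a UTC datetime."""
    return EPOCH + timedelta(seconds=seconds)


def split_ns(nanoseconds):
    """Split a nanosecond timestamp into (whole seconds, leftover nanoseconds)."""
    return divmod(nanoseconds, 1_000_000_000)


# Last instant a 32-bit seconds counter can represent.
signed_limit = from_epoch(INT32_MAX)     # falls in January 2038
unsigned_limit = from_epoch(UINT32_MAX)  # falls in February 2106

# A high-precision time needs a second field (or a wider integer)
# to carry the sub-second part.
seconds, nanos = split_ns(1_500_000_000_123_456_789)
```

A 32-bit field stores whole seconds only and runs out in 2038 (signed) or 2106 (unsigned). Sub-second precision needs either a wider integer or a separate fractional field.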

Do we want to store additional metadata of the directory? How do we handle updating this when someone updates only a single file in the directory?

Where do we store this metadata?

In terms of the data format, should these properties be added to the File message or the Data message?

History

The history of this feature as well as meeting notes where this feature was prioritized are available here.

lidel commented 5 years ago

Could this also include support for opt-in setting of content type?

The spec @ 12a3d57 already has a field for this (but it does not seem to be wired to anything):

message Metadata {
    optional string MimeType = 1;
}

This would enable people to solve false-positives in content-type sniffing before v2 lands (https://github.com/ipfs/unixfs-v2/issues/11)

mikeal commented 5 years ago

Can someone more familiar with the original spec and implementations explain how the Metadata message is currently used? It seems obvious that we should leverage it and also start using the MimeType field but without knowing a bit more about the history and current usage I can’t tell if we’re likely to break anything.

warpfork commented 5 years ago

I'm not sure what to make of the MimeType idea. Unixy filesystems don't have a concept of MimeType; that's much higher level.

It certainly seems like a sizable embiggening of scope from the bullet points at the top.

mib-kd743naq commented 5 years ago

Can someone more familiar with the original spec and implementations explain how the Metadata message is currently used?

As far as I know it was never implemented in either go-ipfs or js-ipfs. The way it was supposed to work is somewhat described here. I would again strongly advise steering clear of this construct: wrapper-blocks carrying metadata are not... great.

ivan386 commented 5 years ago

@mib-kd743naq at the moment a metadata block can be included in a directory block by using an identity hash, and a file block can be included in a metadata block in the same way.

mikeal commented 5 years ago

I’d like to surface these tradeoffs so that the folks with use cases driving the need for it can comment appropriately.

Using the Metadata message will:

- Increase the graph depth by 1 for every file, and also by 1 for any directory we add metadata to.
- Increase the amount of de-duplication we can do of the main file metadata object (not the actual file data, since that’s de-duplicated either way).

Adding the metadata to the file/dir object itself will:

- Duplicate this file/dir object for two files that are effectively the same but have different metadata. Again, the actual file data is still de-duplicated either way.
- Avoid increasing the graph depth.
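The two layouts can be compared with a toy content-addressing model. This is only a sketch: the `cid` function hashes canonical JSON and stands in for real CIDs, and the node shapes (`inline_file`, `wrapped_file`) are hypothetical, not the dag-pb encoding.

```python
import hashlib
import json


def cid(node):
    """Toy content identifier: SHA-256 over a canonical JSON encoding."""
    encoded = json.dumps(node, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


# Leaf blocks holding the file bytes (a hypothetical two-chunk file).
chunks = [cid({"data": "chunk-0"}), cid({"data": "chunk-1"})]


def inline_file(mtime):
    """Metadata stored in the file node itself: no extra graph level."""
    return {"type": "file", "mtime": mtime, "links": chunks}


def wrapped_file(mtime):
    """Metadata node wrapping an unchanged file node: one extra graph level."""
    plain = {"type": "file", "links": chunks}
    return {"type": "metadata", "mtime": mtime, "links": [cid(plain)]}


# Same bytes, different mtime, under each layout.
a, b = inline_file(100), inline_file(200)
wa, wb = wrapped_file(100), wrapped_file(200)

inline_roots_differ = cid(a) != cid(b)          # file node is duplicated
chunks_shared = a["links"] == b["links"]        # data blocks are still shared
file_node_shared = wa["links"] == wb["links"]   # wrapper reuses the file node
```

In this model both layouts share the data chunks. The wrapper layout also shares the file node, at the cost of one more hop from root to data.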

I’d like people closer to the use cases to weigh in on which of these they find most compelling. @andrew @alanshaw @achingbrain

mib-kd743naq commented 5 years ago

@mikeal you are missing option 3 though: "metadata is part of the 'directory' entry"

mikeal commented 5 years ago

@mib-kd743naq I updated my comment to be “file/dir” in the case of directory metadata. If there is another option you’re suggesting we explore where the metadata for every file in a directory is added to the directory entry we’ll need to discuss that a bit more before I add it because that sounds quite problematic when we start dealing with sharded directories :(

ianopolous commented 5 years ago

We have a mime type field in file metadata in Peergos, so can relate our experiences. There are two things useful to be aware of. 1) a file can have multiple mime types depending on the context 2) some mime types can't be deduced until the entire file has been read

mib-kd743naq commented 5 years ago

@mikeal words are hard... instead next week during chaos camp I will attempt to build a PoC similar to my last large scale stress test, but this time for various types of metadata embedded in backwards-compatible-ish variants of dag-pb.

Then a concrete discussion based on actual blocks can be had.

achingbrain commented 5 years ago

I’d like people closer to the use cases to weigh in on which of these they find most compelling

Expanding the fields in the UnixFS data type seems like the most sensible path as adding extra nodes for each and every file will become expensive for very large file systems (package manager datasets, for example).

Two files with different metadata will have different root nodes, but I think this is fine as the file data is still de-duped across the two and fundamentally the metadata has to be stored somewhere. If we can do that without causing another network/disk/blockstore trip then great.

Stebalien commented 5 years ago

@warpfork the mime-type is there because users sometimes want to explicitly specify the MIME type. Unfortunately, this can be very important when using ipfs with a gateway. The alternative of just encoding this in a separate file and having the gateway interpret it was also discussed.

Some history: the metadata block was supposed to be used as follows:

{
  Data: {
    Type: Metadata,
    // stuff...
  },
  Links: [{Cid: ActualFile}]
}

Unfortunately, doing it this way would be a slightly breaking change (for users of this feature). Inlining metadata directly into files would not.

As @mib-kd743naq points out, we could also inline into directories. This also gives us fast LS (which is currently a bit annoying). We could even add file types to directories (the repeated information shouldn't be an issue).

The primary problem with this is that resolving to a CID and then copying wouldn't carry the metadata.

On the other hand, this isn't unreasonable. Names are already a part of the directory. Making metadata a part of the directory isn't all that odd. I'd expect most tools to reference files relative to directories anyways.
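A rough sketch of the directory-entry option, again using a toy model rather than the real dag-pb format. The entry fields (`cid`, `mode`, `mtime`) and the helpers `ls` and `resolve` are assumptions made for illustration.

```python
import hashlib
import json


def cid(node):
    """Toy content identifier: SHA-256 over a canonical JSON encoding."""
    encoded = json.dumps(node, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


file_cid = cid({"type": "file", "data": "hello"})

# Metadata sits next to the name, inside the directory block.
directory = {
    "type": "directory",
    "entries": {
        "run.sh": {"cid": file_cid, "mode": 0o755, "mtime": 1_560_000_000},
    },
}


def ls(directory):
    """List names with metadata using only the directory block."""
    return [
        (name, entry["mode"], entry["mtime"])
        for name, entry in sorted(directory["entries"].items())
    ]


def resolve(directory, name):
    """Resolve a name to a bare CID; the metadata stays in the directory."""
    return directory["entries"][name]["cid"]


listing = ls(directory)
bare = resolve(directory, "run.sh")
```

Here `ls` never touches a file block, which is the fast LS benefit. `resolve` returns only the CID, so anything copied by CID alone loses `mode` and `mtime`.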

A hacky alternative is to:

Embed the metadata into an intermediate block with the type "file".
Don't stick the actual data into this block. Instead, force it out into a second block.

This matches the original design without breaking anything.

warpfork commented 5 years ago

+1 towards the idea that if MIME type is getting well-known support, we should move towards the gateway knowing about it, rather than making it a feature of the filesystem. This would be a much closer set of relationships to how the rest of the world works already (e.g. doing sysadmin today with nginx or something, I would generally configure MIME types at the webserver area, and not in filesystem metadata) -- and thus seems much less likely to go awry.

Carefully avoiding baking in the idea of a single "mimetype string" field into our filesystem metadata also leaves much more room for issues to evolve around the things Ian mentioned:

a file can have multiple mime types depending on the context

some mime types can't be deduced until the entire file has been read

mikeal commented 5 years ago

PR is up now at https://github.com/ipfs/specs/pull/220

Note that I used uint32 for all the time data. In unixfsv2 we’re considering properties for 64bit high precision times but since uint32 is what most people expect I figured that was appropriate when adding these to unixfsv1.

warpfork commented 5 years ago

(I just commented this on the PR, but posting again here for discoverability for anyone who didn't follow the jump to the PR...)

I'd like to just mention a couple links to prior art that's not merely prior art, but also particularly easy to read and review for inspirations:

- the casync project persists a lot of things -- and this is particularly interesting because some of them are deeply silly IMO (even as flags), but nonetheless, it's quite a list of things you could preserve: https://github.com/systemd/casync/blob/e4a3c5efc8f11e0e99f8cc97bd417665d92b40a9/src/caformat.h#L82-L125 It's such a list of things you could preserve that I think it's a fantastic highlight of why the should-vs-could discussion is always, always important :)
- the reproducible builds group has done some comparisons of archive systems and what they support which is very comprehensive in its considerations: https://reproducible-builds.org/events/berlin2017/ReproducibleSummitIIIEventDocumentation/ , heading "Mapping out archive formats" (though this is badly munged, sorry to say)
  - EDIT: @ianopolous was so kind as to format this better!: https://gist.github.com/ianopolous/35c895b1473d533a2c485a49aaa1541b

Both of these (as well as the specs of tar, which I'm assuming everyone's at least given a cursory glance at already) are highly worth a quick skim just to see what other people have covered when trying to map this terrain.

There are large (large) bodies of thought on this out there already, and while we may or may not choose to do some things differently, we should make sure we're doing that on purpose. We'll be doing ourselves a sizable disservice if we add new features that unintentionally strike too far outside the norm by sheer accident of not having checked where the norm is.

achingbrain commented 4 years ago

I think this can be closed now - we've added mtime and mode to UnixFSv1, additional fields and arbitrary metadata will probably wait for UnixFSv2.
