ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/
Other
16.11k stars 3.01k forks source link

[Feature Request] Recognize and render an mtime metadata field for "Directory" unixfs nodes #3451

Closed mib-kd743naq closed 4 years ago

mib-kd743naq commented 7 years ago

( Please move to notes if this is the wrong forum )

Version information:

N/A

Type:

Feature/Enhancement

Priority:

P4

Description:

Compare the experience of visitng a conventional mirror listing and the experience of an IPFS-rendered directory.

The presence of mtime ( at least for me ) is a very valuable indicator of "how stale is this mirror exactly?". I am aware that the number can be faked, and this is fine: it can't be authoritative anyway, only the content can be. Yet it is really useful when eyeballing things.

IPFS currently does not make a provision for storing this information ( both the current implementation and whatever I could find about the IPLD plans ). I think it would be a large step in usability to make this possible.

On implementation: allowing specification of mtime within a UnixFS (or other type of ) content-node itself is a bad idea. Instead mtimes ( and possible other future metadata ) should be an optional property of links. This fits nicely into the sharded directory implementation we have going as well. Something like (note - this could very well be an IPLD-only feature, using mdag to kick-start discussion):

message PBLink {
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;
  optional uint64 PresentationTimestamp = 4;
}

or maybe even:

... optional timespec PresentationTimestamp = 4;

message timespec {
  required uint64 sec = 1;
  optional fixed32 nsec = 2;
}

Thoughts?

whyrusleeping commented 7 years ago

I think we should use the metadata unixfs type for this.

https://github.com/ipfs/go-ipfs/blob/master/unixfs/pb/unixfs.proto#L18

We havent gotten around to actually doing anything with this yet, but its basically what it was designed for

mib-kd743naq commented 7 years ago

@whyrusleeping I briefly considered it, and faced the following trouble:

whyrusleeping commented 7 years ago

the MimeType should be changed to optional. We shouldnt be using required fields anyways.

The metadata node links to the object it describes. They are designed to only point to a single child node.

mib-kd743naq commented 7 years ago

The metadata node links to the object it describes.

Hm... then... how does it work with a file in a dir? Let's take QmW4UsYZRtxSBtQgCkHEsFpJDMNYv8j4eFpEU97w1fEHtp:

QmW4UsYZRtxSBtQgCkHEsFpJDMNYv8j4eFpEU97w1fEHtp

             ||
             ||
             ||
  1 {
    1: Directory
  }
  2 {
    1: QmVzzr9ArZaQZSWTYgUpAN4fJrK46qmbYpf95CyD4fz7jd
    2: "foo"
    3: 109                    ||
  }                           ||
                              ||
                              ||
           1 {                ||
             1: Directory
           }
           2 {
             1: QmSMxgMY4ZpB9H7zbn3AhLNobnjjnzkfeEKwXaAJ9ibr1G
             2: "bar"
             3: 60                    ||                                        
           }                          ||                                        
                                      ||                                        
                                      ||
                    1 {               ||
                      1: Directory
                    }
                    2 {
                      1: QmQpeUrxtQE5N2SVog1ZCxd7c7RN4fBNQu5aLwkk5RY9ER
                      2: "baz"
                      3: 11                     ||
                    }                           ||
                                                ||                        
                                          1 {
                                            1: File
                                            2: "abc"
                                            3: 3
                                          }

Where do various Metadata nodes attach, and how does a containing (parent) directory find out about them for display?

whyrusleeping commented 7 years ago

Think about metadata nodes as semi-transparent wrappers. They can wrap any other unixfs node to augment it with some extra information. If you point directly to a file, and not the metadata object, then you obviously don't have metadata for that file.

mib-kd743naq commented 7 years ago

Think about metadata nodes as semi-transparent wrappers.

Understood. Not ideal from a low-level PoV as it doubles the amount of nodes one needs to use to represent a directory listing, but it is workable.

I will add some tentative examples to the "unorthodox-dag" I am putting together.

mib-kd743naq commented 7 years ago

@whyrusleeping thinking about it more... I still think it is incorrect to have an extra metadata node as a standalone object. Consider the earlier example listing. One would ideally want to expand this with:

Currently for any of these we would need to do an extra fetch for every entry in the directory, which even for moderately large directories will be expensive. If we add metadata nodes - then there are potentially two lookups needed (at least for the pointed-to-object id part)

Instead if we forget about the existence of metadata at the unixfs level, and instead redefine:

message PBLink {
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;
  optional UnixfsMetadata UfsMeta = 4;
}

Then a lot of the problems with lookups go away and link nodes can remain lean if they simply omit the metadata sub-structure.

@jbenet thoughts on the above?

mib-kd743naq commented 7 years ago

P.S. Yes, I am aware of the IPLD specification effort. But it doesn't seem to cover any of this either...

jbenet commented 7 years ago

Hey @mib-kd743naq -- some quick comments:

Currently for any of these we would need to do an extra fetch for every entry in the directory

To confirm we're understanding each other, there would be at most one Metadata object for each file, and not one Metadata object per metadata key-value entry. It's one dict. So instead of N nodes per entry in each dir, 2N nodes. And not kN (where k is the number of metadata entries used).

The reasoning for separating metadata out and making it a wrapper around the file (instead of embedded within the file node OR another node pointed to by the file node) is to make sure that all raw data versions of the file are deduped, and that changing the mode does not constitute creating a new root object. Granted, the bulk of the data is at the leaves, so those wouldn't change anyway, only the very root, but im not sure that preventing the extra node fetch is ideal here.

Let me ask it this way: why is an additional node a problem?

If it's (1), then we can proceed and optimize bitswap instead. if it's (2), then we should be working on optimizing the datastore.

I don't think either of this is reason not to create a larger graph-- in that a good ipfs implementation should make additional nodes not very expensive. They are indeed expensive now, so could we work on making them not expensive? instead of hiding the expense by changing the data model.


separately, i would like to know whether you think -- perf concerns completely aside -- the metadata object wrapper is confusing or bad from a data modeling perspective. I.e. is this the wrong thing to do? or just a "currently expensive but optimizable" thing to do?

jbenet commented 7 years ago

quick note: since the addition of IPLD and "raw data" nodes, "unixfs-file" has become differentiated from the raw data they contain. Therefore, it may be that storing the metedata directly in the file is no longer that big of an issue. That said, we should note that we would need to do this twice, once for files and once for dirs, instead of once (once for metadata, wrapping both files or dirs).


Another point pro metadata: it can be fetched comfortably by an ls call, as a metadata object should not be large, therefore figuring out a bunch of important info required without accidentally pulling a bunch of ~256K objects.

mib-kd743naq commented 7 years ago

The reasoning for separating metadata out and making it a wrapper around the file (instead of embedded within the file node OR another node pointed to by the file node)

@jbenet this is not what I was suggesting. I am suggesting that the metadata become property of the unixfs directory link itself. I.e. having this thing become this:

message PBLink {
  optional bytes Hash = 1;
  optional string Name = 2;
  optional uint64 Tsize = 3;
  optional UnixfsMetadata UfsMeta = 4;
}

Could you reread my original comment/question with that in mind and respond to that? A lot of what you wrote above is very reasonable, but alas does not apply to what I said. Sorry for the confusion... :/

If my reasoning still isn't clear - I will try to explain this in much greater detail sometime towards end next week: work is going crazy...

kalmi commented 7 years ago

I would like to push this forward (willing to contribute code) as having optional mtime metadata could enable novel use cases such as:

I am perfectly okay with the current Metadata-wrappers. Compared to @mib-kd743's suggestion, the only advantage I see in separate Metadata nodes is smaller memory/disk consumption when metadata is not present (and thus less to copy around). The different root hash argument is invalid as far as i can see, because the directory's root hash also changes by the introduction of a wrapper node. Metadata retrieval cost for large directories could be prohibitive, but I agree with @jbenet that that can be solved with bitswap.

If we are to extend the Metadata node with an optional ModTime field, is everyone okay with the timespec type proposed by @mib-kd743naq? I considered the possibility of having only second-level accuracy, but i believe that metadata should be preserved as accurately as possible.

mib-kd743naq commented 7 years ago

@kalmi thank you for pushing this further! Really glad someone else showed interest, as I myself got completely swamped at work and likely won't be able to get back to IPFS goodness for another 2 months :(

On the subject of where the metadata should live: I think I was misunderstood by pretty much everyone in this thread. I am still convinced that the mtime and other content-unaffecting metadata should be part of the link structure itself. The reason isn't efficiency during data access, but efficiency during metadata access ( i.e. browsing ).

To illustrate this differently: one of the projects for which I need this functionality is serving a standard ftp mirror over http but backed by IPFS. Under my proposal in order to display this listing one needs to download very few blocks. Under the "separate metadata nodes" proposal one needs to grab thousands of them. This is a non-trivial difference.

All of this notwithstanding - I will take anything at this point.

kalmi commented 7 years ago

@mib-kd743naq, I totally get what you are saying, and yes, the performance implications also seem scary to me, but I also get the others who wish to keep PBLink lean and continue the existing path, the Metadata node. I am bound to accept that it can eventually be solved efficiently with bitswap selectors if @jbenet says so.

kalmi commented 7 years ago

I suggest adding a --save-mtime boolean flag to ipfs add that would wrap all unixfs nodes in Metadata containing ModTime. I suggest adding a --restore-mtime boolean flag to ipfs get that woud set the mtime on the created files/directories/links based on the content of the Metadata wrappers.

The default behaviour wouldn't change one bit in either case.

After we have that, I will also suggest adding a Experimental.TrustModTimes boolean config option in a different issue. When this config option is true, IPFS could make use of some optimizations to make repeated adds and gets faster. I will get into that later in more detail. First, I just want to get ModTime into the Metadata structure, and implement the --save-mtime and --restore-mtime flags and their behaviour.

I would love to have some assurance that these additions would be welcome.

kalmi commented 7 years ago

Moved timespec message format definition discussion to: https://github.com/ipfs/notes/issues/60

mib-kd743naq commented 7 years ago

but I also get the others who wish to keep PBLink lean and continue the existing path

Actually this was never explained. There was an unambiguous "we are not modifying UnixFS anymore, all new work is going into IPLD", but there was no reasoning nor timeline behind this.

I follow and am excited about many new developments: the OpenBazaar integration work, the Peergos stuff, and many other neat things build on top of IPFS. But as @mguentner said back in November: the fs-stuff in IPFS is still not really usable.

@jbenet @whyrusleeping: could you please add some clarity on what plans if any are there to be able to faithfully represent content like this? I know it is not sexy work, but still - inquiring minds want to know ;)

Stebalien commented 4 years ago

Will be fixed by https://github.com/ipfs/go-ipfs/issues/6920.