ipfs / kubo

An IPFS implementation in Go
https://docs.ipfs.tech/how-to/command-line-quick-start/

Recursive add, and symlinks #5421

Open mitra42 opened 6 years ago

mitra42 commented 6 years ago

Version information: ipfs version 0.4.17-rc1

Type: Bug ?

Description:

I'm trying to upload the distribution directory for the Archive.org interface to IPFS so it can be accessed via IPFS (or via IPNS).

Problem is that the add is done with ipfs add -Qr dist, which appears to work. However, if you look at https://ipfs.io/ipfs/QmehDxtj8UjivCXFSV4XMXCh8zVxER4D57LguXogX8rsRD/ you'll see that the symlinks to the bundles aren't uploaded correctly, and clicking on them produces the ipfs cat /ipfs/QmehDxtj8UjivCXFSV4XMXCh8zVxER4D57LguXogX8rsRD/dweb-objects-bundle.js: cannot currently read symlinks message.

I understand that this raises issues (there are a number of open issues that relate to symlinks). I'm just wondering what IPFS thinks the "correct" behavior should be, i.e. is a failure to upload to be expected - which needs a permanent workaround - or is it a bug in ipfs add -r that will at some point get fixed (needing something short term until it's fixed)?

The problem is that it's non-trivial to work around. There is a reason these links are symlinks, and there is no (non-hacky) way I can see to fix just those files in the script that calls ipfs add. For example, if I run ipfs add on dweb-transports-bundle.js from the directory where that file actually lives, I can't easily insert the resulting link into the directory structure produced by ipfs add -r.

Personal opinion (and there may be good reasons I'm wrong :) would be that ipfs add -r should have an option that copies in the target of any symlinks it finds.

NiKiZe commented 6 years ago

I fully agree that symlink support needs to be resolved (it's causing issues for me as well, and as you said there are already open issues about it). Symlinks are stored as a special object:

ipfs object get QmctohSoXj49FXBMKL7w9gofBd1h5fUqDNsRKM3ToZF5dG
{"Links":[],"Data":"\u0008\u0004\u0012I../node_modules/@internetarchive/dweb-objects/dist/dweb-objects-bundle.js"}

In this case it points outside the root.
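
For anyone curious how that Data blob encodes the link, here is a minimal hand-rolled sketch (not kubo's unixfs package) that decodes it. It assumes the standard unixfs protobuf layout: field 1 is a varint Type where 4 means Symlink, and field 2 holds the target path as bytes (with a one-byte length prefix here, since the path is 73 bytes).

```go
package main

import "fmt"

func main() {
	// Bytes from the ipfs object get output above:
	// 0x08 0x04      -> field 1 (Type), varint 4 = Symlink
	// 0x12 0x49 ...  -> field 2 (Data), 73 bytes of target path
	data := []byte("\x08\x04\x12I../node_modules/@internetarchive/dweb-objects/dist/dweb-objects-bundle.js")

	if len(data) > 4 && data[0] == 0x08 && data[1] == 0x04 { // Type == Symlink
		length := int(data[3]) // one-byte length prefix of field 2
		fmt.Println("symlink target:", string(data[4:4+length]))
	}
}
```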

Magically replacing all symlinks on add could be considered a bug (but also a nice feature if it is an optional argument). A warning when symlinks are added would be good, especially when the link points "out of tree".

The gateway code (and a few other parts) needs support for understanding these symlinks; as long as the link is "in tree" that should be possible. Maybe the symlink format should be extended with a hash of the symlink destination.

Stebalien commented 6 years ago

Related: https://github.com/ipfs/go-ipfs-cmds/pull/96

Stebalien commented 6 years ago

So, @djdv is adding a --dereference-symlinks option to follow symlinks on add. However, I'm not sure if that entirely fixes your issue as it will erase the symlinks entirely.

For relative symlinks that don't leave the root, we could probably follow them on cat. However, this isn't really something we can do otherwise (for obvious security reasons, we can't leave the /ipfs/ namespace). Would this solve your use-case?

georgyo commented 6 years ago

I just ran into this problem myself.

What I want to do is set up an Arch Linux mirror in IPFS. The problem is that upstream mirrors heavily use symlinks to central pool directories.

Currently, ipfs stores the symlink just as a symlink. What I would like is for safe symlinks (in the rsync sense) to be stored as pointers to the other object, meaning the ipfs object would store both the symlink information (what path the symlink points to) and a reference to the ipfs object it would have pointed to.

Example: /ipfs/QmYm7QPVeMshN34b5nssMTgmvYiLDbSCEbUFwqAaNgMAgK/community/os/x86_64/go-ipfs-0.4.17-1-x86_64.pkg.tar.xz (QmTTcTHCMw2GAbK4KVuo8ex8tFQ2oGw1c9PHrLFd8ifgvt) is a symlink to /ipfs/QmYm7QPVeMshN34b5nssMTgmvYiLDbSCEbUFwqAaNgMAgK/pool/community/go-ipfs-0.4.17-1-x86_64.pkg.tar.xz (QmXLuLzshCejKHqP4fUdv7W7x5aVZKySjE8m8CKjLPut8f)

The first object would store the symlink (../../../pool/community/go-ipfs-0.4.17-1-x86_64.pkg.tar.xz) as well as the ipfs object it points to (QmXLuLzshCejKHqP4fUdv7W7x5aVZKySjE8m8CKjLPut8f)

If a user does ipfs get or cat on the symlink, it should resolve to the contents of the file, since the symlink by itself is useless.

If a user does a get on a directory structure that contains both the symlink and the target, it should download as a symlink.

Symlinks outside of the directory structure being added or retrieved should never resolve unless the user explicitly says to dereference them.

NiKiZe commented 6 years ago

@georgyo there is a project to do Arch mirroring, https://github.com/VictorBjelkholm/arch-mirror, which has its own workaround for symlinks. There are relevant discussions in that project and in its linked documentation as well.

I fully agree with your points.

Stebalien commented 6 years ago

Yeah, that's a reasonable approach.

michaelavila commented 6 years ago

I'd like to tackle this specific issue and am looking for the consensus solution.

Here are some things I've noted from the conversation so far:

  1. If the symlink doesn't leave the root node, we can leave it as a symlink (is that right?)
    • We might need to change things like cat and ls to follow these symlinks.
  2. If the symlink does leave the root node, what do we do?
  3. If there are any cyclic symlink structures, add should immediately fail with a relevant error message.

cc @Stebalien @mitra42 @NiKiZe @georgyo

kevina commented 6 years ago

I actually really like @georgyo's idea of storing both the link itself and the object it points to.

It would solve a long-standing objection I have to resolving symbolic links in the gateway, as mentioned in the unfinished PR here: https://github.com/ipfs/go-ipfs/pull/3508#issuecomment-267502271.

georgyo commented 6 years ago

Ultimately, I think mimicking all the symlink flags rsync has is the best idea, but that is a lot of scope creep for getting something working.

@michaelavila Here are my opinions on the questions you asked. I think you flipped when we would store a pointer to the object and when we would not.

  1. If the object is within the tree we are adding/syncing, then storing just the symlink is not sufficient. It needs to also store a reference to the ipfs object that it is pointing to. It would be impossible for a cat on the direct object to know what the bare symlink should point to, as it has no knowledge of the tree(s) it may be a part of.

  2. If the symlink does leave the root node, I think keeping it stored as just a symlink, as we do today, is the correct default. A --dereference option could be added to follow these symlinks.

  3. In this case a cyclic symlink structure and one with a depth of 1 billion are equally bad, causing the client or server to follow a chain forever. A max depth limit is the correct way to handle this. I think a recursion check is also a good idea, but the depth limit is a requirement to prevent DoS attacks.
    3a. It may also be good to block linking to /ipns/ objects.
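
As an illustration of the depth-limit point, here is a minimal sketch (hypothetical helper names, not kubo code) of following chained symlinks with a hop cap; the cap also catches cycles without a separate cycle detector.

```go
package main

import (
	"errors"
	"fmt"
)

const maxSymlinkHops = 40 // arbitrary cap, similar in spirit to Linux's ELOOP limit

// follow walks a map of path -> symlink target until it reaches a
// non-symlink entry or runs out of hops.
func follow(links map[string]string, p string) (string, error) {
	for i := 0; i < maxSymlinkHops; i++ {
		next, isLink := links[p]
		if !isLink {
			return p, nil // reached a real object
		}
		p = next
	}
	return "", errors.New("too many levels of symbolic links")
}

func main() {
	links := map[string]string{"a": "b", "b": "a"} // a cycle
	_, err := follow(links, "a")
	fmt.Println(err) // too many levels of symbolic links
}
```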

Another major point is what happens when I pull down the tree vs. what happens when I pull down just the object. In all these cases the symlink object has both the symlink information and a reference to the ipfs object it was pointing to.

Case 1: The symlink object points to a file in the tree that is being fetched.

Case 2: The symlink object points to a file outside of the tree that is being fetched.

michaelavila commented 6 years ago

@Stebalien after a little research, I better understand @georgyo's proposal, but I have some questions:

First, I think the path/hash pairing for symlinks is a nontrivial change, as it has to either:

Questions:

  1. Is the above true?
  2. Which of those do you prefer? (or what am I not considering?)
  3. How should we handle the retrieval of these symlinks in these two scenarios @georgyo mentioned?

If a user does ipfs get or cat on the symlink, it should resolve to the contents of the file, since the symlink by itself is useless.

If a user does a get on a directory structure that contains both the symlink and the target, it should download as a symlink.

michaelavila commented 6 years ago

@georgyo it looks like we commented at the same time, but I've read your comment and it makes a lot of sense, thank you. Also, good point about using the concept of max depth instead of cycle detection, as it covers both cases, thanks for that.

I think I understand the proposal and am willing to move forward with it if @Stebalien is OK with that and also weighs in on my question of what approach to take when storing the symlink path + hash.

Stebalien commented 6 years ago

I also like @georgyo's solution.

@michaelavila I was worried about that and thought this might have a unixfs-v2 dependency but, on further consideration, I think this is pretty trivial: We can just add a "target" link to the links array in the DagPB node (we don't need to modify the unixfs protobuf at all). Older clients will simply ignore this link. The tricky part will be updating the link; I'm not actually sure there is an efficient way to do this unless we create some kind of link index.
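
To make the "target" link idea concrete, here is an illustrative sketch of one possible node shape, mirroring the JSON printed by ipfs object get earlier in the thread. The link name "target" comes from the comment above, and the CID and path are the ones from @georgyo's Arch example; nothing here is an implemented format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type link struct {
	Name string
	Hash string
}

type node struct {
	Links []link
	Data  string
}

func main() {
	symlink := node{
		// Extra link that older clients would simply ignore.
		Links: []link{{
			Name: "target",
			Hash: "QmXLuLzshCejKHqP4fUdv7W7x5aVZKySjE8m8CKjLPut8f",
		}},
		// The unixfs symlink Data still records the relative path, as today
		// (shown here as a placeholder rather than raw protobuf bytes).
		Data: "<Type: Symlink, Data: ../../../pool/community/go-ipfs-0.4.17-1-x86_64.pkg.tar.xz>",
	}
	out, _ := json.MarshalIndent(symlink, "", "  ")
	fmt.Println(string(out))
}
```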

For adding and retrieval, this is what I'm thinking:

On add, when we encounter a symlink:

  1. If the symlink points within the directory we're adding, add an additional "link" (cid) to the symlink object that points to the actual file.
  2. If the symlink points outside of the directory we're adding, don't inject this link.

This should just work in all cases.
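
A rough sketch of that add-time decision (helper names are mine, not kubo's adder): resolve the symlink's target against the directory being added and only record the extra link when it stays inside that directory.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// targetInsideRoot resolves a symlink target (as read from the link itself)
// against the link's location and reports whether it stays under addRoot.
func targetInsideRoot(addRoot, linkPath, target string) (string, bool) {
	if !filepath.IsAbs(target) {
		target = filepath.Join(filepath.Dir(linkPath), target)
	}
	rel, err := filepath.Rel(addRoot, target)
	inTree := err == nil && rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
	return target, inTree
}

func main() {
	// The thread's dist/ example: the link points outside the added tree,
	// so no "target" link would be recorded.
	resolved, inTree := targetInsideRoot(
		"dist",
		"dist/dweb-objects-bundle.js",
		"../node_modules/@internetarchive/dweb-objects/dist/dweb-objects-bundle.js",
	)
	fmt.Println(resolved, inTree) // node_modules/.../dweb-objects-bundle.js false
}
```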

On get, when we encounter a symlink, we'll have to provide a few options. I believe @georgyo is suggesting:

  1. If the symlink points within the directory we're getting, just insert the symlink into the filesystem (unless we're using windows...).
  2. If the symlink points outside the directory and it includes a "target" link, replace the symlink with the target file. Alternative: Make a symlink that points to /ipfs/Qm.... Unfortunately, this relies on fuse so it's probably not an option (for now).
  3. Otherwise, just create the symlink.
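
For illustration, the get-time choices above might look something like this sketch (types and names are hypothetical, not kubo's):

```go
package main

import "fmt"

// SymlinkNode is a hypothetical view of a fetched symlink object.
type SymlinkNode struct {
	TargetPath string // the recorded relative path
	TargetCID  string // optional "target" link; empty if absent
	InTree     bool   // whether TargetPath stays inside the directory being fetched
}

type action int

const (
	writeSymlink action = iota // create the symlink on disk as-is
	copyTarget                 // fetch TargetCID and write the real file instead
)

func decide(n SymlinkNode) action {
	switch {
	case n.InTree:
		return writeSymlink // the target will exist in the fetched tree
	case n.TargetCID != "":
		return copyTarget // out of tree, but we know what it pointed to
	default:
		return writeSymlink // nothing better we can do
	}
}

func main() {
	n := SymlinkNode{TargetPath: "../pool/x", TargetCID: "Qm...", InTree: false}
	fmt.Println(decide(n) == copyTarget) // true
}
```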

The tricky part will be figuring out how to present this all to the user. We'll probably need something like:


Summarizing, the tricky parts will be:

  1. "doing the right thing" on get.
  2. Modifying a directory with a symlink without having to traverse the whole directory.

@michaelavila I believe the simplest thing to start with is resolution. That is, the ability to run ipfs cat /ipfs/QmAbc/a/b/c/d where c is a symlink to ../../x/y/z/. While we work on that, we can try to hammer out the remaining issues here.
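
As a sketch of that resolution step (assumptions only, not kubo's resolver), the in-root check for the /ipfs/QmAbc/a/b/c/d example could look like this: resolve the symlink component against its parent, re-append the remaining segments, and refuse anything that escapes the root.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// resolveSymlinkComponent rewrites linkPath (the path of the symlink itself)
// using its target, appends the remaining segments, and reports whether the
// result stays under root.
func resolveSymlinkComponent(root, linkPath, target, rest string) (string, bool) {
	resolved := path.Join(path.Dir(linkPath), target, rest)
	ok := resolved == root || strings.HasPrefix(resolved, root+"/")
	return resolved, ok
}

func main() {
	root := "/ipfs/QmAbc" // hypothetical CID from the example above

	// c is a symlink to ../../x/y/z, so a/b/c/d resolves to x/y/z/d.
	p, ok := resolveSymlinkComponent(root, root+"/a/b/c", "../../x/y/z", "d")
	fmt.Println(p, ok) // /ipfs/QmAbc/x/y/z/d true

	// An escaping target is rejected rather than followed.
	_, ok = resolveSymlinkComponent(root, root+"/a/b/c", "../../../../etc/passwd", "")
	fmt.Println(ok) // false
}
```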

Stebalien commented 6 years ago

Note: while not currently on our OKRs for this quarter, I'd still consider this a priority because proper symlink support is necessary for a 1.0 release.

mikeal commented 6 years ago

Trying to catch up here since this was referenced in https://github.com/ipfs/unixfs-v2/pull/16

I'm not sure if encoding the CID into the node is the best way to resolve this issue. The way this works in other systems is that symlinks are strictly local pointers.

A symlink copied from one filesystem to another may not resolve to the same file. This is one of the hazards of symlinks. If we want to solve this issue then we probably shouldn't use symlinks :)

If the goal is to have this node always point to the same data referenced by the symlink at encode time, then just make it the same unixfs file object. You're effectively trying to fix one of the hazards of symlinks and, in that case, you might as well go all the way and modify the actual representation to fit what you wish the filesystem was.

If we at any point need to preserve all the hazards of symlinks then we shouldn't try to encode the CID into them at all. Just encode where it points to and try to resolve it when you get the path in MFS.

What I see becoming increasingly problematic is trying to preserve several aspects of traditional filesystems while selectively solving specific issues, all at the same format layer. IMHO, the format layer should preserve as much about the original filesystem as possible, warts and all. If the ipfs add encoder wants to fix some things we don't like about the filesystem then it should probably do so by modifying the encoder to fit its new view of what the filesystem should be instead of complicating the format itself.

Stebalien commented 6 years ago

If we at any point need to preserve all the hazards of symlinks then we shouldn't try to encode the CID into them at all. Just encode where it points to and try to resolve it when you get the path in MFS.

The issue here was that IPFS isn't rooted so users may share /ipfs/CidOfFile, breaking the symlink. However, on the other hand, you're probably right: that's just a hazard of using a symlink and we shouldn't over-complicate the UnixFS layer.

teliten commented 2 years ago

I need a resolution; what's the point in doing "nothing"? People are using IPFS to store massive amounts of data. My personal opinion is that you should duplicate the file in full in place, and do the same thing you should always be doing: ensure it's only really stored once in IPFS.

Choosing which file is the "hard file" if both were restored doesn't matter to me at this particular moment, though it does matter. I am trying to back up a database file that would, importantly, live on another partition (not my IPFS backup drive) if it were live. It's not, so forget it. I have to copy the largest dataset I've ever seen over and over now because IPFS doesn't make anything immutable.

Edit: even paying for and utilizing every provider you have available - Estuary, Pinata, <f/fit> ... Eternum, Inferno ... I mean, "infer" that there's a data retention problem in the universe. I just noticed your Wikipedia ... problems will be handled. Right now, IPFS has a problem: nobody is offering pinning services that are reliable, and you aren't offering a service that doesn't force me to ensure I have my data stored somewhere else as well, which I do.

peace.

(((( My personal solution will always be the same: if I don't have an SLA, I will use IPFS as "hopeful" forever storage, and anything I need to have in more than just my own datacenter, or more than just on S3 buckets, will be in 3 places, and not just on FIL. ))))