Fix/stage symlinked files

evgmik commented 3 years ago

Fixes #101

Also fixes exotic case when now to be staged files are detected (i.e. we asked to stage symlinked directory), in this case there were endless wait for pbars

sahib commented 3 years ago

I guess the symwalk package can be copied and adjusted to support:

A maximum depth of traversal.
Not visiting the same node (identified by inode/device_id) more than once.

evgmik commented 3 years ago

I guess the symwalk package can be copied and adjusted to support:
* A maximum depth of traversal.

What we already have does it. I run some test and I see no obvious bugs in it.

* Not visiting the same node (identified by inode/device_id) more than once.

I would disagree on this. I think this is the feature of symlinks: multiple links can point to the same inode. If we remove it, it would be very similar as not having symlinks at all (except the cases when they point outside of the staging root).

My use case: having a template dir (with makefiles, styles, and common configs). I have quite a lot of link to this template dir within a working tree. Think 'lectute template' to which weekly lectures point.

evgmik commented 3 years ago

Also, did we agree on an option to follow links and/or to ignore errors?

This is in my TODO list. I was looking for a general approval of the symlink handling, before adding extra options.

An embarrassing story: I spent 5 hours figuring out how to append a slice returned by a function. The go error reporting was very unclear about it. ... did the work, but it took me long time.

sahib commented 3 years ago

I guess the symwalk package can be copied and adjusted to support:
* A maximum depth of traversal.
What we already have does it. I run some test and I see no obvious bugs in it.
* Not visiting the same node (identified by inode/device_id) more than once.
I would disagree on this. I think this is the feature of symlinks: multiple links can point to the same inode. If we remove it, it would be very similar as not having symlinks at all (except the cases when they point outside of the staging root).

My use case: having a template dir (with makefiles, styles, and common configs). I have quite a lot of link to this template dir within a working tree. Think 'lectute template' to which weekly lectures point.

Hmm, you have a point here. I was more worried about such pathological (hah, "path") cases:

λ werkbank sub → cd /tmp && mkdir sub && cd sub
λ werkbank sub → dd if=/dev/urandom of=file.big bs=1M count=100
λ werkbank sub → ln -s . link
λ werkbank sub → ls
file.big  link
λ werkbank sub → cd link
λ werkbank link → cd link
λ werkbank link → cd link
λ werkbank link → cd link
λ werkbank link → cd link
λ werkbank link → cd link
λ werkbank link → cd link
λ werkbank link → cd link
λ werkbank link → pwd
/tmp/sub/link/link/link/link/link/link/link/link

...in that case the current code would add an infinitely nested structure, each level with file.big in it. You could say that this is normally not true, but I consider such things a security issue because you could easily DDoS a machine with brig if you somehow manage to create a symbolic link and make it stage something.

So both solutions lack something:

Ignoring cycles limits valid use cases.
Not ignoring cycles opens room for malicious input.

I really don't like the concept of symbolic links. Without supporting some sort of "link" in brig we can't really cover both cases and that is really out of scope. Maybe one way would be to "burn" a directory-symlink, so we can traverse it only once and won't the next time. Since I can't come up with a good solution here you have my go to proceed. Might play around at some point, so please leave a TODO with the problem described above.

An embarrassing story: I spent 5 hours figuring out how to append a slice returned by a function. The go error reporting was very unclear about it. ... did the work, but it took me long time.

Sorry to hear that :grin: The ... thing isn't very intuitive also. I think I wrote a for loop around it the first time.

evgmik commented 3 years ago

λ werkbank sub → cd /tmp && mkdir sub && cd sub
λ werkbank sub → dd if=/dev/urandom of=file.big bs=1M count=100
λ werkbank sub → ln -s . link
λ werkbank sub → ls
file.big  link
/tmp/sub/link/link/link/link/link/link/link/link
...in that case the current code would add an infinitely nested structure, each level with file.big in it. You could say that this is normally not true, but I consider such things a security issue because you could easily DDoS a machine with brig if you somehow manage to create a symbolic link and make it stage something.

Just small correction: we already have recursion depth protection with the current code. The limit is set to 255 in the current walk code, see 6a1e67f line 73. What would be nice is to reference file.big to the same backend hash. Unfortunately, we lost the automatic way for this when we switched to random encryption key from the content generated one.

But since I collect long list of files toBeStaged I can see which of them dereferenced to the same local file. And just copy the node info. This will save quite a lot of space at the backend for some scenarios, it also could speed up staging for heavily linked cases.

We might even attempt it with directories, but this is scary. I am not sure our linker will survive it.

So both solutions lack something:
* Ignoring cycles limits valid use cases.

* Not ignoring cycles opens room for malicious input.
I really don't like the concept of symbolic links. Without supporting some sort of "link" in brig we can't really cover both cases and that is really out of scope. Maybe one way would be to "burn" a directory-symlink, so we can traverse it only once and won't the next time. Since I can't come up with a good solution here you have my go to proceed. Might play around at some point, so please leave a TODO with the problem described above.

Sure thing I will leave the TODO.

evgmik commented 3 years ago

I am still working on proper link handling.

evgmik commented 3 years ago

still working on it. In particular, on deduplication: if several repo paths points to the same content via symlinks during staging, then we should just copy repo node to new repopaths. This is how I triggered bug #105

sahib commented 3 years ago

How are you implementing the linking? Are you using fs.Copy() or do some low-level trickery? Or something different entirely more on client-side?

evgmik commented 3 years ago

How are you implementing the linking? Are you using fs.Copy() or do some low-level trickery? Or something different entirely more on client-side?

I am going to use fs.Copy(). It seems to be sufficient except it does not copy to non existing path (at least on a first glance).

evgmik commented 3 years ago

Status update. Looks like I completed reflinking of the symlinked files. Next is proper treatment of the symlinked directories, then implementation for options.

evgmik commented 3 years ago

Still working on hardlinks and dirs

evgmik commented 3 years ago

Do we really want to implement hardlinks reflinking?

To track hardlinks we have to use OS dependent package x/sys. I would argue that this is an overkill and will be hard to maintain. The price we pay is some data duplication (I start to regret our decision on a random key choice for every file, instead of content based.)

I personally never used hardlinks in real life. But I need a weighted @sahib opinion on it.

One more thing: I decided to skip copying symlinked directories. The current algorithm walks such directories and copies its content, instead of fs.Copy(src_dir, dest_dir). But I think this is not performance crucial. The reason: it is hard to implement properly, since I need to track which dir is already staged to do its copy, so the code grows for a questionable speed up.

Other than this, I will be working on skip symlinks and continue on the minor staging errors switches/options.

sahib commented 3 years ago

Do we really want to implement hardlinks reflinking?

It's not an easy discussion. Here's my take.

I care mostly for consistency. If we support symbolic links we should also support hardlinks. To be fair, using fs.Copy is more an optimization more or less invisible to the user, but still. Good software should support all normal operation that a user encounters or, if that seems to be out of reach for whatever reason, limit itself to a sensible subset. The latter is the way that Go choose, by ignoring everything in filepath.Walk() that's not a regular file. So, let's ask ourselves if we're consistent:

Staging works for both hardlinks, symbolic links (both regular and directory) and regular files.
The copy-optimization works only for symbolic links, but only if they point to single files.
Broken links and loops are handled. That's more a security requirement though.

The second point is inconsistent, but I would be okay to let it slip since it's a non-essential optimization and it's likely that most users don't even care or know. So I'm kinda okay with the current state of not explicitly supporting for the following reasons:

There is no proper generic walk implementation that does the right thing. Maybe some day I will write one...
I don't see it as an highly important optimization right now. I think it's way to early for such detail optimization in general.

I personally never used hardlinks in real life. But I need a weighted @sahib opinion on it.

I use them quite often. For example parts of my music library are hardlinks and I use them to create "playlists" that can be easily copied to elsewhere without fear of tools that do not handle symbolic links right. But also many tools like ostree, git and many more are based on them. Personally, I see them more often than symbolic links.

To track hardlinks we have to use OS dependent package x/sys.

Well, yes. It adds a bit of complexity, but not much. One could just have a "key" type that is either dev/inode on unix or path on windows.

Other than this, I will be working on skip symlinks and continue on the minor staging errors switches/options.

:+1:

evgmik commented 3 years ago

I was convinced that hardlinks are a nice feature (should I expect anything different from rmlint author ? :)

I have added hardlinks reflinking with 9f93d01. It is tested only on linux, and on other systems should fall back to just staging a hardlinked file independently (not sure it supported by Windows at all).

Technically, it should work on any unix like system, not just linux. But I do not see the way to make a build flag similar to

// +build linux

if I replace linux with unix the system detects both inodeString declarations. But it should be a minor thing which could be tested if we get a hold of a Mac

sahib commented 3 years ago

Nice! Implementation looks okay to me. Will have deeper look once the options are also implemented.

[...] should I expect anything different from rmlint author?

Damn, my past catches up. :smile: But yes, not scanning hardlinks twice was an issue there.

Technically, it should work on any unix like system, not just linux. But I do not see the way to make a build flag similar to [...]

If I understood you right you could have the following configuration (untested, but did something similar once):

$ ls inode_*
inode_unix.go inode_other.go
$ head -1 inode_unix.go
// +build !windows
$ head -1 inode_other.go
// +build windows

Under the assumption that pretty much everything beside windows is a unix or at least has inodes. Even if that does not apply I think it's better to keep it like that so users can complain when compiling for this platform.

evgmik commented 3 years ago

If I understood you right you could have the following configuration (untested, but did something similar once):
$ ls inode_*
inode_unix.go inode_other.go
$ head -1 inode_unix.go
// +build !windows
$ head -1 inode_other.go
// +build windows
Under the assumption that pretty much everything beside windows is a unix or at least has inodes. Even if that does not apply I think it's better to keep it like that so users can complain when compiling for this platform.

Excellent suggestion, implemented with 38704e1

Now, back to the staging options...

evgmik commented 3 years ago

I have added options:

to skip symbolic links: --no-dereference, -P
to continue on error during staging: --continue-on-error, -c

@sahib I think this patch set is ready for the final review

sahib commented 3 years ago

Small personal note: I have to sort out some personal stuff and and will be slow to respond/develop in the next 1-2 weeks.

evgmik commented 3 years ago

Small personal note: I have to sort out some personal stuff and and will be slow to respond/develop in the next 1-2 weeks.

No worries. I will start working on 'copy' related bugs #105 and #106

sahib / brig

Fix/stage symlinked files #102