Operations on the FUSE file system can affect files not visible to the FUSE file system

richardxia commented 10 months ago

cc @ngraybeal, who did the bulk of the work in getting a repro and narrowing it down

It looks like there is a file system sandbox safety issue in scenarios where a file exists in the host FS but is not made visible to the sandboxed process and the sandboxed process attempts to create a file at the same relative path. Although this was intentional in some scenarios, it can lead to very unexpected results and even safety issues.

Reproduction steps

Here's the repro that @ngraybeal came up with:

1st job creates a foo directory with a couple files. 2nd job creates the same foo directory, writes to one of the same files as the 1st job, proceeds to cp the contents of foo to a new dir foo2, it then mv foo to foo3 and does removes foo3.

package bug
from wake import _
export def test _args =
    def cmdline =
        """
            mkdir "foo"
            echo "hello" > foo/hello.txt
            touch foo/cmdline1.txt
        """
    require Pass _ =
        makeShellPlan cmdline Nil
        | runJob
        | getJobOutputs
    def cmdline2 =
        """
            mkdir "foo"
            echo "goodbye" > foo/hello.txt
            touch foo/cmdline2.txt
            cp -rf foo foo2
            mv foo foo3
            rm -rf foo3
        """
    makeShellPlan cmdline2 Nil
    | runJob
    | getJobOutputs

Here's what I expect the output in my file system to be.

foo
- hello.txt (with "hello")
- cmdline1.txt
foo2
- hello.txt (with "goodbye")
- cmdline2.txt

Here's what actually gets written

foo2
- hello.txt (with "goodbye")
- cmdline2.txt
foo3
- cmdline1.txt

Alternative Scenario

There's a different variation of this that affects a single job but multiple runs. One easy way to hit this is when creating a Python virtual environment and upgrading the version of the pip package manager. Here is a rough repro, but it will depend on an external resource for providing a version of Python:

export def pipTest _args =
    """
    python3 -m venv venv
    ./venv/bin/pip install -U pip
    """
    | makePlan "pip test" Nil
    | setPlanPersistence Once
    | setPlanResources ("python/python/3.10.6", Nil)
    | runJob

Running this twice causes a strange ~ip directory to appear in the venv/lib/python3.10/site-packages/ directory, even though it doesn't appear the first time around.

The reason for this requires a bit of extra explanation of what pip is doing under the hood. In the above example, we are actually installing pip twice: once when creating the virtual environment and once with the explicit pip install -U pip command. When pip upgrades a package, it moves the older version of the package to a temporary directory that starts with a tilde (the ip in ~ip come from pip). If the upgrade is successful, it deletes the temporary directory, but if it fails, then it restores the temporary directory.

The way this interacts with the wake FUSE file system is the following:

Run 1:
- venv is created with pip already installed in it
- We ask pip to upgrade itself
- pip moves the old files from pip to ~ip
- pip installs the newer version of pip
- pip removes ~ip
- Everything is synced properly to the host FS
Run 2:
- Note that the venv already exists on the host FS, but we don't make it visible in the FUSE FS
- venv is created in the FUSE FS with pip already installed
- fuse-waked allows all writes to that venv/ directory through to the host FS. The host FS now has a bit of a frankenstein of files from both versions of pip installed (the newer version from the old run and the older version from the new run)
- We ask pip to upgrade itself
- pip moves the old files from pip to ~ip
- fuse-waked executes a mv on the host file system, but this moves files that were in the pip directory that were invisible to sandboxed process into the ~ip` directory as well
- pip installs the newer version of pip
- pip removes ~ip
- fuse-waked removes each file in ~ip that was visible to the sandboxed process from the host FS. This leaves around files that the sandboxed process did not know about but that fuse-waked had moved anyway a few steps above. The end result is that the ~ip directory still exists in the host FS

JakeSiFive commented 10 months ago

Long term solution proposal:

ditch our current fuse implementation and switch to overlayfs
wakebox will create a "read" directory and a "write" directory somewhere hidden like in .build
wakebox will then hard link all visible files into the read directory
wakebox will then create an overlayfs daemon with a read only layer using the read directory, and the write directory over that
After everything is done the outputs can be collected by wakebox by traversing the write directory
"merge" everything back, handling collisions somehow (possibly just by overwriting)

JakeSiFive commented 10 months ago

From there you're close to virtulization, you just also need a blob store that you read/write to when constructing these sandboxes. The blob store can be cleaned up at the end of operation. Non-sandboxes jobs will then have to require a lock to run and you can then run multiple wakes in parallel.

richardxia commented 10 months ago

I assume that OverlayFS doesn't have a way to track reads? If so, then one thing to note is that this will remove a feature of the FUSE runner, which is that you can throw a ton of visible files at a job, but wake will keep track of only the ones that were read by the job in order to determine when it will need to be rerun in the future. If we're OK with this tradeoff, then we probably at least need to do an audit of our internal jobs to see where we should be more specific about which files are sent as visible files. I think this mostly just affects source files (e.g. ones committed to Git), since it's a common pattern to source everything in a directory tree and then pass it to the compiler to go figure out what the dependency chains were.

I guess another option could be to continue using FUSE but to have it be backed by a temporary OverlayFS volume? This would give us better encapsulation while also allowing us to track reads.

JakeSiFive commented 10 months ago

Yeah it has no way to track reads but I think thats a good thing because it was never implemented correctly and implementing it correctly buys us very little. It also greatly simplifies job match criteria

V-FEXrt commented 10 months ago

I can't remember, but did we not already make the change that jobs rerun based on visible list and not read list? I know we talked about it before but I don't remember what the resolution was

richardxia commented 10 months ago

I can't remember, but did we not already make the change that jobs rerun based on visible list and not read list? I know we talked about it before but I don't remember what the resolution was

Yeah, @JakeSiFive let me know offline that we did that. We apparently did that in the runner of our internal codebase rather than in the wake repo itself.

sifive / wake

Operations on the FUSE file system can affect files not visible to the FUSE file system #1501

Reproduction steps

Alternative Scenario