rust-lang / rustup

The Rust toolchain installer
https://rust-lang.github.io/rustup/
Apache License 2.0

Rustup (including proxies) is not safe for concurrent use #988

Open matklad opened 7 years ago

matklad commented 7 years ago

(Edited to capture all the details that have emerged over time)

Recovering from this bug:

Usually just running rustup component remove NAME && rustup component add NAME will fix things. Sometimes the entire toolchain needs to be removed; in rare cases rustup itself has to be uninstalled entirely.

User model of Rustup

Rustup may be run as three different commands:

Locking in Rustup

Rustup gets run concurrently in two very different contexts. Within a single (machine, user) pair it may be run concurrently by the user, an IDE, or both, to perform tasks ranging from toolchain installation and component addition to opening documentation. All of these require multiple reads of the rustup data with a consistent view; some of them require writes as well.

Rustup may also be run across machines that share a single rustup install - (machineA, userA) + (machineB, userA) - and in this case the same set of operations may take place, with the same requirements.

Whatever consistency solution we adopt should ideally cover both use cases without requiring manual lock administration (as lockdir-style solutions do) or additional long-running network daemons.

Proxies

Rustup has one set of proxies shared across all toolchains; the proxies are held open while any process from a toolchain is running - e.g. IDEs hold rls open for extended periods.

We need a lock to prevent concurrent attempts to install new proxies, and we need a notification mechanism back to the running proxies so that they can be told to exit when an update is required (because of presumed limitations of replacing in-use files on Windows, though recent changes may mean we can avoid this).

Toolchains

We have many toolchains in one rustup installation: stable, beta, nightly, dated nightly versions, and custom versions. Adding a toolchain adds a directory and a hash file; we need a lock to prevent collisions when moving the directory into place. Deleting a toolchain does a recursive rm in place, which also needs a lock so that other rustup invocations don't presume the toolchain is installed while the deletion is taking place (or perhaps we need to rename-then-delete, though that can run into race conditions with virus scanners, especially if the toolchain was just installed). Further, permitting deletions at any point will require notifying running rls proxy processes from that toolchain so that they shut down, or the .exe is likely not deletable on Windows.
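
A minimal sketch of the rename-then-delete variant mentioned above (the naming scheme is illustrative, not rustup's actual layout): the rename is a single atomic step, so other invocations stop seeing the toolchain immediately, and the slow recursive delete then runs on the renamed copy.

```rust
use std::fs;

/// Remove a toolchain by first renaming its directory aside (atomic on the same
/// filesystem), then recursively deleting the renamed copy at leisure.
fn remove_toolchain(path: &str) -> std::io::Result<()> {
    let trash = format!("{path}.delete-{}", std::process::id());
    fs::rename(path, &trash)?; // the toolchain vanishes from its usual name in one step
    fs::remove_dir_all(&trash) // the slow part happens where no other invocation looks
}
```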

Components

Components such as rls are added into a toolchain directory, and adding them also involves writing to a metadata file within the toolchain tree itself. This needs to be locked to avoid corruption and dropped writes. As with toolchains, we need proxy notification for component removal, as well as a way to make sure that a component being removed does not have new instances spawned between the start of the removal and its completion.

Downloads

We download packages for rustup itself, toolchains and additional components for toolchains, and (a corner case) custom installer executables for toolchains. We also download digital signature metadata files.

The same file can be downloaded by two different rustup invocations for two different reasons. For instance, downloading nightly and a dated nightly for today will download the same file(s).

We used to leak partially downloaded files, and recently started deleting all download dir contents when rustup finished running; this is causing errors now.

We need some mechanism to deal with leaks, but also to permit concurrently executing rustups to download files without interruption. A date-based or locking-based mechanism would possibly be sufficient.
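
As an illustration of the date-based option (the directory name and the 24-hour grace period are made up here): rather than wiping the download dir on exit, only remove entries old enough that no concurrent rustup can still be writing them.

```rust
use std::fs;
use std::time::{Duration, SystemTime};

/// Delete download-dir entries that have not been touched for a day, leaving any
/// in-flight download from a concurrent rustup invocation alone.
fn clean_downloads(dir: &str) -> std::io::Result<()> {
    let cutoff = SystemTime::now() - Duration::from_secs(24 * 60 * 60);
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        if entry.metadata()?.modified()? < cutoff {
            fs::remove_file(entry.path())?; // stale partial download
        }
    }
    Ok(())
}
```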

Network file systems & rustup

Linux handles SMB mounts and locking on that filesystem well, at least per my reading of the source - a rustup dir on an SMB-mounted file system using regular POSIX locks should have those locks reflected as SMB locks.

NFS is well known for having poor lock behaviour when the network services are not running or are firewalled; the underlying syscalls are only non-blocking with respect to the file descriptor itself, and EWOULDBLOCK is defined as the lock already being held, not as the OS being unable to determine whether the lock is held...

  [EWOULDBLOCK]
        The file is locked and the LOCK_NB option was specified.

So it is likely that we will see bug reports of rustup hanging when ~/.rustup is on NFS and the NFS server's lock RPC service is misconfigured or flaky. I think this can be mitigated by emitting an NFS-specific log message, once per process, when taking out a lock on NFS; with a config option to disable that for users who are confident they don't need it... and a bug-reporting hint to tell us they have disabled it.

Locks and notifications

OS locks typically don't have callback notifications built in; polling is one answer, or some form of lightweight message bus (e.g. zmq) with clear rules tied into the lock management layer. We have to be careful about race conditions though, in particular whether to notify before or after an operation, as appropriate.

brson commented 7 years ago

rustup needs to have concurrency support bolted on in general. Need to extract cargo's flocking code.

brson commented 7 years ago

To start with we can just bolt on a flock to the entire runtime of rustup.
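
For illustration, a minimal sketch of that idea using the fs2 crate (which is what cargo's flock code is built on); the lock file path is made up here.

```rust
use fs2::FileExt;
use std::fs::OpenOptions;

fn main() -> std::io::Result<()> {
    // One lock file under the rustup home, held for the whole invocation.
    let lock = OpenOptions::new()
        .write(true)
        .create(true)
        .open("/home/me/.rustup/.rustup-lock")?;
    lock.lock_exclusive()?; // blocks until any other rustup invocation is done

    // ... all of rustup's existing work would run here ...

    Ok(()) // the flock is released when `lock` is dropped
}
```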

matklad commented 7 years ago

I think I can probably work on this: I have some experience with Cargo's file [dead]locking :)

brson commented 7 years ago

@matklad ooh awesome. Can you extract cargo's fancy flock code to its own crate to share with rustup?

brson commented 7 years ago

Oh there's a tricky bit here in that rustup is reentrant. Can't just use a simple global lock.

matklad commented 7 years ago

Oh there's a tricky bit here in that rustup is reentrant. Can't just use a simple global lock.

Where exactly does reentrancy happen?

I don't yet understand how rustup works, but it looks like adding a file lock to the Transaction should do the trick, unless transactions can nest and deadlock each other?

fenhl commented 5 years ago

Hi, is this still an issue? It might explain some problems I've been having with rustup.

kinnison commented 5 years ago

Yes, currently we have no locking on the toolchain directories IIRC.

rbtcollins commented 4 years ago

@matklad did that fancy flock code get extracted?

matklad commented 4 years ago

No: https://github.com/rust-lang/cargo/blob/master/src/cargo/util/flock.rs

I’ve also absolutely dropped the ball on this one :)

rbtcollins commented 4 years ago

No worries. Looking at it, the semantics you have are different enough that it wouldn't be useful (e.g. creating directories on demand would open up TOCTOU race conditions in rustup's secure untar handling), and we don't want to lock each file because the syscall latency would destroy performance on NFS and Windows and so on.

I'll take the code as inspiration and do a similar thing though :) cheers!

rbtcollins commented 4 years ago

Ok, so looks like cargo has problems on NFS which I was worried about:

cifsfs supports oplocks so SMBfs should lock properly:

fs2, which cargo uses, does use LOCK_NB with flock: flock(file, libc::LOCK_EX | libc::LOCK_NB) (https://tantivy-search.github.io/tantivy/src/fs2/unix.rs.html#36)
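
For reference, the pattern cargo builds around that call is roughly: try the non-blocking lock first, and only fall back to a blocking wait after telling the user what it is waiting on. A condensed sketch with fs2:

```rust
use fs2::FileExt;
use std::fs::File;

fn acquire_exclusive(lock: &File, what: &str) -> std::io::Result<()> {
    match lock.try_lock_exclusive() {
        Ok(()) => Ok(()),
        // Contended: report, then take the blocking LOCK_EX (no LOCK_NB) path.
        Err(e) if e.raw_os_error() == fs2::lock_contended_error().raw_os_error() => {
            eprintln!("info: blocking waiting for file lock on {what}");
            lock.lock_exclusive()
        }
        Err(e) => Err(e),
    }
}
```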

questions:

I'm inclined to provide a UI warning about NFS when taking the lock, with an env variable to silence the warning for folk who know things work properly in their environment (using the NFS sniffing logic from cargo); but then assume that the fs is working properly.
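
A rough Linux-only sketch of that warning, reusing cargo's statfs()-based NFS sniffing; the RUSTUP_NO_NFS_WARNING variable name is illustrative, not an existing setting.

```rust
use std::ffi::CString;
use std::os::unix::ffi::OsStrExt;
use std::path::Path;

/// Returns true if `path` lives on an NFS mount (same check cargo performs).
fn is_on_nfs(path: &Path) -> bool {
    let Ok(c_path) = CString::new(path.as_os_str().as_bytes()) else { return false };
    let mut buf: libc::statfs = unsafe { std::mem::zeroed() };
    unsafe {
        libc::statfs(c_path.as_ptr(), &mut buf) == 0
            && buf.f_type as u32 == libc::NFS_SUPER_MAGIC as u32
    }
}

fn warn_if_nfs(lock_path: &Path) {
    if std::env::var_os("RUSTUP_NO_NFS_WARNING").is_none() && is_on_nfs(lock_path) {
        eprintln!(
            "warning: {} is on NFS; rustup may hang here if the server's lock \
             service is unreachable (set RUSTUP_NO_NFS_WARNING=1 to silence this)",
            lock_path.display()
        );
    }
}
```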

rbtcollins commented 4 years ago

Further thoughts: we need to mutually exclude operations on toolchains: removal and upgrade affect the entire toolchain, as do operations on components (adding/removing components). I think locking the toolchain's directory is probably the right control point.

metajack commented 4 years ago

We've been running into this on the Libra team as developers using CLion or IDEA manage to invoke rustup concurrently accidentally quite often when we check in toolchain file updates. This results in a borked toolchain install that is missing rustc, but fixes itself after a manual run of rustup toolchain install $TOOLCHAIN.

@rbtcollins Are you working on a PR to fix this now? If so, that would be awesome.

rbtcollins commented 4 years ago

Not actively working on a PR, but it is on my radar.

rbtcollins commented 4 years ago

Ok, so here's a bit of a specification, I think this ties together all the various bits involved; the full text is now captured in the issue description above, from "Locking in Rustup" onwards.


kinnison commented 4 years ago

One aspect of rustup's data which needs locking and isn't on that list is the configuration (settings.toml) which can be altered by a number of rustup commands such as rustup default XXX or rustup override ...

ckaran commented 2 years ago

Is this still an issue? What about doing all work in the system's temp directory, and atomically renaming into the right directory afterwards?

There are two reasons I'm thinking about this solution. First, the system should handle cleaning up the temporary directory automatically. That means that if rustup crashes or does something else weird dribbling files around in the equivalent of /tmp, they will all be deleted by the system at some point in the future.

Second, IIRC file and directory renames are atomic in POSIX-compliant filesystems (maybe others as well? This needs to be verified). So, much like doing a compare and swap operation on an atomic pointer, only the first process that succeeds has any effect. Other processes/threads will get an error back saying that they failed the rename. This deals with many of the issues that locking is intended to solve.
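
For concreteness, a sketch of that staging-plus-rename approach (paths are illustrative). One caveat worth noting: fs::rename() is only atomic when source and destination are on the same filesystem, so staging next to the final location inside ~/.rustup is safer than staging in /tmp.

```rust
use std::fs;

/// Unpack into a staging directory, then atomically move it into place.
fn install_toolchain(staging: &str, dest: &str) -> std::io::Result<()> {
    // ... download, verify, and unpack the toolchain into `staging` ...
    fs::rename(staging, dest) // either the full toolchain appears, or nothing does
}

fn main() -> std::io::Result<()> {
    install_toolchain(
        "/home/me/.rustup/tmp/stable-x86_64-unknown-linux-gnu.partial",
        "/home/me/.rustup/toolchains/stable-x86_64-unknown-linux-gnu",
    )
}
```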

All that said, replacing a toolchain may be more difficult because (IIRC) renaming the path to an executable has no effect on a process that is already executing it. This will lead to a TOCTOU where a user starts using an old executable, the new one is atomically swapped in to replace the old one, and the old one looks for resources within the new directory structure. E.g., someone does rustup doc and rustup update concurrently, with the switching between processes being just awful enough that the old version of rustup doc ends up searching within the new directory structure. I have no good answer for this.

rbtcollins commented 2 years ago

Yes it is. We already use atomic operations to the extent the filesystem in use can deliver them, but we don't build a lock out of them. Without a lock, the range of operations we perform leads to inconsistencies; further to that there is a performance issue and, as you say, changing state underneath running processes is unkind.


ckaran commented 2 years ago

@rbtcollins OK. Just to clarify my own understanding of the issue as @matklad described it, if the update crashes while it is happening, it could leave the system in an indeterminate state, which is enough of a problem that we want some method of avoiding it completely, correct? If that is true, below is a first stab at solving the two problems. Every process is required to follow the steps below, or everything fails.

  1. Create a new temporary file in /tmp. My current understanding is that this is not guaranteed to be atomic, but that doesn't matter; the only requirement is that each process tries to create this file.
  2. Each process then tries to atomically rename its own temporary file to some well-known name, e.g. /tmp/rustup_lock_file. Bonus points if the contents of the file include a short readme so that anyone wondering what this file is can get an idea as to what it's for. It may even contain a URL pointing to this issue, or something similar.
  3. If the rename fails, it means that either some other running process owns the lock, or the other process crashed while holding the lock. In most cases it will mean the former, and as long as everything is proceeding as expected nobody will look at this file; it will either be cleared at the end of the second phase (see below), or on the next reboot of the machine, which is when the system will reset the contents of /tmp.
  4. If the rename succeeds, then that process now owns the lock. It now starts the second stage actions.
    1. It looks for a file within ~/.rustup named state_machine. If this file does not exist, then the last process to do any work succeeded, and the system is in a clean state. Since the process owns the lock, it can now safely create the state_machine file, and update its contents with status messages, state information, etc., etc., etc. safely. Once all work is done, the process first deletes the state_machine file, and then deletes the /tmp/rustup_lock_file, confirming that the former is truly committed before executing the latter.
    2. If the file does exist, it means that the last process to be doing updates failed somehow, leaving the directory in an unclean state. The state written within the file could be used to indicate what steps succeeded, and what steps need to be rolled back or finished, or whatever. Regardless, cleanup steps are initiated.

There are a few advantages to the above algorithm:

The one big disadvantage is that this is not a truly transactional algorithm. I don't yet know how to do that, but maybe some of the database literature would help with that?
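
As a concrete sketch of steps 2-4 (file names are illustrative): the usual primitive for "only the first process succeeds" on POSIX is exclusive creation (O_CREAT|O_EXCL, create_new() in Rust), since rename() onto an existing name silently replaces it rather than failing.

```rust
use std::fs::{self, OpenOptions};
use std::io::Write;

fn main() -> std::io::Result<()> {
    // Steps 2-3: whoever creates the lock file first owns the lock; everyone else backs off.
    let mut lock = match OpenOptions::new()
        .write(true)
        .create_new(true)
        .open("/tmp/rustup_lock_file")
    {
        Ok(f) => f,
        Err(_) => {
            eprintln!("another rustup invocation holds the lock (or crashed while holding it)");
            return Ok(());
        }
    };
    writeln!(lock, "held by pid {}; see rust-lang/rustup#988", std::process::id())?;

    // Step 4: a leftover state_machine file means the previous owner died mid-update.
    let state = "/home/me/.rustup/state_machine";
    if fs::metadata(state).is_ok() {
        // ... inspect `state` and roll back or finish the interrupted work ...
    }
    fs::write(state, "installing nightly: step 1")?;
    // ... do the real work, updating `state` as it progresses ...
    fs::remove_file(state)?; // commit: the tree is consistent again
    fs::remove_file("/tmp/rustup_lock_file") // then release the lock
}
```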

Xuanwo commented 11 months ago

For those encountering a similar problem or directed to this issue, here's a quick workaround from https://github.com/rust-lang/rustup/issues/3530:

Ensure the toolchain given in rust-toolchain is set up by running cargo version before using other tools like maturin.

duaneking commented 4 months ago

Just found this issue again. The workaround seemed to be to close all Visual Studio Code instances before doing the rustup update, and that worked; unfortunately it does look like Visual Studio Code keeps these files locked even when its windows are not active or in use.

rami3l commented 2 months ago

After thinking about this issue for quite a while, I have formed the following early ideas around a hypothetical file-based global RwLock:

The annoying thing here is that we also have to consider the subprocesses we're creating (since the critical sections could be nested), so:

... and it looks like if we're making a re-entrant lock, the uniqueness of its ownership should be based on the root of the process' subtree (up to a certain Rustup/proxy). This info might be conveniently passed on via a new env variable and/or via the lock file itself. Something like C#'s "named mutex" is close to what I'm looking for, which happens to be implemented using lock files on Unix.

All these points rely on the fact that a proxy cannot implicitly modify the current installation (so can be safely granted read-only access), which has fortunately been ensured by #3635.

Further down the line, more "Rustup mode" operations could be made to acquire read locks by default instead, and these locks might better be managed on a per-toolchain basis.

I'm actively trying to unblock this issue, although I'm not sure about the correctness of this reasoning... I'm trying this out in a local repo... Any input is welcome. 🤔


Update: @weihanglo suggests that such a RwLock already exists in Cargo (or more precisely, a variation of it: Cargo holds two reentrant file-based locks to form a custom 3-mode global RwLock), so that should be a solid starting point.

cc @rbtcollins what was the concern in https://github.com/rust-lang/rustup/issues/988#issuecomment-582285054 that made you think Cargo's lock has semantics different enough from what we'd expect?

rami3l commented 1 month ago

Regarding https://github.com/rust-lang/rustup/issues/988#issuecomment-2340369437, I've made a working Unix PoC for a mocked rustup/cargo/rust-analyzer scenario, using a combination of the execvp(), fcntl() and open() syscalls on a marker file.

The outcome has been positive overall: rustup commands, for example, when trying to acquire a write lock, would be blocked by rust-analyzer's read lock.

The two key points mentioned in the previous comment seem to have been implemented, namely the inheritance of locks from parent process to child, and the upgrade of a read lock to a write lock on rustup commands, thanks to the fact that a file descriptor from open() stays open across exec*() if CLOEXEC is not set (note: std::fs unfortunately sets CLOEXEC by default). The currently open FD is passed to the subprocess via an extra env variable, where its ownership is recovered by File::from_raw_fd(). This seems alright in terms of handle ownership, but I'm not quite sure if this is acceptable security-wise.
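
Roughly, the mechanism described above looks like this on the proxy side (a Unix-only sketch; the marker file path and the RUSTUP_LOCK_FD variable are illustrative names, not the PoC's actual ones):

```rust
use std::fs::OpenOptions;
use std::os::fd::AsRawFd;
use std::os::unix::process::CommandExt; // for Command::exec()
use std::process::Command;

/// Take a whole-file fcntl() lock, blocking until it is granted (F_SETLKW).
fn fcntl_lock(fd: i32, lock_type: i16) -> std::io::Result<()> {
    let mut fl: libc::flock = unsafe { std::mem::zeroed() };
    fl.l_type = lock_type;
    fl.l_whence = libc::SEEK_SET as i16; // l_start = l_len = 0 covers the whole file
    if unsafe { libc::fcntl(fd, libc::F_SETLKW, &fl as *const libc::flock) } == -1 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let marker = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .open("/home/me/.rustup/lock-marker")?;
    let fd = marker.as_raw_fd();
    // std::fs opens with O_CLOEXEC, so clear it: the fd (and the lock) must survive exec().
    unsafe { libc::fcntl(fd, libc::F_SETFD, 0) };
    fcntl_lock(fd, libc::F_RDLCK as i16)?; // a proxy only needs the shared (read) lock

    // Replace this process with the proxied tool; fcntl locks are preserved across exec(),
    // and the child can recover the handle from the inherited fd number.
    let err = Command::new("cargo")
        .arg("build")
        .env("RUSTUP_LOCK_FD", fd.to_string())
        .exec();
    Err(err) // exec() only returns on failure
}
```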

However, this comes with a notable limitation: since these properties (lock inheritance and lock upgrade) are limited to subprocesses launched using exec*(), this implementation might deadlock any proxy trying to run rustup commands (in incompatible ways) that require a write lock, e.g. cargo run running a program that uses Command::spawn() to run rustup toolchain install. Fortunately, thanks to the shared nature of the read lock, other use cases (including proxies) seem to work without issues, regardless of the nested critical sections.

The lock is currently not reentrant in-process, but when integrating the solution into rustup, it should be relatively easy to adapt cargo's approach to make it so.

Next up, I'll try to clean up the PoC and make it public for reference.

There are still many questions to be answered, to name a few:

ChrisDenton commented 1 month ago

I think Windows is a bit simpler here. On Windows the rustup process stays alive for the duration of the proxied binary no?

Locking a file can be done either by setting the share_mode or by using LockFileEx.

A third option would be to craft an rwlock from interprocess synchronization objects. However this may be trickier to get right.
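
A minimal sketch of the first option (Windows-only; the lock file path is illustrative): opening with share_mode(0) denies all sharing, so any concurrent attempt to open the same file fails for as long as the handle is held.

```rust
use std::fs::OpenOptions;
use std::os::windows::fs::OpenOptionsExt;

fn main() -> std::io::Result<()> {
    let _guard = OpenOptions::new()
        .read(true)
        .write(true)
        .create(true)
        .share_mode(0) // no FILE_SHARE_READ / _WRITE / _DELETE: we are the only opener
        .open(r"C:\Users\me\.rustup\lock")?;

    // ... do the work that needs exclusive access to the rustup home ...

    Ok(()) // the exclusion ends when `_guard` (the handle) is dropped
}
```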

rami3l commented 1 month ago

I think Windows is a bit simpler here. On Windows the rustup process stays alive for the duration of the proxied binary no?

@ChrisDenton That's where I got stuck.

It seems to me that the previous Unix example where someone uses cargo run to run rustup toolchain install with Command::spawn() also applies to Windows.

In this case, cargo launches rustup as a subprocess, and the two processes coexist. cargo acquires the read lock first, so rustup has no simple way to get a write lock unless it somehow has a way to inform cargo to abandon the lock temporarily when the read lock is in place.

OTOH, if on Unix we use Command::exec() (calling execvp() underneath), rustup will replace cargo, inheriting the latter's read lock, which can be easily upgraded to a write lock later on. On Windows, such an escape hatch doesn't exist, meaning proxies have no good ways of calling rustup.

Of course, we can admit the limitation and move on (maybe by supporting something like RUSTUP_UNSAFE_NO_LOCK=1), but whether we should do this remains debatable. I'm just pointing this thing out.

ChrisDenton commented 1 month ago

The flow is something like rustup -> cargo -> rustup, right? So there just needs to be a way to communicate back to the "root" rustup (and indeed to know if an ancestor rustup exists in the first place).

Though an issue I can see is that LockFileEx doesn't support upgrading locks. You'd have to release and reacquire, which would allow a window of opportunity for someone else to grab a write lock. Maybe that can be mitigated though.

rami3l commented 1 month ago

Oh wait, there's an execvp() function in Windows CRT, although I believe what it truly does is making the parent process quit immediately after launching the subprocess. It does help with unlocking though. I'm not sure if it fits our use case here.

If it does, I still think my approach would be acceptable regardless of the current limitation, especially when the RwLock becomes per-toolchain (when rustup is called to modify the current toolchain despite the fact that cargo is running on the same toolchain, something's gone clearly wrong).

ChrisDenton commented 1 month ago

Oh wait, there's an execvp() function in Windows CRT, although I believe what it truly does is making the parent process quit immediately after launching the subprocess. It does help with unlocking though.

Yes, though that would break anything that waits on the rustup process, such as the console.

If it does, I still think my approach would be acceptable regardless of the current limitation, especially when the RwLock becomes per-toolchain (when rustup is called to modify the current toolchain despite the fact that cargo is running on the same toolchain, something's gone clearly wrong).

Fair enough! 👍

rami3l commented 1 month ago

Oh wait, there's an execvp() function in Windows CRT, although I believe what it truly does is making the parent process quit immediately after launching the subprocess. It does help with unlocking though.

Yes, though that would break anything that waits on the rustup process, such as the console.

@ChrisDenton Thanks for the info.

Given that, I guess it might make some sense to go back and redesign a lock manager daemon of some sort, as suggested by @rbtcollins, instead of communicating with the parent Rustup instance directly:

my thinking - not well written up - is that a persistent server process would make this locking a lot easier for a local workstation, at the cost of not working on a network drive. There are fancy mechanisms to do locks with just atomic fs operations but really, in 2024 networked setups are rarely developer environments, much more likely to be buildfarms where a single head node is all that needs to be configured.

A persistent server process could (for instance) start implicitly per-toolchain, shut down after 10 seconds of inactivity, and, while running, broker locks and logs and interact with vscode etc. But it's a bit of a bigger concept and I know some folk are highly allergic to the very idea. (Though I bet they still use a web browser and other tools that have a similar background task) https://discord.com/channels/442252698964721669/463480252723888159/1285284728341266452

This would also help ensure cross-platform compatibility without having to deal with platform specifics as I currently do.