
Current handling of Unix close() can lead to silent data loss #98338

Open jgoerzen opened 2 years ago

jgoerzen commented 2 years ago

While working on #98209, I went searching for where close() is called. I found it in std/src/os/fd/owned.rs:

impl Drop for OwnedFd {
    #[inline]
    fn drop(&mut self) {
        unsafe {
            // Note that errors are ignored when closing a file descriptor. The
            // reason for this is that if an error occurs we don't actually know if
            // the file descriptor was closed or not, and if we retried (for
            // something like EINTR), we might close another valid file descriptor
            // opened after we closed ours.
            let _ = libc::close(self.fd);
        }
    }
}

The Linux close(2) manpage states:

       A careful programmer will check the return value of close(), since
       it is quite possible that errors on a previous write(2) operation
       are reported only on the final close() that releases the open file
       description.  Failing to check the return value when closing a
       file may lead to silent loss of data.  This can especially be
       observed with NFS and with disk quota.

       Note, however, that a failure return should be used only for
       diagnostic purposes (i.e., a warning to the application that there
       may still be I/O pending or there may have been failed I/O) or
       remedial purposes (e.g., writing the file once more or creating a
       backup).

       Retrying the close() after a failure return is the wrong thing to
       do, since this may cause a reused file descriptor from another
       thread to be closed.

Anyhow, checking the return value of close(2) is necessary in a number of cases. But here we have a conundrum, because what can we possibly do with a failure of close(2) inside a Drop impl?

  1. Ignore it as now
  2. Panic
  3. More complex options (eg, allow the caller to register a close-failure callback with the underlying File/whatever object)

These all have their pros and cons. But aren't we looking for something more like this?

fn close(self) -> io::Result<()>

In fact, a Close trait could define this function and it could be implemented on files, pipes, sockets, etc.
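A minimal sketch of what such a trait might look like (hypothetical; nothing like it exists in std today):

use std::io;

/// Hypothetical trait from the proposal above. Taking `self` by value
/// means the handle cannot be used, or closed again, after the call.
pub trait Close {
    fn close(self) -> io::Result<()>;
}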

Meta

rustc --version --verbose:

rustc 1.56.1 (59eed8a2a 2021-11-01)
binary: rustc
commit-hash: 59eed8a2aac0230a8b53e89d4e99d55912ba6b35
commit-date: 2021-11-01
host: x86_64-unknown-linux-gnu
release: 1.56.1
LLVM version: 13.0.0

the8472 commented 2 years ago

If you need durability (assuming we're talking about regular files here), then relying on close is the wrong thing, because data may only make it to the page cache and later encounter an error during writeback. You should rely on fsync instead.

// Reconstitute a File from the raw fd, then fsync to force writeback.
let file = File::from(owned_fd);
file.sync_all()?;

Of course that's expensive since it forces the OS to do immediate writeback, so it should only be done when durability is critical.

ChrisDenton commented 2 years ago

I think it makes sense to have a close method for OwnedFd, OwnedHandle, etc., if only for diagnostic purposes. There's not much you can do with the error if you get into that situation, other than logging it or maybe restarting the whole operation. But maybe it can help diagnose a problem.

The API proposal is just drop, but without swallowing the error, right?

impl OwnedFd {
    fn close(self) -> io::Result<()>;
}
jgoerzen commented 2 years ago

@the8472 close(2) on Linux even states:

       A successful close does not guarantee that the data has been
       successfully saved to disk, as the kernel uses the buffer cache
       to defer writes.  Typically, filesystems do not flush buffers
       when a file is closed.  If you need to be sure that the data is
       physically stored on the underlying disk, use fsync(2).  (It will
       depend on the disk hardware at this point.)

But it also includes the comment I quoted above about the importance of still checking the return value of close().

@ChrisDenton A scenario in which an error here may be relevant could be this:

Let's say you are processing items from a queue of some sort. If processing is successful, you remove the item from the queue. If not, it remains on the queue. An error from close would be a reason to leave the item on the queue, or to signal failure to the network client, etc.
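A rough sketch of that scenario as it can be written today. sync_all() is not the same check as close(), but it is the closest stable stand-in; the function and its name are made up for illustration:

use std::fs::File;
use std::io::{self, Write};
use std::path::Path;

// Hypothetical queue worker: the caller removes the item from the
// queue only if this returns Ok, i.e. only if every I/O step
// succeeded, including flushing the data out of the page cache.
fn process_item(path: &Path, item: &[u8]) -> io::Result<()> {
    let mut file = File::create(path)?;
    file.write_all(item)?;
    file.sync_all()?; // surfaces errors a silently-ignored close would hide
    Ok(())
}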

I don't know the particulars of the NFS issue, but I can imagine them. I also imagine various FUSE filesystems (eg, s3fs) do the bulk of their work only when close is called.

Another scenario I could imagine would involve pipes. If a prior write() put data into the pipe buffer, but the process on the other end of the pipe crashed before reading it, the only way to detect this would be via SIGPIPE (which is ignored by default in Rust; see #97889) or, I would imagine, by the return value of close(). (I have not personally verified this.)

@ChrisDenton Yes, you are correct about my proposal. It should consume the OwnedFd, because, as the close(2) manpage notes, you really shouldn't attempt to close it again, even on error. (One asterisk: on certain Unices, such as HP-UX, you SHOULD close again after EINTR, but this is ambiguous in POSIX and is not the case on Linux at least; I'd argue that if somebody ports Rust to one of those OSes, EINTR handling should be internal to fn close there.)

I would also like to see it bubbled up to the higher-level interfaces (File, sockets, etc). This shouldn't be difficult to use in cross-platform code, I'd hope. I don't know if Windows has the same close() semantics, but even if on Windows it is identical to drop, having an interface available would be helpful.

The one trick I'd see would be preventing a double-close, i.e. making sure Drop doesn't close the fd a second time after an explicit call to fn close.
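One plausible way to implement that (a sketch using the libc crate, not an actual std API): have fn close take the descriptor by value and release the raw fd with into_raw_fd(), so Drop never runs and only one close(2) call can ever happen:

use std::io;
use std::os::fd::{IntoRawFd, OwnedFd};

// Sketch: a consuming close that defuses OwnedFd's Drop impl,
// making a double-close impossible.
fn close(fd: OwnedFd) -> io::Result<()> {
    let raw = fd.into_raw_fd(); // releases ownership without running Drop
    // SAFETY: we hold the only copy of `raw`, just released above.
    if unsafe { libc::close(raw) } == 0 {
        Ok(())
    } else {
        Err(io::Error::last_os_error())
    }
}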

the8472 commented 2 years ago

But it also includes the comment I quoted above about the importance of still checking the return value of close().

My understanding is that fsync subsumes all that as long as the system defines _POSIX_SYNCHRONIZED_IO, which is the case on all major unix systems. E.g. postgres heavily relies on fsync for ACID guarantees, not close.

Looking at the return value can only ever be a diagnostic thing. Unless you're opening a file with O_SYNC or O_DIRECT close doesn't guarantee anything due to writeback.

I don't know the particulars of the NFS issue, but I can imagine them. I also imagine various FUSE filesystems (eg, s3fs) do the bulk of their work only when close is called.

NFS supports fsync. Anything that ignores fsync is not a durable filesystem and you shouldn't entrust critical data to it.

Another scenario I could imagine would involve pipes. If a prior write() put data into the pipe buffer, but the process on the other end of the pipe crashed before reading it, the only way to detect this would be via SIGPIPE (which is ignored by default in Rust; see https://github.com/rust-lang/rust/issues/97889) or, I would imagine, by the return value of close(). (I have not personally verified this.)

Close has no special treatment for pipes. Any data still stuck in a pipe is lost.

sunfishcode commented 2 years ago

I also imagine various FUSE filesystems (eg, s3fs) do the bulk of their work only when close is called.

FUSE for its part documents that filesystems should avoid doing this.

jgoerzen commented 2 years ago

@the8472 and @sunfishcode I would like to understand -

1) Are you folks opposed to having an explicit close() that returns an error?

2) Or are you fine with that, but dislike my justification?

If 1, how do you reconcile that with the strong admonishment in close(2) that "failing to check the return value when closing a file may lead to silent loss of data"?

If 2, I'd like to hear more.

Over at https://stackoverflow.com/questions/24477740/what-are-the-reasons-to-check-for-error-on-close there is a conversation about this which I found interesting. Among the points are:

"Consider the inverse of your question: "Under what situations can we guarantee that close will succeed?" The answer is: when you call it correctly, and when you know that the file system the file is on does not return errors from close in this OS and Kernel version"

It is this second condition that has me particularly troubled. On my Debian Linux box, there are 58 in-kernel filesystem modules and dozens more implemented via FUSE. Some of them (eg, ext4) are specifically designed for POSIX semantics. Some of them (eg, cifs, exfat) were not. I use sshfs and NTFS via FUSE on a regular basis and who knows about them; I am certain they are not fully POSIX compliant. Even NFS has never been fully compliant with POSIX semantics, only getting somewhat better in more recent years.

I'm not arguing for everything to be perfect here; in a strict sense, if one wants perfect ACID, one would not just fsync, but also check the return values of fsync and close, and fsync the parent directory after a rename. Except on macOS, where fsync doesn't actually force data to durable storage (that takes fcntl with F_FULLFSYNC), unlike Linux, where fsync explicitly does.
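For reference, the careful sequence described above looks roughly like this on Linux (a sketch; the function name is illustrative, and macOS would additionally need the F_FULLFSYNC fcntl):

use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

// Sketch of the POSIX careful-write pattern: write a temp file, fsync
// it, rename it into place, then fsync the parent directory so the
// rename itself is durable.
fn atomic_write(dir: &Path, name: &str, data: &[u8]) -> io::Result<()> {
    let tmp = dir.join(format!("{}.tmp", name));
    let mut f = File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?; // check fsync's result, not just close's
    fs::rename(&tmp, dir.join(name))?;
    File::open(dir)?.sync_all()?; // fsync the parent directory
    Ok(())
}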

I just want the tools to let me be a careful programmer wherever possible, that's all.

jgoerzen commented 2 years ago

This is probably a separate issue, but an incidental remark: I don't see any way to fsync a directory in std, either. File::open() won't open a directory (perhaps via O_DIRECTORY in custom_flags, but there is nothing in std that exposes O_DIRECTORY), and read_dir doesn't expose the fd for syncing either.

sunfishcode commented 2 years ago

@the8472 and @sunfishcode I would like to understand -

1. Are you folks opposed to having an explicit close() that returns an error?

I myself am.

If 1, how do you reconcile that with the strong admonishment in close(2) that "failing to check the return value when closing a file may lead to silent loss of data"?

Ignoring errors from close is so pervasive that filesystem developers that care about robustness are practically expected to make sure close never fails.

Rust has ignored errors from close since 1.0, which is a precedent across the ecosystem. If we change that, it will create a new expectation for almost all Rust libraries and utilities that write files. It's tempting to say "you don't need to check if your use case doesn't need it", but libraries and utilities usually don't pick the filesystems they run on, or how data is used outside their scope.

If we approach this issue as "Do we care about this kind of data loss?", it's difficult to justify saying no. But we can instead ask "Who is responsible for this kind of data loss?", and there are multiple options:

Sometimes fixing existing code to correctly handle errors from close is easy, but sometimes it's costly. Either way, the new code paths will be rarely exercised and difficult to test. There is an ecosystem-wide cost for adding this expectation.

To my knowledge, the main filesystems where close does fail in practice are networked filesystems which postpone reporting errors in order to go faster (NFS calls this "async"), and FUSE filesystems which ignore the documentation. End users using NFS and caring about the data being written to it should probably already be actively monitoring for free space and other conditions, because of the amount of existing code out there that doesn't check for these errors. FUSE authors that care about protecting their users' data should already be following the documentation for flush.

If NFS were as common as it once was, things might look different, but in practice, POSIX filesystem APIs, with all the guarantees and common assumptions, aren't a great fit for networks. Instead of exposing the async nature of networks to applications via mechanisms that could map to Rust's async, NFS' "async" mode tries to hide everything behind the illusion of a synchronous API. Filesystem APIs can be convenient, but they have significant costs.

Neither option is without downsides. I don't have data, but I assume the costs of passing on the responsibility for errors from close to individual developers throughout the Rust ecosystem would be greater than the cost of stating that this is not the ecosystem's responsibility, and passing the responsibility onto end users that choose to use NFS, and FUSE filesystem authors, who arguably already have this responsibility.

As a separate consideration, stdout can be redirected to a file, and all the same data-loss scenarios are possible when writing to a file via stdout. If we add an expectation to Rust that code check for errors from close after writing to files, we should add a way to close stdout and check for errors from that too. But that's tricky, because Stdout::lock returns a &'static, which effectively guarantees that stdout is always open.

I just want the tools to let me be a careful programmer wherever possible, that's all.

This is why it's useful to make these decisions at a broad level, like Rust (or ideally higher, but one step at a time), so that we can collectively agree on what should be expected of individual careful programmers.

I also aspire to be a careful programmer, and I once added logic to LLVM to close stdout and check for errors, because LLVM is often used to write to stdout with redirection to files. I too read the close man page and believed it was my responsibility. That code lived in the LLVM codebase for years, but only with me defending it against a long series of people hitting problems with it and wanting to remove it. I eventually stopped defending it, and it has since been removed. It was just too much trouble to be worth the cost.

the8472 commented 2 years ago

Are you folks opposed to having an explicit close() that returns an error? Or are you fine with that, but dislike my justification?

I'm ambivalent about adding the API. Generally it does not make sense to check for errors on close. But I can see some niche uses: when hunting for errors, or when you have a policy of failing as loudly and explosively as possible, it might make sense to print an error and then abort when a close fails (what else are you going to do? there likely is no sane path to recovering from such an error).

But I do dislike the justification. Checking the return value of close is not what you do when you care about your data making it to storage intact. As I said earlier, fsync is the go-to mechanism here. To me it would be purely a diagnostic signal that something has gone horribly wrong.

File::open() won't open a directory

But it does? (playground)
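Presumably something along these lines (a minimal sketch of what the playground link showed):

use std::fs::File;
use std::io;
use std::path::Path;

// On Linux, File::open on a directory succeeds (it is a plain O_RDONLY
// open), and sync_all then issues fsync(2) on the directory fd.
fn fsync_dir(path: &Path) -> io::Result<()> {
    let dir = File::open(path)?;
    dir.sync_all()
}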

Some of them (eg, ext4) are specifically designed for POSIX semantics. Some of them (eg, cifs, exfat) were not. I use sshfs and NTFS via FUSE on a regular basis and who knows about them; I am certain they are not fully POSIX compliant. Even NFS has never been fully compliant with POSIX semantics, only getting somewhat better in more recent years.

Sure but would you run a production database on a dodgy USB stick formatted with exfat?

jgoerzen commented 2 years ago

Thank you for these thoughtful comments.

I definitely do not want to create a precedent to require libraries to check close(). You are right that this would be unworkable. Already I wouldn't necessarily trust libraries with writing (are they doing the POSIX-safe way to do atomic writes, with fsyncs and renames? Almost uniformly not, but then you may not always want that).

I want the option to be able to check it myself in a safe way, that's all.

There are a lot of machines out there running on filesystems that I wouldn't count on to apply these guarantees, or that have other errors that may be signaled at close. FreeBSD can return ECONNRESET, for instance. An entire class of devices (Raspberry Pis) typically runs on dodgy micro SDs. A particular bane of my existence for years has been VMs (and even hardware) that fail to propagate fsync. I /frequently/ work with software that has to deal with USB sticks formatted with exfat or NTFS (or, heck, vfat). While NFS no doubt isn't as prolific as it once was, it is still with us, and now we have a bunch of others in that range: afs, coda, glusterfs, ocfs2, ceph, cifs, davfs, sshfs, s3fs, rclone -- to name a few. We have a /proliferation/ of filesystems or filesystem-like things, actually. It is an explicit design goal of the program I was working on when I created this issue to be maximally compatible with filesystems that differ from POSIX semantics in significant ways; for instance, s3fs, in which a newly created file may not be openable for reading right away due to eventual consistency.

I care about data integrity. I am realistic that checking the return value from close is not itself sufficient to guarantee it. I don't believe a perfect guarantee exists, in fact, because hardware isn't perfect. But I want to take every step I reasonably can, and which documentation has shown I ought to, along that path -- in part because I realize I am dealing with filesystems with a lot more failure modes than ext4 on a local SSD.

I wouldn't try to run PostgreSQL atop S3. But for Filespooler, written in Rust, this is an explicit design goal and documented practice. Although written in Go, a similar tool, NNCP, is something I use for getting data to airgapped backup & archival systems.

the8472 commented 2 years ago

There seem to be some separate scenarios here.

I definitely do not want to create a precedent to require libraries to check close(). You are right that this would be unworkable.

Ok. Then imo it should be documented in a way that redirects users to sync_all if they want durability and IO-error detection, and that basically discourages relying on close's behavior, since it is non-portable and FS-dependent.

tbu- commented 1 year ago

Prior discussions, sorted from oldest to newest:

workingjubilee commented 6 months ago

Rust has ignored errors from close since 1.0, which is a precedent across the ecosystem. If we change that, it will create a new expectation for almost all Rust libraries and utilities that write files. It's tempting to say "you don't need to check if your use case doesn't need it", but libraries and utilities usually don't pick the filesystems they run on, or how data is used outside their scope.

Apparently we actually have been debug-asserting on this since 1.0, and the only reason it hasn't come up is the precompiled stdlib thing.

the8472 commented 6 months ago

The new issue is about closing ReadDir, not File or OwnedFd; they're different impls. But as noted there, we probably should assert on EBADF instead of ignoring errors, since it indicates a likely io-safety violation.
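A sketch of what that could look like in a libc-based close path (the function name is made up; this is not the actual std code):

use std::os::fd::{IntoRawFd, OwnedFd};

// Sketch: keep ignoring ordinary close() errors, but debug-assert on
// EBADF, since EBADF means the fd was already closed elsewhere, which
// is a probable io-safety violation rather than an I/O failure.
fn close_ignoring_io_errors(fd: OwnedFd) {
    let raw = fd.into_raw_fd();
    // SAFETY: `raw` was just released from an OwnedFd we owned.
    if unsafe { libc::close(raw) } == -1 {
        debug_assert_ne!(
            std::io::Error::last_os_error().raw_os_error(),
            Some(libc::EBADF),
            "close() returned EBADF: probable io-safety violation"
        );
    }
}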