tokio-rs / tokio


fs: Data loss while retrying `File::flush` when disk is full #6325

Open Xuanwo opened 10 months ago

Xuanwo commented 10 months ago

Version

:) cargo tree | grep tokio
│   │   └── tokio v1.36.0
│   │       └── tokio-macros v2.2.0 (proc-macro)
│   │   │   │   ├── tokio v1.36.0 (*)
│   │   │   │   ├── tokio-util v0.7.10
│   │   │   │   │   ├── tokio v1.36.0 (*)
│   │   │   │   ├── tokio v1.36.0 (*)
│   │   │   │   ├── tokio v1.36.0 (*)
│   │   │   │   └── tokio-rustls v0.24.1
│   │   │   │       └── tokio v1.36.0 (*)
│   │   │   ├── tokio v1.36.0 (*)
│   │   │   ├── tokio-rustls v0.24.1 (*)
│   │   │   ├── tokio-util v0.7.10 (*)
│   ├── tokio v1.36.0 (*)
├── tokio v1.36.0 (*)

Platform

Linux xuanwo-work 6.7.3-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Thu, 01 Feb 2024 10:30:25 +0000 x86_64 GNU/Linux

Description

While addressing https://github.com/apache/opendal/issues/4058, we discovered that retrying File::flush while disk is full could result in data loss.

To reproduce:

  1. Set up a small filesystem:
fallocate -l 512K disk.img
mkfs disk.img

mkdir /tmp/test_dir
sudo mount -o loop disk.img /tmp/test_dir

sudo chmod a+wr /tmp/test_dir

Now we have a filesystem that only has 512K.

  2. Run the following code:
use std::env;
use std::io::Result;

use rand::prelude::*;
use tokio::io::AsyncWriteExt;

#[tokio::main]
async fn main() -> Result<()> {
    tracing_subscriber::fmt::init();

    let path = &env::var("OPENDAL_FS_ROOT").expect("root must be set for this test");

    let mut f = tokio::fs::OpenOptions::new()
        .create(true)
        .write(true)
        .open(format!("{path}/test"))
        .await
        .unwrap();

    let size = thread_rng().gen_range(512 * 1024 + 1..4 * 1024 * 1024);
    let mut bs = vec![0; size];
    thread_rng().fill_bytes(&mut bs);

    f.write(&bs).await.unwrap();

    let res = f.flush().await;
    dbg!(&res);

    // After some operations, we retry the file flush.
    let res = f.flush().await;
    dbg!(&res);

    Ok(())
}

The full example code can be found at https://github.com/apache/opendal/pull/4141. I removed the opendal-related code to make this example more readable.

The output is:

    Finished dev [unoptimized + debuginfo] target(s) in 0.79s
     Running `/home/xuanwo/Code/apache/opendal/core/target/debug/edge_file_close_with_retry_on_full_disk`
[edge/file_close_with_retry_on_full_disk/src/main.rs:48] &res = Err(
    Os {
        code: 28,
        kind: StorageFull,
        message: "No space left on device",
    },
)
[edge/file_close_with_retry_on_full_disk/src/main.rs:52] &res = Ok(
    (),
)

The first time, flush returns StorageFull, which is expected. But the second time, the same flush call returns Ok.

I expected to see a StorageFull error instead.

The key problem here is:

  • Users can't recover from this error, even if they try removing other files. The previous write operation returned Ok.
  • Retrying the flush operation is permitted but risky. Data written may be lost forever once flush returns Ok.

Based on the code here:

https://github.com/tokio-rs/tokio/blob/63caced26f07240fa2751cefccee86cc342d3581/tokio/src/fs/file.rs#L887-L906

Maybe we should:

I'm willing to give it a fix.

carllerche commented 9 months ago

I'm a bit confused. I'm less familiar w/ the details of FS ops.

The key problem here is:

  • Users can't recover from this error, even if they try removing other files. The previous write operation returned Ok.
  • Retrying the flush operation is permitted but risky. Data written may be lost forever once flush returns Ok.

The key problem w/ Tokio's impl or getting an err when calling flush in general?

Maybe you could explain how to handle StorageFull and flush w/ blocking std calls and where converting that blocking code to Tokio's fs api fails.

Xuanwo commented 9 months ago

I believe it's more like an issue of Tokio's implementation, which doesn't pass the write error on to the user.

Maybe you could explain how to handle StorageFull and flush w/ blocking std calls and where converting that blocking code to Tokio's fs api fails.

There is no direct mapping from std::fs::File::flush to tokio::fs::File::flush. std::fs::File::flush on Linux is a no-op, while tokio::fs::File::flush involves its internal buffering logic.


I prepared a full repro here: https://github.com/Xuanwo/tokio-issue-6325-storage-full

let n = f.write(&bs).await?;
dbg!(&n);
assert_eq!(n, size, "tokio file always write data into buffer first");

When we call write on a file, tokio stores the data directly in its internal buffer. After flush returns the write error, the buffer is cleared, and the data is never written again.

The same repro doesn't happen with std::fs, since std::fs returns the correct write size from f.write(): the user gets the error while trying to write more data.
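The failure mode described above can be modeled with a small std-only sketch (the `FullDisk` and `LossyBufWriter` types here are hypothetical stand-ins, not tokio's actual code): a writer that buffers in memory and clears its buffer on a failed flush will report Ok on a retried flush, even though the data never reached the disk.

```rust
use std::io::{self, Write};

// Hypothetical stand-in for a full disk: every write fails with ENOSPC.
struct FullDisk;

impl Write for FullDisk {
    fn write(&mut self, _buf: &[u8]) -> io::Result<usize> {
        Err(io::Error::from_raw_os_error(28)) // 28 = ENOSPC on Linux
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

// Minimal model of the buffering behavior described above: writes land in an
// in-memory buffer, and a failed flush clears the buffer instead of keeping it.
struct LossyBufWriter<W: Write> {
    inner: W,
    buf: Vec<u8>,
}

impl<W: Write> Write for LossyBufWriter<W> {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        self.buf.extend_from_slice(buf);
        Ok(buf.len()) // always "succeeds": the bytes only went into the buffer
    }
    fn flush(&mut self) -> io::Result<()> {
        let res = self.inner.write_all(&self.buf);
        self.buf.clear(); // buffer is dropped even on error -- the data is gone
        res
    }
}

fn main() {
    // Writing straight to the "disk" surfaces the error immediately.
    assert!(FullDisk.write(b"data").is_err());

    // Writing through the lossy buffer reports success...
    let mut w = LossyBufWriter { inner: FullDisk, buf: Vec::new() };
    assert_eq!(w.write(b"data").unwrap(), 4);
    // ...the first flush fails as expected...
    assert!(w.flush().is_err());
    // ...but a retried flush returns Ok: the buffer is empty, so write_all
    // has nothing to send, and the caller never learns the data was lost.
    assert!(w.flush().is_ok());
    println!("retry succeeded; data silently lost");
}
```

This matches the observed output: the first flush yields `StorageFull`, the retried flush yields `Ok(())`.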

Xuanwo commented 9 months ago

While implementing https://github.com/tokio-rs/tokio/pull/6330, I found that tokio clears the buffer when an error happens during write.

https://github.com/tokio-rs/tokio/blob/28d88cc35902a0489e9b0781a47f9f5713a01673/tokio/src/io/blocking.rs#L258-L265

I'm guessing we need to maintain the internal state here instead of dropping all the data?
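A minimal sketch of that alternative (again with hypothetical types, not tokio's actual implementation): keep the buffered bytes when the inner write fails, and only discard them once they actually reach the underlying writer. A retried flush then re-attempts the same write and can succeed once the user frees space.

```rust
use std::io::{self, Write};

// Sketch of the suggested behavior: keep buffered bytes on a failed flush so
// a retry re-attempts the same write instead of silently succeeding.
struct RetryBufWriter<W: Write> {
    inner: W,
    buf: Vec<u8>,
}

impl<W: Write> RetryBufWriter<W> {
    fn flush_buf(&mut self) -> io::Result<()> {
        self.inner.write_all(&self.buf)?; // on error, buf is left intact
        self.buf.clear(); // only discard once the bytes reached `inner`
        Ok(())
    }
}

// A hypothetical "disk" that fails with ENOSPC until space is freed.
struct Disk {
    full: bool,
    data: Vec<u8>,
}

impl Write for Disk {
    fn write(&mut self, buf: &[u8]) -> io::Result<usize> {
        if self.full {
            return Err(io::Error::from_raw_os_error(28)); // ENOSPC on Linux
        }
        self.data.extend_from_slice(buf);
        Ok(buf.len())
    }
    fn flush(&mut self) -> io::Result<()> {
        Ok(())
    }
}

fn main() {
    let mut w = RetryBufWriter {
        inner: Disk { full: true, data: Vec::new() },
        buf: b"hello".to_vec(),
    };
    assert!(w.flush_buf().is_err()); // disk full: error reported, buffer kept
    w.inner.full = false; // the user removes other files, freeing space
    assert!(w.flush_buf().is_ok()); // retry actually writes the data
    assert_eq!(w.inner.data, b"hello");
}
```

With this shape, retrying flush after freeing space recovers the data instead of losing it, which is the recovery path users would expect.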