fs::File reads first 4096 bytes only on Linux v5.9 and v5.10

br0adcast commented 3 years ago

Version └── tokio v1.6.0 └── tokio-macros v1.2.0 (proc-macro)

Platform Raspberry Pi 32-bit Linux raspberrypi 5.10.17-v7l+ #1414 SMP Fri Apr 30 13:20:47 BST 2021 armv7l GNU/Linux

Description

After upgrading tokio from 1.5 to 1.6 only 4096 bytes can be read from the file

I tried this code:

let mut file = tokio::fs::File::open("/some/file").await?;
let mut contents = vec![];
file.read_to_end(&mut contents).await?;
println!("File size = {}", contents.len());

and I get File size = 4096 for any file.

Darksonn commented 3 years ago

That sounds weird. I just tried the code you provided on my local Linux laptop and it seems to work fine. Does this happen on raspberry pi only?

@wmanley Thoughts? Your PR was released in 1.6.0, so it's possible that it is related to this.

Darksonn commented 3 years ago

Does it fail on files shorter than 4096 bytes? Does it fail on files longer than 4096 bytes?

wmanley commented 3 years ago

I've had a look through #3518 and I can't see how it might cause this issue. I'd been looking for a place where we incorrectly return 0 when we're not EOF and I can't see it.

@br0adcast: Could you try reverting 39706b198c6cacfb23544bad8b7c144471ee4efa and see if that fixes it?

wmanley commented 3 years ago

I've tried to reproduce this on my Raspberry Pi with kernel 4.14, but it works fine for me. I've confirmed using strace that it's correctly using preadv2 as well.

carllerche commented 3 years ago

Looks like we need more info to repro.

br0adcast commented 3 years ago

I have not found this problem on the Linux systems I have, including the 32-bit ones i686 and 64-bit ARM. ~~But I found that this problem is reproduced since tokio 1.0 on my Raspberry. @wmanley so i think it is not 39706b1~~

How i reproduce it:

$ dd if=/dev/urandom of=file_450000 bs=1500 count=300
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ ./target/release/fs_test_tokio_1.0 file_450000 
File size = 4096
$ ./target/release/fs_test_tokio_0.2 file_450000 
File size = 450000
$ ./target/release/fs_test_tokio_1.0 file_450000 
File size = 450000

br0adcast commented 3 years ago

I've had a look through #3518 and I can't see how it might cause this issue. I'd been looking for a place where we incorrectly return 0 when we're not EOF and I can't see it.

@br0adcast: Could you try reverting 39706b1 and see if that fixes it?

Yes, reverting 39706b1 fixes the problem.

strace:

futex(0xb6f2f6f0, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
preadv2(9, [{iov_base="\365Dh\200\316\3376n\7n!'2g9\215\363\31\t\0300O\346\2\205(\203G\261~\231d", iov_len=32}], 1, -1, RWF_NOWAIT) = 32
preadv2(9, [{iov_base="\30$\3074\234\277\226\206K\332\265\177\257\263\1\343\312\322\254J\23\376\202=<-\330\251\310_\6K", iov_len=32}], 1, -1, RWF_NOWAIT) = 32
preadv2(9, [{iov_base="\211\27Ax\227\362\2\25\20c\352Ai\256\302}N\244\245\6\253 \335\212G\1\277\5\336\322M\352"..., iov_len=64}], 1, -1, RWF_NOWAIT) = 64
preadv2(9, [{iov_base="\263\364S95\7\341\4\333\354\212U\265\230\\N\346\221\214h\327\275J\0f\215/\342\24/\204Z"..., iov_len=128}], 1, -1, RWF_NOWAIT) = 128
preadv2(9, [{iov_base="ool\33\v\365\16\214P\357\372F\rh\3473\23\351\347\201\213\336D\235\310i\257\3442\355\303\332"..., iov_len=256}], 1, -1, RWF_NOWAIT) = 256
preadv2(9, [{iov_base="}\370M\6[\240\260\250\376\312Nm\246\217\233u\275\242\201\316\341\22z\\\235A\366*\207\255\n\251"..., iov_len=512}], 1, -1, RWF_NOWAIT) = 512
preadv2(9, [{iov_base="[\23\210*\314\256\356u\246\260\246\242I\356\27\211O{,i\242\25\312\352\257\264\320\277\2255\305\352"..., iov_len=1024}], 1, -1, RWF_NOWAIT) = 1024
preadv2(9, [{iov_base="AAi\177\257:G!\225\362\3\2742t\256\303\363\178\277\266\f(\366\246\212\305\236qF/\350"..., iov_len=2048}], 1, -1, RWF_NOWAIT) = 2048
preadv2(9, [{iov_base="", iov_len=4096}], 1, -1, RWF_NOWAIT) = 0
write(1, "File size = 4096\n", 17File size = 4096
)      = 17
close(9)

wmanley commented 3 years ago

Hmm, seems like a kernel bug. What a shame - I guess #3518 will have to be reverted. What filesystem is the file on?

br0adcast commented 3 years ago

Hmm, seems like a kernel bug. What a shame - I guess #3518 will have to be reverted. What filesystem is the file on?

ext4

wmanley commented 3 years ago

@br0adcast: Would you consider bisecting the kernel to find out when this broke?

br0adcast commented 3 years ago

@wmanley: The problem occurs after upgrading the kernel from 5.4.81 to 5.10.0 and higher. At least, I have found it in the following kernels:

Linux raspberrypi 5.10.0-v7l+ #1382 SMP Tue Dec 15 18:23:34 GMT 2020 armv7l GNU/Linux
Linux raspberrypi 5.10.17-v7l+ #1414 SMP Fri Apr 30 13:20:47 BST 2021 armv7l GNU/Linux
Linux raspberrypi 5.10.36-v7l+ #1418 SMP Thu May 13 18:17:17 BST 2021 armv7l GNU/Linux

br0adcast commented 3 years ago

I just reproduced a bug in an i686 virtual machine with Arch Linux 32 + lts kernel Linux 5.10.32-1.0-lts #1 SMP Thu, 22 Apr 2021 07:53:37 +0000 i686 GNU/Linux

wmanley commented 3 years ago

I've tried to reproduce this with qemu by following https://ldpreload.com/blog/git-bisect-run . Code:

use tokio::io::AsyncReadExt;
use std::ffi::CStr;

fn main() -> std::io::Result<()> {

    if std::process::id() == 1 {
        //std::fs::create_dir("/dev")?;
        if unsafe{libc::mount(
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"/dev\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                0, std::ptr::null())} != 0 {
            return Err(std::io::Error::last_os_error());
        };
    }

    let res = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(test());

    match res {
        Ok(true) => println!("Success\n"),
        Ok(false) => println!("Failed\n"),
        Err(e) => println!("Error: {:?}\n", e),
    }

    if std::process::id() == 1 {
        unsafe{libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)};
    }
    Ok(())
}

async fn test() -> std::io::Result<bool> {
    let mut file = tokio::fs::File::open("init").await?;
    let mut contents = vec![];
    file.read_to_end(&mut contents).await?;
    println!("File size = {}", contents.len());

    Ok(contents.len() != 4096)
}

Compile with:

cargo build --target=i686-unknown-linux-musl

And build initrd with:

cp ../tokio-test/target/i686-unknown-linux-musl/debug/tokio-test initrd/init &&
cd initrd &&
echo init | cpio -H newc -o | gzip > initrd.gz

Kernel built with:

git checkout v5.10.32
make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- defconfig && make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- -j12

Run with:

qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz

The test passes. Maybe I need a real filesystem, and not just initrd.

wmanley commented 3 years ago

Yes, it fails with ext4.

Created ext4 partition with file:

dd if=/dev/zero of=myextpart bs=1M count=1000
mkfs.ext4 myextpart
mkdir mnt
sudo mount -o loop myextpart $PWD/mnt
sudo dd if=/dev/urandom of=mnt/myfile bs=1M count=1
sudo umount mnt

Modified the test program to mount this fs and inspect this file:

use tokio::io::AsyncReadExt;
use std::ffi::CStr;

fn main() -> std::io::Result<()> {

    if std::process::id() == 1 {
        //std::fs::create_dir("/dev")?;
        if unsafe{libc::mount(
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"/dev\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
                0, std::ptr::null())} != 0 {
            return Err(std::io::Error::last_os_error());
        };
        std::fs::create_dir("/myext")?;
        if unsafe{libc::mount(
                CStr::from_bytes_with_nul(b"/dev/sda\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"/myext\0").unwrap().as_ptr(),
                CStr::from_bytes_with_nul(b"ext4\0").unwrap().as_ptr(),
                0, std::ptr::null())} != 0 {
            return Err(std::io::Error::last_os_error());
        };
    }

    let res = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap()
        .block_on(test());

    match res {
        Ok(true) => println!("Success\n"),
        Ok(false) => println!("Failed\n"),
        Err(e) => println!("Error: {:?}\n", e),
    }

    if std::process::id() == 1 {
        unsafe{libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)};
    }
    Ok(())
}

async fn test() -> std::io::Result<bool> {
    let mut file = tokio::fs::File::open("/myext/myfile").await?;
    let mut contents = vec![];
    file.read_to_end(&mut contents).await?;
    println!("File size = {}", contents.len());

    Ok(contents.len() != 4096)
}

Run with partition attached as drive:

qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz -drive file=myextpart

Now to bisect...

wmanley commented 3 years ago

Bisecting with script bisect-test.sh:

#!/bin/bash -e

make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- defconfig &&
make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- -j12 &&
qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz -drive file=myextpart >output || exit 125

if grep -q ^Success output; then
    exit 0
fi

if grep -q ^Failed output; then
    exit 1
fi

exit 125

and commands:

git bisect start
git bisect bad v5.10
git bisect good v5.4
git bisect run ./bisect-test.sh

Points to https://github.com/torvalds/linux/commit/efa8480a831673bb52400df9dbe5da0aacda97bf: "fs: RWF_NOWAIT should imply IOCB_NOIO"

If that isn't a smoking gun then I don't know what is.

Hmm, I've just tested the recently released v5.12 and it succeeds there, so maybe it's already been fixed. I'll bisect to find the fix.

wmanley commented 3 years ago

It was fixed in 5.11: https://github.com/torvalds/linux/commit/06c0444290cecf04c89c62e6d448b8461507d247 neither commit message mentions anything about bugs. The fix seems incidental.

I'll also run some tests to see if this is limited to 32-bit systems.

Darksonn commented 3 years ago

So given those commits, which kernel versions are affected? I'm guessing we can add a check for the kernel version on the first call and fall back to the old behavior on kernels with the bug.

br0adcast commented 3 years ago

If the 'preadv' function returns 0, we can try to do a regular read after that, to double check if it is an EOF or a kernel bug. This code in 'preadv2_safe' seems to work:

if 0 == bytes_read {
    Err(std::io::Error::from_raw_os_error(libc::EAGAIN))
} else if bytes_read < 0 {
    Err(std::io::Error::last_os_error())
} else {
    /* preadv2 returns the number of bytes read, e.g. the number of bytes that have
     * written into `unfilled`. So it's safe to assume that the data is now
     * initialised */
    dst.assume_init(bytes_read as usize);
    dst.advance(bytes_read as usize);
    Ok(())
}

wmanley commented 3 years ago

I've just tested and I can confirm that the bug exists on x86-64 too, so it's not just a 32-bit issue.

So given those commits, which kernel versions are affected? I'm guessing we can add a check for the kernel version on the first call and fall back to the old behavior on kernels with the bug.

5.9 and 5.10 are broken according to my testing. Otherwise it works fine. I've only tested mainline linux, not any of the stable branches. I also can't speak to what's in the distro kernels. I'll run the test on a few stable kernels to see if they're affected too.

If the 'preadv' function returns 0, we can try to do a regular read after that, to double check if it is an EOF or a kernel bug.

That would certainly work around the bug, but would incur thread synchronisation overhead that #3518 was intended to avoid.

wmanley commented 3 years ago

Stable kernel testing:

5.6.19: OK 5.7.19: OK 5.8.18: OK 5.9.16: BAD 5.10.38: BAD 5.11.22: OK

So there's nothing on the stable branch that affects this issue.

Regarding checking the kernel version, something like this should do the job:

fn has_buggy_preadv2() -> bool {
    let mut uts : libc::utsname;
    let res = unsafe{libc::uname(&mut uts as *mut utsname)};
    if res < 0 {
        // Getting kernel version failed, assume that it's buggy
        return true;
    }
    uts.release[..4] == b"5.9." || uts.release[..5] == b"5.10." 
}

But I would like to talk to the kernel devs first.

Darksonn commented 3 years ago

Sure, let me know what the kernel devs have to say.

praseodym commented 3 years ago

I'm running into the same problem on my amd64 machine with kernel 5.10.0-6-amd64 (Debian 5.10.28-1) and (edit) using an XFS filesystem. I observed that when the file that is being read is already in the page cache (e.g. newly written files, or test by running md5sum on the file) everything works as expected. However, when the file is not in the page cache (e.g. after dropping the page cache using sync; echo 1 > /proc/sys/vm/drop_caches) only 4096 bytes get read.

wmanley commented 3 years ago

Sure, let me know what the kernel devs have to say.

Here's the email: https://lore.kernel.org/linux-fsdevel/fea8b16d-5a69-40f9-b123-e84dcd6e8f2e@www.fastmail.com/T/#u . There's been no response yet, so I guess we should just work around it. I'll prepare a PR.

carllerche commented 3 years ago

So how are we going to test for kernel versions?

wmanley commented 3 years ago

@br0adcast: Would you mind changing the issue title to "fs::File reads first 4096 bytes only on Linux v5.9 and v5.10". That way it'll be easier for people to find this bug.

cynecx commented 3 years ago

@carllerche by calling uname (syscall)?

EDIT: Oh, I didn't see @Darksonn's PR, which is exactly doing that.

carllerche commented 3 years ago

Sorry, I meant, how are we going to add multiple kernel versions to CI to test for kernel specific regressions in the future :)

Darksonn commented 3 years ago

Just to make sure everyone heard about it: We released 1.6.1 a few days ago, which does not have this bug. The 1.6.0 release was yanked.

xuyang0410 commented 3 years ago

@praseodym wonderful. It doesn't matter with 4096 bytes(I have verified it and use 32,64,128,,, 4096, they all return 0 if I drop cache before preadv2 with RWF_NOWAIT flag). Look kernel commit fs: RWF_NOWAIT should imply IOCB_NOIO, with the change allowing read-ahead for IOCB_NOWAIT, we changed the RWF_NOWAIT semantics of only doing cached reads. So if we use sync or echo 1 to drop cache interface, then we will get 0. I plan to write a regression test to ltp community for this.

sameer commented 6 months ago

Can this be closed now? The commit in question was reverted in 3fbcf1ba5077d31ac59ede386d1ea887b4c22e99

tokio-rs / tokio

fs::File reads first 4096 bytes only on Linux v5.9 and v5.10 #3803