Closed br0adcast closed 6 months ago
That sounds weird. I just tried the code you provided on my local Linux laptop and it seems to work fine. Does this happen on raspberry pi only?
@wmanley Thoughts? Your PR was released in 1.6.0, so it's possible that it is related to this.
Does it fail on files shorter than 4096 bytes? Does it fail on files longer than 4096 bytes?
I've had a look through #3518 and I can't see how it might cause this issue. I'd been looking for a place where we incorrectly return 0
when we're not EOF and I can't see it.
@br0adcast: Could you try reverting 39706b198c6cacfb23544bad8b7c144471ee4efa and see if that fixes it?
I've tried to reproduce this on my Raspberry Pi with kernel 4.14, but it works fine for me. I've confirmed using strace
that it's correctly using preadv2
as well.
Looks like we need more info to repro.
I have not found this problem on the Linux systems I have, including the 32-bit ones i686 and 64-bit ARM. ~~But I found that this problem is reproduced since tokio 1.0 on my Raspberry. @wmanley so i think it is not 39706b1~~
How i reproduce it:
$ dd if=/dev/urandom of=file_450000 bs=1500 count=300
$ echo 1 | sudo tee /proc/sys/vm/drop_caches
$ ./target/release/fs_test_tokio_1.0 file_450000
File size = 4096
$ ./target/release/fs_test_tokio_0.2 file_450000
File size = 450000
$ ./target/release/fs_test_tokio_1.0 file_450000
File size = 450000
I've had a look through #3518 and I can't see how it might cause this issue. I'd been looking for a place where we incorrectly
return 0
when we're not EOF and I can't see it.@br0adcast: Could you try reverting 39706b1 and see if that fixes it?
Yes, reverting 39706b1 fixes the problem.
strace:
futex(0xb6f2f6f0, FUTEX_WAIT_PRIVATE, 1, NULL) = 0
preadv2(9, [{iov_base="\365Dh\200\316\3376n\7n!'2g9\215\363\31\t\0300O\346\2\205(\203G\261~\231d", iov_len=32}], 1, -1, RWF_NOWAIT) = 32
preadv2(9, [{iov_base="\30$\3074\234\277\226\206K\332\265\177\257\263\1\343\312\322\254J\23\376\202=<-\330\251\310_\6K", iov_len=32}], 1, -1, RWF_NOWAIT) = 32
preadv2(9, [{iov_base="\211\27Ax\227\362\2\25\20c\352Ai\256\302}N\244\245\6\253 \335\212G\1\277\5\336\322M\352"..., iov_len=64}], 1, -1, RWF_NOWAIT) = 64
preadv2(9, [{iov_base="\263\364S95\7\341\4\333\354\212U\265\230\\N\346\221\214h\327\275J\0f\215/\342\24/\204Z"..., iov_len=128}], 1, -1, RWF_NOWAIT) = 128
preadv2(9, [{iov_base="ool\33\v\365\16\214P\357\372F\rh\3473\23\351\347\201\213\336D\235\310i\257\3442\355\303\332"..., iov_len=256}], 1, -1, RWF_NOWAIT) = 256
preadv2(9, [{iov_base="}\370M\6[\240\260\250\376\312Nm\246\217\233u\275\242\201\316\341\22z\\\235A\366*\207\255\n\251"..., iov_len=512}], 1, -1, RWF_NOWAIT) = 512
preadv2(9, [{iov_base="[\23\210*\314\256\356u\246\260\246\242I\356\27\211O{,i\242\25\312\352\257\264\320\277\2255\305\352"..., iov_len=1024}], 1, -1, RWF_NOWAIT) = 1024
preadv2(9, [{iov_base="AAi\177\257:G!\225\362\3\2742t\256\303\363\178\277\266\f(\366\246\212\305\236qF/\350"..., iov_len=2048}], 1, -1, RWF_NOWAIT) = 2048
preadv2(9, [{iov_base="", iov_len=4096}], 1, -1, RWF_NOWAIT) = 0
write(1, "File size = 4096\n", 17File size = 4096
) = 17
close(9)
Hmm, seems like a kernel bug. What a shame - I guess #3518 will have to be reverted. What filesystem is the file on?
Hmm, seems like a kernel bug. What a shame - I guess #3518 will have to be reverted. What filesystem is the file on?
ext4
@br0adcast: Would you consider bisecting the kernel to find out when this broke?
@wmanley: The problem occurs after upgrading the kernel from 5.4.81 to 5.10.0 and higher. At least, I have found it in the following kernels:
I just reproduced a bug in an i686 virtual machine with Arch Linux 32 + lts kernel Linux 5.10.32-1.0-lts #1 SMP Thu, 22 Apr 2021 07:53:37 +0000 i686 GNU/Linux
I've tried to reproduce this with qemu by following https://ldpreload.com/blog/git-bisect-run . Code:
use tokio::io::AsyncReadExt;
use std::ffi::CStr;
fn main() -> std::io::Result<()> {
if std::process::id() == 1 {
//std::fs::create_dir("/dev")?;
if unsafe{libc::mount(
CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
CStr::from_bytes_with_nul(b"/dev\0").unwrap().as_ptr(),
CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
0, std::ptr::null())} != 0 {
return Err(std::io::Error::last_os_error());
};
}
let res = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap()
.block_on(test());
match res {
Ok(true) => println!("Success\n"),
Ok(false) => println!("Failed\n"),
Err(e) => println!("Error: {:?}\n", e),
}
if std::process::id() == 1 {
unsafe{libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)};
}
Ok(())
}
async fn test() -> std::io::Result<bool> {
let mut file = tokio::fs::File::open("init").await?;
let mut contents = vec![];
file.read_to_end(&mut contents).await?;
println!("File size = {}", contents.len());
Ok(contents.len() != 4096)
}
Compile with:
cargo build --target=i686-unknown-linux-musl
And build initrd with:
cp ../tokio-test/target/i686-unknown-linux-musl/debug/tokio-test initrd/init &&
cd initrd &&
echo init | cpio -H newc -o | gzip > initrd.gz
Kernel built with:
git checkout v5.10.32
make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- defconfig && make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- -j12
Run with:
qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz
The test passes. Maybe I need a real filesystem, and not just initrd.
Yes, it fails with ext4.
Created ext4 partition with file:
dd if=/dev/zero of=myextpart bs=1M count=1000
mkfs.ext4 myextpart
mkdir mnt
sudo mount -o loop myextpart $PWD/mnt
sudo dd if=/dev/urandom of=mnt/myfile bs=1M count=1
sudo umount mnt
Modified the test program to mount this fs and inspect this file:
use tokio::io::AsyncReadExt;
use std::ffi::CStr;
fn main() -> std::io::Result<()> {
if std::process::id() == 1 {
//std::fs::create_dir("/dev")?;
if unsafe{libc::mount(
CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
CStr::from_bytes_with_nul(b"/dev\0").unwrap().as_ptr(),
CStr::from_bytes_with_nul(b"devtmpfs\0").unwrap().as_ptr(),
0, std::ptr::null())} != 0 {
return Err(std::io::Error::last_os_error());
};
std::fs::create_dir("/myext")?;
if unsafe{libc::mount(
CStr::from_bytes_with_nul(b"/dev/sda\0").unwrap().as_ptr(),
CStr::from_bytes_with_nul(b"/myext\0").unwrap().as_ptr(),
CStr::from_bytes_with_nul(b"ext4\0").unwrap().as_ptr(),
0, std::ptr::null())} != 0 {
return Err(std::io::Error::last_os_error());
};
}
let res = tokio::runtime::Builder::new_multi_thread()
.enable_all()
.build()
.unwrap()
.block_on(test());
match res {
Ok(true) => println!("Success\n"),
Ok(false) => println!("Failed\n"),
Err(e) => println!("Error: {:?}\n", e),
}
if std::process::id() == 1 {
unsafe{libc::reboot(libc::LINUX_REBOOT_CMD_POWER_OFF)};
}
Ok(())
}
async fn test() -> std::io::Result<bool> {
let mut file = tokio::fs::File::open("/myext/myfile").await?;
let mut contents = vec![];
file.read_to_end(&mut contents).await?;
println!("File size = {}", contents.len());
Ok(contents.len() != 4096)
}
Run with partition attached as drive:
qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz -drive file=myextpart
Now to bisect...
Bisecting with script bisect-test.sh
:
#!/bin/bash -e
make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- defconfig &&
make ARCH=i386 CROSS_COMPILE=i686-linux-gnu- -j12 &&
qemu-system-i386 -nographic -append console=ttyS0 -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz -drive file=myextpart >output || exit 125
if grep -q ^Success output; then
exit 0
fi
if grep -q ^Failed output; then
exit 1
fi
exit 125
and commands:
git bisect start
git bisect bad v5.10
git bisect good v5.4
git bisect run ./bisect-test.sh
Points to https://github.com/torvalds/linux/commit/efa8480a831673bb52400df9dbe5da0aacda97bf: "fs: RWF_NOWAIT should imply IOCB_NOIO"
If that isn't a smoking gun then I don't know what is.
Hmm, I've just tested the recently released v5.12 and it succeeds there, so maybe it's already been fixed. I'll bisect to find the fix.
It was fixed in 5.11: https://github.com/torvalds/linux/commit/06c0444290cecf04c89c62e6d448b8461507d247 neither commit message mentions anything about bugs. The fix seems incidental.
I'll also run some tests to see if this is limited to 32-bit systems.
So given those commits, which kernel versions are affected? I'm guessing we can add a check for the kernel version on the first call and fall back to the old behavior on kernels with the bug.
If the 'preadv' function returns 0, we can try to do a regular read after that, to double check if it is an EOF or a kernel bug. This code in 'preadv2_safe' seems to work:
if 0 == bytes_read {
Err(std::io::Error::from_raw_os_error(libc::EAGAIN))
} else if bytes_read < 0 {
Err(std::io::Error::last_os_error())
} else {
/* preadv2 returns the number of bytes read, e.g. the number of bytes that have
* written into `unfilled`. So it's safe to assume that the data is now
* initialised */
dst.assume_init(bytes_read as usize);
dst.advance(bytes_read as usize);
Ok(())
}
I've just tested and I can confirm that the bug exists on x86-64 too, so it's not just a 32-bit issue.
So given those commits, which kernel versions are affected? I'm guessing we can add a check for the kernel version on the first call and fall back to the old behavior on kernels with the bug.
5.9 and 5.10 are broken according to my testing. Otherwise it works fine. I've only tested mainline linux, not any of the stable branches. I also can't speak to what's in the distro kernels. I'll run the test on a few stable kernels to see if they're affected too.
If the 'preadv' function returns 0, we can try to do a regular read after that, to double check if it is an EOF or a kernel bug.
That would certainly work around the bug, but would incur thread synchronisation overhead that #3518 was intended to avoid.
Stable kernel testing:
5.6.19: OK 5.7.19: OK 5.8.18: OK 5.9.16: BAD 5.10.38: BAD 5.11.22: OK
So there's nothing on the stable branch that affects this issue.
Regarding checking the kernel version, something like this should do the job:
fn has_buggy_preadv2() -> bool {
let mut uts : libc::utsname;
let res = unsafe{libc::uname(&mut uts as *mut utsname)};
if res < 0 {
// Getting kernel version failed, assume that it's buggy
return true;
}
uts.release[..4] == b"5.9." || uts.release[..5] == b"5.10."
}
But I would like to talk to the kernel devs first.
Sure, let me know what the kernel devs have to say.
I'm running into the same problem on my amd64 machine with kernel 5.10.0-6-amd64 (Debian 5.10.28-1) and (edit) using an XFS filesystem. I observed that when the file that is being read is already in the page cache (e.g. newly written files, or test by running md5sum
on the file) everything works as expected. However, when the file is not in the page cache (e.g. after dropping the page cache using sync; echo 1 > /proc/sys/vm/drop_caches
) only 4096 bytes get read.
Sure, let me know what the kernel devs have to say.
Here's the email: https://lore.kernel.org/linux-fsdevel/fea8b16d-5a69-40f9-b123-e84dcd6e8f2e@www.fastmail.com/T/#u . There's been no response yet, so I guess we should just work around it. I'll prepare a PR.
So how are we going to test for kernel versions?
@br0adcast: Would you mind changing the issue title to "fs::File reads first 4096 bytes only on Linux v5.9 and v5.10". That way it'll be easier for people to find this bug.
@carllerche by calling uname
(syscall)?
EDIT: Oh, I didn't see @Darksonn's PR, which is exactly doing that.
Sorry, I meant, how are we going to add multiple kernel versions to CI to test for kernel specific regressions in the future :)
Just to make sure everyone heard about it: We released 1.6.1
a few days ago, which does not have this bug. The 1.6.0
release was yanked.
@praseodym wonderful. It doesn't matter with 4096 bytes(I have verified it and use 32,64,128,,, 4096, they all return 0 if I drop cache before preadv2 with RWF_NOWAIT flag). Look kernel commit fs: RWF_NOWAIT should imply IOCB_NOIO, with the change allowing read-ahead for IOCB_NOWAIT, we changed the RWF_NOWAIT semantics of only doing cached reads. So if we use sync or echo 1 to drop cache interface, then we will get 0. I plan to write a regression test to ltp community for this.
Can this be closed now? The commit in question was reverted in 3fbcf1ba5077d31ac59ede386d1ea887b4c22e99
Version └── tokio v1.6.0 └── tokio-macros v1.2.0 (proc-macro)
Platform Raspberry Pi 32-bit Linux raspberrypi 5.10.17-v7l+ #1414 SMP Fri Apr 30 13:20:47 BST 2021 armv7l GNU/Linux
Description
After upgrading tokio from 1.5 to 1.6 only 4096 bytes can be read from the file
I tried this code:
and I get File size = 4096 for any file.