Forza-tng closed this 3 years ago.
I also tried to reproduce it, but failed as well.
I tried a 4G fs on a 10G device, created both with mkfs -b 4G and with mkfs -b 10G followed by a resize; neither reproduced the same behavior.
Any extra info? Like the original fs size?
Hello @adam900710,
I managed to capture the kernel log. The issues started at 4:11 am. The log file is 82 GB, so there were several million lines of the same kernel error message about an attempt to access beyond the end of the device.
At 01:32 am I issued this command:
btrfs filesystem resize 1:-4G /
Then, at about 4:11 am, the errors about accessing beyond the end of the device started:
Jul 29 01:32:53 e350 kernel: [459405.863906] BTRFS info (device sda3): resizing devid 1
Jul 29 01:32:53 e350 kernel: [459405.884731] BTRFS info (device sda3): resize device /dev/sda3 (devid 1) from 250686210048 to 246391242752
Jul 29 04:11:21 e350 kernel: [468908.080397] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080403] sda3: rw=3, want=498010111, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080408] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080410] sda3: rw=3, want=506398718, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080411] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080413] sda3: rw=3, want=514787325, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080414] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080416] sda3: rw=3, want=523175932, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080416] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080418] sda3: rw=3, want=531564539, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080419] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080420] sda3: rw=3, want=539953146, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080421] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080422] sda3: rw=3, want=548341753, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080423] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080424] sda3: rw=3, want=556730360, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080425] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080426] sda3: rw=3, want=565118967, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080427] attempt to access beyond end of device
Jul 29 04:11:21 e350 kernel: [468908.080429] sda3: rw=3, want=573507574, limit=489621504
Jul 29 04:11:21 e350 kernel: [468908.080429] attempt to access beyond end of device
[... snip]
The log continues for roughly another 80 GiB of lines and ends with:
[... snip]
Jul 29 05:12:46 e350 kernel: [472591.567052] sda3: rw=3, want=36028797466641925, limit=489621504
Jul 29 05:12:46 e350 kernel: [472591.567053] attempt to access beyond end of device
Jul 29 05:12:46 e350 kernel: [472591.567053] sda3: rw=3, want=36028797475030532, limit=489621504
Jul 29 05:12:46 e350 kernel: [472591.567053] attempt to access beyond end of device
Jul 29 05:12:46 e350 kernel: [472591.567054] sda3: rw=3, want=36028797483419139, limit=489621504
Jul 29 05:12:46 e350 kernel: [472591.567054] attempt to access beyond end of device
Jul 29 05:12:46 e350 kernel: [472591.567054] sda3: rw=3, want=36028797491807746, limit=489621504
Jul 29 05:12:46 e350 kernel: [472591.567054] attempt to access beyond end of device
Jul 29 05:12:46 e350 kernel: [472591.567055] sda3: rw=3, want=36028797500196353, limit=489621504
Jul 29 05:12:46 e350 kernel: [472591.567055] attempt to access beyond end of device
Jul 29 05:12:46 e350 kernel: [472591.567056] sda3: rw=2051, want=36028797500196864, limit=489621504
Jul 29 05:12:46 e350 kernel: [472591.567061] BTRFS warning (device sda3): failed to trim 1 device(s), last error -5
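A quick sanity check of the numbers in these messages, assuming that want and limit are counted in 512-byte sectors (the arithmetic below is consistent with that). btrfs filesystem resize changes only the filesystem's view of the device, not the partition, so limit still reflects the full size of sda3, and the 4 GiB shrink shows up as slack at its end. The helper name is hypothetical.

```python
# Assumption: "want" and "limit" in "attempt to access beyond end of device"
# messages are 512-byte sector counts.
SECTOR_SIZE = 512

def sectors_to_bytes(sectors: int) -> int:
    """Convert a 512-byte sector count to bytes."""
    return sectors * SECTOR_SIZE

limit_sectors = 489_621_504       # "limit=489621504" (size of /dev/sda3)
size_before = 250_686_210_048     # resize "from" value, in bytes
size_after = 246_391_242_752      # resize "to" value, in bytes
first_want = 498_010_111          # first "want=" value in the excerpt

# The partition is exactly as large as the filesystem was before the shrink.
print(sectors_to_bytes(limit_sectors) == size_before)   # True

# The shrink left 4 GiB of slack at the end of sda3.
slack = sectors_to_bytes(limit_sectors) - size_after
print(slack, slack == 4 * 1024**3)                       # 4294967296 True

# The first reported "want" lies 8388607 sectors (just under 4 GiB) past the limit.
print(first_want - limit_sectors)                        # 8388607
```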
Because I was worried the slack space was causing the problems, I added it back last night by running:
btrfs filesystem resize 1:max /
[... snip]
Jul 30 01:01:47 e350 kernel: [543890.813492] BTRFS info (device sda3): resizing devid 1
Jul 30 01:01:47 e350 kernel: [543890.891068] BTRFS info (device sda3): resize device /dev/sda3 (devid 1) from 246391242752 to 250686210048
According to my cron.log, fstrim runs at exactly 4:11 am every day, but yesterday it failed for /dev/sda3 (the root filesystem): the 29 Jul run below has no result line for it. So this definitely looks correlated with the kernel errors.
*** Tue, 28 Jul 2020 04:11:13 +0200 ***
/mnt/systemBoot: 185.9 MiB (194867200 bytes) trimmed on /dev/sda2
/: 46.4 GiB (49835204608 bytes) trimmed on /dev/sda3
*** Wed, 29 Jul 2020 04:11:08 +0200 ***
/mnt/systemBoot: 214.5 MiB (224882688 bytes) trimmed on /dev/sda2
*** Thu, 30 Jul 2020 04:11:06 +0200 ***
/mnt/systemBoot: 214.4 MiB (224808960 bytes) trimmed on /dev/sda2
/: 44.4 GiB (47641632768 bytes) trimmed on /dev/sda3
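The gap in this log can also be found mechanically. A minimal sketch, assuming the format shown above (a "*** timestamp ***" header per run, followed by fstrim -v style "mountpoint: ... (N bytes) trimmed on device" lines); it lists the runs that have no result line for a given device. The function names are hypothetical.

```python
from __future__ import annotations

import re

HEADER = re.compile(r"^\*\*\* (.+) \*\*\*$")
RESULT = re.compile(r"^(\S+): .* \((\d+) bytes\) trimmed on (\S+)$")

def trimmed_per_run(log: str) -> dict[str, dict[str, int]]:
    """Map each run's timestamp to {device: bytes trimmed}."""
    runs: dict[str, dict[str, int]] = {}
    current = None
    for line in log.splitlines():
        line = line.strip()
        if m := HEADER.match(line):
            current = m.group(1)
            runs[current] = {}
        elif current is not None and (m := RESULT.match(line)):
            runs[current][m.group(3)] = int(m.group(2))
    return runs

def runs_missing(log: str, device: str) -> list[str]:
    """Timestamps of runs that reported nothing for `device`."""
    return [ts for ts, devs in trimmed_per_run(log).items() if device not in devs]

CRON_LOG = """\
*** Tue, 28 Jul 2020 04:11:13 +0200 ***
/mnt/systemBoot: 185.9 MiB (194867200 bytes) trimmed on /dev/sda2
/: 46.4 GiB (49835204608 bytes) trimmed on /dev/sda3
*** Wed, 29 Jul 2020 04:11:08 +0200 ***
/mnt/systemBoot: 214.5 MiB (224882688 bytes) trimmed on /dev/sda2
*** Thu, 30 Jul 2020 04:11:06 +0200 ***
/mnt/systemBoot: 214.4 MiB (224808960 bytes) trimmed on /dev/sda2
/: 44.4 GiB (47641632768 bytes) trimmed on /dev/sda3
"""

print(runs_missing(CRON_LOG, "/dev/sda3"))  # ['Wed, 29 Jul 2020 04:11:08 +0200']
```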
Yes, it's definitely fstrim causing the problem.
The kernel message prints the bio's opf (bi_opf), which shows it's a discard: rw=3 means REQ_OP_DISCARD. Even rw=2051 only adds REQ_ flags on top; after masking them off with & ((1 << 8) - 1), which keeps the low 8 op bits, it's still 3, so it's still a discard.
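The decoding can be sketched as follows. The constants are assumptions taken from include/linux/blk_types.h in 5.x kernels (REQ_OP_BITS = 8, with the op number in the low bits of bi_opf and flags above it); the helper name decode_opf is hypothetical. The last two lines show why the mask must be written as (1 << 8) - 1: in both C and Python, - binds tighter than <<, so 2 << 8 - 1 evaluates to 256 and would mask away the op bits entirely.

```python
from __future__ import annotations

# Assumed from include/linux/blk_types.h (5.x): op number in the low 8 bits.
REQ_OP_BITS = 8
REQ_OP_MASK = (1 << REQ_OP_BITS) - 1   # 0xff

REQ_OPS = {0: "REQ_OP_READ", 1: "REQ_OP_WRITE", 2: "REQ_OP_FLUSH", 3: "REQ_OP_DISCARD"}

def decode_opf(rw: int) -> tuple[str, int]:
    """Split a printed rw=/bi_opf value into (op name, remaining flag bits)."""
    op = rw & REQ_OP_MASK
    flags = rw & ~REQ_OP_MASK
    return REQ_OPS.get(op, f"op {op}"), flags

print(decode_opf(3))        # ('REQ_OP_DISCARD', 0)
print(decode_opf(2051))     # ('REQ_OP_DISCARD', 2048)

# Operator precedence pitfall: this is 2 << 7, not a low-bits mask.
print(2 << 8 - 1)           # 256
print(2051 & (2 << 8 - 1))  # 0
```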
Since you're using cron to run fstrim, you're probably not using the discard mount option, which means the only source of trimming is fstrim.
Furthermore, your dmesg shows failed to trim 1 device(s), last error -5, so the error comes from the code that trims free device extents.
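Both clues in this paragraph can be checked from user space. A minimal sketch, assuming the /proc/self/mounts line format (device, mount point, fs type, comma-separated options, ...): the first two helpers report whether a mount has discard or discard=... set, and the third maps the kernel's negative error code to its errno name (-5 is EIO). The helper names and the sample mounts line are hypothetical.

```python
from __future__ import annotations

import errno

def mount_options(mounts_text: str, mountpoint: str) -> list[str]:
    """Options of `mountpoint` from /proc/self/mounts-style text.

    The last matching entry wins, since later mounts shadow earlier ones.
    Note: the kernel escapes spaces in paths as \\040; not handled here.
    """
    opts: list[str] = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mountpoint:
            opts = fields[3].split(",")
    return opts

def uses_online_discard(opts: list[str]) -> bool:
    """True for 'discard' or 'discard=<mode>', False for 'nodiscard' or absent."""
    return any(o == "discard" or o.startswith("discard=") for o in opts)

def kernel_error_name(ret: int) -> str:
    """Map a negative kernel return code such as -5 to its errno name."""
    return errno.errorcode.get(-ret, str(ret))

# Hypothetical mounts line for a root fs mounted without online discard.
SAMPLE_MOUNTS = "/dev/sda3 / btrfs rw,noatime,ssd,space_cache,subvolid=5,subvol=/ 0 0\n"

print(uses_online_discard(mount_options(SAMPLE_MOUNTS, "/")))  # False
print(kernel_error_name(-5))                                    # EIO
```

On a live system the same check can be run on the contents of /proc/self/mounts.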
Now we have more clues than I thought. Maybe it's trim, then resize, then trim again that causes the problem. I'll keep digging.
Thanks for your detailed reports, they really help!
Bingo, it's exactly trim, resize, trim that triggers the bug.
I'll send out the fix soon.
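The triggering sequence is small enough to script. A minimal sketch that only prints the commands by default (dry_run=True); the mount point /mnt/scratch and the helper names are hypothetical, and a real run should only target a throwaway btrfs filesystem, since on an affected kernel the second trim is expected to produce the out-of-range discards seen in the log above.

```python
from __future__ import annotations

import shlex
import subprocess

def repro_commands(mountpoint: str, shrink: str = "-4G", devid: int = 1) -> list[list[str]]:
    """trim -> shrink -> trim, the sequence reported to trigger the bug."""
    return [
        ["fstrim", "-v", mountpoint],
        ["btrfs", "filesystem", "resize", f"{devid}:{shrink}", mountpoint],
        ["fstrim", "-v", mountpoint],
    ]

def run(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print the commands, or execute them (as root) when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)

run(repro_commands("/mnt/scratch"))
```

After a real run, dmesg can be searched for "attempt to access beyond end of device".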
Since you're using cron to run fstrim, so you're probably not using discard mount option, which means the only source of trimming is fstrim.
Correct. I opted to run fstrim from a cron job instead of using the discard=async mount option.
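For reference, what the cron job's fstrim does boils down to a single FITRIM ioctl on the mount point. A minimal Python sketch, assuming the generic Linux ioctl number encoding (as on x86 and arm) and the struct fstrim_range layout from linux/fs.h (three 64-bit fields: start, len, minlen); it needs CAP_SYS_ADMIN, and the helper name fitrim is hypothetical.

```python
import fcntl
import os
import struct

# struct fstrim_range { __u64 start; __u64 len; __u64 minlen; }  (linux/fs.h)
FSTRIM_RANGE = struct.Struct("QQQ")

# Generic _IOC encoding: dir << 30 | size << 16 | type << 8 | nr.
_IOC_WRITE, _IOC_READ = 1, 2

def _iowr(type_char: str, nr: int, size: int) -> int:
    return ((_IOC_READ | _IOC_WRITE) << 30) | (size << 16) | (ord(type_char) << 8) | nr

FITRIM = _iowr("X", 121, FSTRIM_RANGE.size)  # _IOWR('X', 121, struct fstrim_range)

def fitrim(mountpoint: str, start: int = 0, length: int = 2**64 - 1, minlen: int = 0) -> int:
    """Ask the filesystem to discard free space; returns bytes trimmed."""
    buf = bytearray(FSTRIM_RANGE.pack(start, length, minlen))
    fd = os.open(mountpoint, os.O_RDONLY)
    try:
        # The kernel writes the range back; its len field becomes the trimmed byte count.
        fcntl.ioctl(fd, FITRIM, buf)
    finally:
        os.close(fd)
    return FSTRIM_RANGE.unpack(buf)[1]

# Example (as root): print(fitrim("/"))
```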
Thanks for your detailed reports, it really helps!
Thanks for helping! 🥇
Since you're the reporter, would you like to provide your name and email address for the Reported-by tag?
You need to use your real name, though, so feel free to stay anonymous if you prefer. (https://www.kernel.org/doc/html/v4.10/process/submitting-patches.html#developer-s-certificate-of-origin-1-1 and https://www.kernel.org/doc/html/v4.10/process/submitting-patches.html#using-reported-by-tested-by-reviewed-by-suggested-by-and-fixes)
Since you're the reporter, would you like to provide your name and mail address for the Reported-by tag?
Thanks, that's OK. :)
Feel free to verify the fix: https://patchwork.kernel.org/patch/11692799/
I applied the patch. Let's see how it goes. Thanks!
Update: no issues so far. 👍
Thanks for the report and fix, it's going to appear in stable sometime next week.
Hi, recently I ran
btrfs fi resize 1:-4G /
on my single-disk root filesystem. Last night I hit a problem during the weekly fstrim. It seems fstrim tries to trim space in the slack area.
After resizing back to max, fstrim worked normally again.
I'm not too keen to try to reproduce this on my root fs, and I don't have another SSD. I'll see if I can use VirtualBox to simulate the same issue.
Gentoo Linux kernel 5.7.9, btrfs-progs 5.7, gcc 9.3.0. AMD Athlon 3000G CPU.