Which medium are these Pis booting from?
@pelwell Micro SD card.
What sort of micro sd card?
It's a SanDisk Extreme Pro 128GB on all three of the devices. For what it's worth, I also have a bunch of external HDDs attached to each Pi, but those have their own power supply and are connected via a powered hub.
I also use an initramfs with dropbear for rootfs decryption according to https://github.com/ViRb3/pi-encrypted-boot-ssh.
This setup hasn't ever caused issues until yesterday.
Is your debug output above from a serial port? Or rather, do you have a Pi 5-compatible serial cable you can use to capture debug output? If so, try with ignore_loglevel in cmdline.txt (keeping it to a single line) - there should be some messages about the SD card, but whatever you get, post it all.
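For illustration, a sketch of what cmdline.txt might look like with ignore_loglevel appended - the other parameters shown here are placeholders for whatever is already in the file:

```
console=serial0,115200 console=tty1 root=PARTUUID=xxxxxxxx-02 rootfstype=ext4 fsck.repair=yes rootwait ignore_loglevel
```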
These logs were taken from the HDMI output, I just wrote down the important lines by hand.
I do actually have a debug probe, I hooked it up for the past 2 days, but unfortunately even after 30+ restarts at various times of the day and intervals, I could not reproduce the issue.
I thought maybe it was a fluke, so I restarted my other two Raspberry Pis to test, and surprisingly both of them failed to boot exactly like before. Sadly, those two devices are remote, so I can't attach a probe to them.
I would like to mention, when this issue occurred before, I could enter the initramfs/dropbear environment, but after decrypting the rootfs, the mounting timeout would happen.
In my last attempt on the remote Raspberry Pis, I ran apt update && apt upgrade && rpi-eeprom-update -a, and when I rebooted, the Raspberry Pi never came online - the green light was on, but even initramfs/dropbear was not reachable. A force power cycle fixed it for now.
Sorry I can't provide any concrete details, but since the day the Raspberry Pi 5 was released, I have been running these 3 in the exact same setup with no issues. So while I can't tell what's wrong, something definitely seems off. Thanks!
Is it possible that you need to regenerate the initramfs following the kernel update?
It auto-regenerates during every update, and I explicitly looked for those lines to make sure nothing went wrong, as that was my first suspect as well. I also regenerated it manually, just to double-check.
I would also like to add, I have updated all 3 devices in this exact way at least 10 times since the Raspberry Pi 5 was released and never experienced an issue.
Here's an ls of the initrd files:
# ls -alt /boot/initrd*
-rw------- 1 root root 13890585 Sep 13 18:36 /boot/initrd.img-6.6.31+rpt-rpi-2712
-rw------- 1 root root 13888365 Sep 13 18:35 /boot/initrd.img-6.6.31+rpt-rpi-v8
-rw------- 1 root root 14076413 Sep 13 18:35 /boot/initrd.img-6.6.47+rpt-rpi-2712
-rw------- 1 root root 14070518 Sep 13 18:35 /boot/initrd.img-6.6.47+rpt-rpi-v8
Actually, one more thing I forgot to mention: I am running the kernel in 4KB addressing mode instead of 16KB, for compatibility with my external disks, which are formatted with BTRFS and 4KB sectors.
One of my remote Raspberry Pi 5s just hung up out of the blue. It did not reboot, the green light was still on, but it was completely unresponsive - could not SSH, no disk activity, etc. After force restarting it, I checked the logs, and there was absolutely nothing since it crashed ~8 hours ago. Last logs were just some routine networking related cron jobs I have running.
I have now downgraded all my devices to the previous kernel, 6.6.31, will report back how it goes. For reference, this is how I did it:
set -e
cd /boot/firmware
# Back up the current kernel and initramfs images
mv kernel8.img kernel8.img.bak
mv kernel_2712.img kernel_2712.img.bak
mv initramfs_2712 initramfs_2712.bak
mv initramfs8 initramfs8.bak
# Put the 6.6.31 kernel and initramfs from /boot in their place
cp -a ../vmlinuz-6.6.31+rpt-rpi-2712 kernel_2712.img
cp -a ../vmlinuz-6.6.31+rpt-rpi-v8 kernel8.img
cp -a ../initrd.img-6.6.31+rpt-rpi-2712 initramfs_2712
cp -a ../initrd.img-6.6.31+rpt-rpi-v8 initramfs8
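To undo the downgrade later - a sketch, assuming the .bak files created above are still in place and nothing else has touched them:

```
cd /boot/firmware
# Restore the backed-up kernel and initramfs images over the 6.6.31 copies
mv -f kernel8.img.bak kernel8.img
mv -f kernel_2712.img.bak kernel_2712.img
mv -f initramfs_2712.bak initramfs_2712
mv -f initramfs8.bak initramfs8
```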
Reference of kernel versions: https://archive.raspberrypi.com/debian/pool/main/l/linux/
Since downgrading, I have restarted the servers multiple times and had no issues. No random crashes either. I am therefore very confident that this was due to the kernel upgrade. Sadly, I'm not sure how I can help more.
The major change to the SD card handling has been the addition of support for command queueing. I'm going to list a number of kernel builds in chronological order. For each build, run sudo rpi-update <hash>, where <hash> is the 7-digit hexadecimal number on the line. I suggest you start at the top and work your way down until you start to see the problem again.
The builds:
f17defc - "kernel: Bump to 6.6.22" // before command queueing was enabled
f637152 - "kernel: drivers: mmc: add SD support for Command Queueing"
8ec753c - "kernel: SD command queueing take 2"
b0bbbab - "kernel: SD fixes round 3"
I think I have the same issue on my RPi 5 on 6.6.47. OS and all data are on a SanDisk Extreme 256 GB SD card. Sometimes it hangs at boot, but for me it's less than 50% of the time.
Additionally, sometimes during normal system runtime the whole system suddenly freezes. Because I have a system monitor open at all times, I always see the disk I/O of my SD card completely stop when that happens. From one moment to the next there is not a single byte of disk I/O to or from the SD card any longer, and from there on the system completely freezes and needs a hard power cycle. For me this happens much more frequently when there is a lot of I/O going on, such as pulling large container images. For example, there is one particular image that's almost 2 GB which I tried to pull dozens of times last week, and each time the system eventually froze. It can also happen during normal desktop use, e.g. while browsing with Firefox or using vscode, though less frequently.
I can reasonably rule out a faulty SD card because I bought a new SanDisk Extreme 128 GB card, cloned the OS to it (yes, I did properly downsize partitions) and had the exact same issue with that.
Just now I downgraded to 6.6.22 using rpi-update f17defc and was able to pull that large image for the first time without a freeze. So I have no long-term experience yet, but it looks very promising. Thank you so much for the research. I almost thought my Pi was faulty until I found this issue.
Thanks for chiming in @fshimizu! Happy to hear this is not an isolated case. I have been meaning to troubleshoot more, but sadly have a busy week ahead. I hope I can help more after that.
Can I ask if you have any external disks connected? When you mention that I/O causes the crash, do you mean load on the SD card or any external disk?
@ViRb3 Happy to try and clarify further. No external disks. This Pi is only for experimenting and a bit of desktop use. So there are peripherals and an audio device connected via USB, but the only storage is the SD card. So, all I/O is on the SD card.
> you mention that I/O causes the crash
Let me be careful here: I can't really confirm that I/O directly causes the crash. It's just an observation I made that for me it seems to happen much more often during times of heavy I/O. But not exclusively.
I tried to get more info, e.g. by running dmesg right after the freeze happens, but the command simply hangs forever - that is, if input still works at all. After a power cycle there seems to be nothing helpful in the logs. I guess if any errors are logged, they can't be written. Unfortunately I only just started using Pis, so I don't have a serial cable or anything for debugging. But I'll try to provide any info that could help.
Thanks for the update. @pelwell posted above some commit hashes of various major changes in the kernel. Can you follow the instructions and test which one does and doesn't crash? I'm away from my Pis, so I don't dare to touch them as I can't do anything if they hang.
This is certainly not exhaustive, but I installed the kernels listed and ran the following test procedure, which had previously frozen my system on 6.6.47 reliably.
podman rmi <BIG_IMAGE>
podman pull <BIG_IMAGE>
These are the results.
Test # | Commit | Comment | uname -a | Test result
---|---|---|---|---
1 | f17defc | "kernel: Bump to 6.6.22" // before command queueing was enabled | Linux <HOSTNAME> 6.6.22-v8-16k+ #1745 SMP PREEMPT Tue Mar 19 17:57:20 GMT 2024 aarch64 GNU/Linux | seems stable
2 | f637152 | "kernel: drivers: mmc: add SD support for Command Queueing" | Linux <HOSTNAME> 6.6.22-v8-16k+ #1746 SMP PREEMPT Wed Mar 20 17:21:57 GMT 2024 aarch64 GNU/Linux | seems stable
3 | 8ec753c | "kernel: SD command queueing take 2" | Linux <HOSTNAME> 6.6.22-v8-16k+ #1747 SMP PREEMPT Wed Mar 27 17:06:56 GMT 2024 aarch64 GNU/Linux | seems stable
4 | b0bbbab | "kernel: SD fixes round 3" | Linux <HOSTNAME> 6.6.25-v8-16k+ #1752 SMP PREEMPT Wed Apr 10 14:30:55 BST 2024 aarch64 GNU/Linux | seems stable
5 | ? | My previous kernel from the distribution | Linux <HOSTNAME> 6.6.47+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64 GNU/Linux | Froze at step 1.
6 | f17defc | "kernel: Bump to 6.6.22" // before command queueing was enabled | Linux <HOSTNAME> 6.6.22-v8-16k+ #1745 SMP PREEMPT Tue Mar 19 17:57:20 GMT 2024 aarch64 GNU/Linux | seems stable
Test 5 was a control with the regular kernel installed by the distro, which had been freezing for me. On the first try I didn't even get to the main test of pulling the image; the system already froze while just starting some other, smaller containers.
Test 6 was another control to verify that it would work again with the older kernel.
On the version with the crash/freeze, can you add dtparam=sd_cqe=off to /boot/firmware/config.txt and retest?
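For illustration, a sketch of the change - appended to the existing /boot/firmware/config.txt, leaving the other entries untouched:

```
dtparam=sd_cqe=off
```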
@P33M thanks a lot for that suggestion. You might be onto something here. I've done the same test as before, with and without the dtparam=sd_cqe=off parameter, i.e. two rounds each of pulling a large container image, starting a container from it and deleting it again.
All tests were done with the kernel from the distribution where I initially noticed the issue: Linux <HOSTNAME> 6.6.47+rpt-rpi-2712 #1 SMP PREEMPT Debian 1:6.6.47-1+rpt1 (2024-09-02) aarch64 GNU/Linux
Test # | with or without dtparam=sd_cqe=off? | Result
---|---|---
1 | without | froze during image pull, 1st round
2 | with | seems stable
3 | without | froze during container creation, 1st round
4 | with | seems stable
I will keep using the problematic 6.6.47 kernel with the dtparam=sd_cqe=off option for now, to test this a bit more long-term.
To hopefully rule out some other factors, here are some details on @ViRb3's setup and my settings in comparison.
> I also have a bunch of external HDDs attached
I have no storage other than the SD card.
> I also use an initramfs with dropbear for rootfs decryption
I use neither storage encryption nor dropbear.
> I am running the kernel in 4KB addressing mode instead of 16KB
I haven't changed that, so I use whatever is the default in a fresh installation.
For each of the cards where boot fails without sd_cqe=off, can you run and post the output of the following?
find /sys/bus/mmc/devices/mmc0\:*/ -type f -not -path "*block*" -print -exec cat {} \;
Here you go: device-info.txt
At the moment I only have this one SD card. The other one I had cloned from it previously is already overwritten.
I've managed to hang a card I previously thought was reliable - a Sandisk Extreme Pro 64GB with a date of 09/2021. Running fstrim / sync / fio in parallel managed to wedge the card in about 20 minutes.
I can reproduce this on our own cards - it smells a lot like a deadlock, as there are no IO or CQE engine errors but eventually all tasks that were accessing the mmc device report scheduler timeouts.
I was operating under the assumption that async and sync mmc_blk commands had sufficient guards to prevent doing sync while async requests were pending, but perhaps not.
I have a working theory - it's sort of a consequence of limiting posted writes to 1, and a poor implementation of CQHCI's halt mechanism.
When halted, the CQE can stop after it finishes its current task descriptor which defines a contiguous data block transfer. Such a descriptor may be in a chain of descriptors for a large block write associated with a tag (this is key - the pending write needs to be >128K in length). Issuing an SD cache flush command halts the CQE, and the Linux CQHCI driver relies on more requests being submitted to restart the CQE.
Edit: the mmc_blk driver does attempt to avoid issuing a flush while IO is ongoing, by waiting for CQE to be idle, but it does this in process context. There's the possibility of preemption and something restarting CQE by submitting a (write) request.
It follows that if the block layer queue is stalled, either because it is jammed full of writes (that can't get posted due to the 1x write limit), or the OS needs to serialise a critical write and issued the flush along with the write, then activity stops.
There's also another potential hole where halting the CQE prevents automatic card status polling from occurring, so if a critical read task was submitted and a flush occurred before the card reported ready, you also could get wedged.
I can now get CQE to halt cleanly after finishing all its current tasks, but things still fall over because of the MMC framework's multiple interlocking mechanisms for declaring "I'm exclusively using the host".
Now at least I get timeouts attributed to the MMC layer, but forward progress for IO still stops.
Hi, I was already reporting problems with CQE and the SanDisk Extreme Pro 128GB and SanDisk Extreme 64GB during testing. I could not find the cause at the time, but maybe now I have found something:
Using a swap partition (instead of the default swap file) on the SD card together with CQE results in failed boots in 8/10 cases. The fstab entry looks like:
PARTUUID=5e07cda8-03 none swap sw,discard,pri=10 0 0
Without the swap partition, the system boots with CQE in 10/10 cases, but sometimes individual files are read incorrectly. I mean, there is no error reading the file, but the checksums do not match. Disabling CQE cures all the errors.
With the new kernel (6.6.51-1+rpt3, 2024-10-08), which changes the default CQE behaviour to off, I sometimes observe CQE recovery in dmesg. Maybe the developers enabled some kind of CQE debugging? This is without the swap partition, while checksumming the files on the SD card. I believe that CQE recovery still reads the files just fine (no checksum errors), but it is slightly slower.
It looks like this:
[Oct12 23:12] mmc0: cqhci: timeout for tag 30, qcnt 1
[ +0.000006] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
[ +0.000002] mmc0: cqhci: Caps: 0x000030c8 | Version: 0x00000510
[ +0.000003] mmc0: cqhci: Config: 0x00000101 | Control: 0x00000000
[ +0.000002] mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000016
[ +0.000002] mmc0: cqhci: Int sig: 0x00000016 | Int Coal: 0x00000000
[ +0.000002] mmc0: cqhci: TDL base: 0x05a48000 | TDL up32: 0x00000001
[ +0.000003] mmc0: cqhci: Doorbell: 0x40000000 | TCN: 0x00000000
[ +0.000002] mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x40000000
[ +0.000002] mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00004000
[ +0.000002] mmc0: cqhci: SSC2: 0x0000aaaa | DCMD rsp: 0x00000000
[ +0.000001] mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
[ +0.000002] mmc0: cqhci: Resp idx: 0x0000000d | Resp arg: 0x00000000
[ +0.000006] mmc0: running CQE recovery
[Oct12 23:13] mmc0: cqhci: timeout for tag 2, qcnt 1
[ +0.000007] mmc0: cqhci: ============ CQHCI REGISTER DUMP ===========
[ +0.000002] mmc0: cqhci: Caps: 0x000030c8 | Version: 0x00000510
[ +0.000002] mmc0: cqhci: Config: 0x00000101 | Control: 0x00000000
[ +0.000003] mmc0: cqhci: Int stat: 0x00000000 | Int enab: 0x00000016
[ +0.000002] mmc0: cqhci: Int sig: 0x00000016 | Int Coal: 0x00000000
[ +0.000002] mmc0: cqhci: TDL base: 0x05a48000 | TDL up32: 0x00000001
[ +0.000003] mmc0: cqhci: Doorbell: 0x00000004 | TCN: 0x00000000
[ +0.000002] mmc0: cqhci: Dev queue: 0x00000000 | Dev Pend: 0x00000004
[ +0.000002] mmc0: cqhci: Task clr: 0x00000000 | SSC1: 0x00004000
[ +0.000002] mmc0: cqhci: SSC2: 0x0000aaaa | DCMD rsp: 0x00000000
[ +0.000002] mmc0: cqhci: RED mask: 0xfdf9a080 | TERRI: 0x00000000
[ +0.000002] mmc0: cqhci: Resp idx: 0x0000000d | Resp arg: 0x00000000
[ +0.000006] mmc0: running CQE recovery
Does it mean that my SD card is faulty with CQE enabled?
I have just received the new official Raspberry Pi A2 SD card. I'll try that one as well.
The posted write count underflows which stalls IO submission forevermore. I have a working theory as to why - although some members of the mmc_queue struct are accessed under spinlock, other members aren't. If there's a read-modify-write on a member in the same cacheline as a critical variable, then this tramples the data that was "protected". The clue was that the hang typically coincided with an FIO process spawning, and never during one of the IO sequences. I've added lots of gratuitous padding to these struct members and so far no hang...
It still hangs, and the above is incorrect.
In about 30 minutes, once the auto-build completes, please try testing with sudo rpi-update pulls/6418, re-enabling sd_cqe in config.txt via dtparam=sd_cqe=on, and rebooting. If you were previously getting cqhci timeout errors with register dumps, that would indicate an incompatible card and you will still need to leave sd_cqe off.
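A sketch of that sequence, using the build and parameter named above:

```
sudo rpi-update pulls/6418
# then in /boot/firmware/config.txt set:
#   dtparam=sd_cqe=on
sudo reboot
```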
Thanks a lot for your efforts, @P33M. I'll test when I get some time. One question ahead of time:
> If you were previously getting cqhci timeout errors with register dumps, this would indicate an incompatible card
Is there an easy way to check that without a serial cable or other debug setup while testing? I will do the same tests as I did previously in any case. Just wondering if I can get more info out of it than simply "hangs" or "probably doesn't hang".
After a round of my usual amateurish tests with the test build and with my old kernel from the distro, I'm cautiously optimistic. dtparam=sd_cqe=on was set at all times.
I'll keep using the pulls/6418 build with dtparam=sd_cqe=on from now on, and if it hangs I'll report back.
In other news: My SD card now gives off a distinct glow in the dark. :-)
I have some good news! I'm still testing the stock 6.6.51-1+rpt3 (2024-10-08) kernel; I haven't had time to test pulls/6418 yet.
I have a total of 4 different A2 SD cards, including the new official Raspberry Pi A2 card.
The recovery messages in dmesg are not very frequent. I'm checksumming 5GB of files on the SD card 100 times and I get on average only about 20 recovery messages, so roughly 1 in 5 runs.
On the other hand, the swap partition creates problems almost every time, in approximately 4/5 of the reboots. Now I think it is related to the discard option: the swap partition without discard works fine. It looks like cards 2 and 3 are very sensitive to fstrim.
Fstrim alone works as expected, but fstrim immediately followed by a card read has a fatal effect. (This is probably what happens when a swap partition with discard is activated during boot.)
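For reference, the same swap entry without the discard option - a sketch based on the fstab line quoted earlier:

```
PARTUUID=5e07cda8-03 none swap sw,pri=10 0 0
```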
$ sudo fstrim -va;sudo dd if=/dev/mmcblk0 bs=1M of=/dev/null status=progress count=4096
/boot/firmware: 352.8 MiB (369897472 bytes) trimmed on /dev/mmcblk0p1
/: 85.7 MiB (89829376 bytes) trimmed on /dev/mmcblk0p2
bash: /usr/bin/sudo: Input/output error
I'll try to make it more reproducible.
fstrim works for me on a Sandisk 64GB Extreme Pro A2. Lots of these cards should get trim/discard demoted to an erase operation, as there is a pre-existing quirk for broken behaviour. Please provide the register information for cards 2 and 3 as per the command earlier in this issue, and the output of dmesg | grep mmc0 - there should be a "quirks" line.
If it helps, here is the quirks line of my Sandisk Extreme (non-Pro) A2 256 for which I provided the register info earlier.
[ 0.670245] mmcblk0: mmc0:aaaa SN256 238 GiB (quirks 0x00004000)
Edit: This is on pulls/6418 with dtparam=sd_cqe=on.
Well, I don't think this is just a trim problem. Actually it is a CQE+trim problem. With CQE disabled, trim works just fine on cards 2 and 3.
Looks like the trim leaves CQE in a bad state, which corrupts the first read just after?
Quirks:
The following test fails on cards 2 and 3 with CQE enabled, but passes with CQE disabled:
$ sudo fstrim -a;ls
ls: error while loading shared libraries: etjmp: cannot open shared object file: No such file or directory
sometimes there is another error like ls: error while loading shared libraries: chk: cannot open shared object file: No such file or directory
CQE enabled: failure rate 10/10 on cards 2 and 3.
CQE disabled: success rate 10/10 on cards 2 and 3.
Card 1 passes the test with CQE enabled. With card 4 I can't enable CQE.
SD card info: 1:
/sys/bus/mmc/devices/mmc0:544c/fwrev
/sys/bus/mmc/devices/mmc0:544c/uevent
DRIVER=mmcblk
MMC_TYPE=SD
MMC_NAME=USD00
MODALIAS=mmc:block
/sys/bus/mmc/devices/mmc0:544c/cid
ad4c5355534430302170084b6aa18800
/sys/bus/mmc/devices/mmc0:544c/rca
0x544c
/sys/bus/mmc/devices/mmc0:544c/ext_power
01
/sys/bus/mmc/devices/mmc0:544c/csd
400e0032db790001d77b7f800a400000
/sys/bus/mmc/devices/mmc0:544c/manfid
0x0000ad
/sys/bus/mmc/devices/mmc0:544c/power/runtime_active_time
6263664
/sys/bus/mmc/devices/mmc0:544c/power/runtime_status
active
/sys/bus/mmc/devices/mmc0:544c/power/autosuspend_delay_ms
3000
/sys/bus/mmc/devices/mmc0:544c/power/runtime_suspended_time
0
/sys/bus/mmc/devices/mmc0:544c/power/control
on
/sys/bus/mmc/devices/mmc0:544c/ocr
0x00300000
/sys/bus/mmc/devices/mmc0:544c/preferred_erase_size
8388608
/sys/bus/mmc/devices/mmc0:544c/ext_perf
18
/sys/bus/mmc/devices/mmc0:544c/type
SD
/sys/bus/mmc/devices/mmc0:544c/hwrev
0x2
/sys/bus/mmc/devices/mmc0:544c/date
08/2024
/sys/bus/mmc/devices/mmc0:544c/dsr
0x404
/sys/bus/mmc/devices/mmc0:544c/erase_size
512
/sys/bus/mmc/devices/mmc0:544c/removable
removable
/sys/bus/mmc/devices/mmc0:544c/oemid
0x4c53
/sys/bus/mmc/devices/mmc0:544c/serial
0x70084b6a
/sys/bus/mmc/devices/mmc0:544c/ssr
00000000080000000400a00100fd3a1e000800000002fc0000000000000000000000000000000000000000000000000000000000000000000000000000000000
/sys/bus/mmc/devices/mmc0:544c/scr
0245848700000000
/sys/bus/mmc/devices/mmc0:544c/name
USD00
2:
/sys/bus/mmc/devices/mmc0:aaaa/fwrev
0x0
/sys/bus/mmc/devices/mmc0:aaaa/uevent
DRIVER=mmcblk
MMC_TYPE=SD
MMC_NAME=SF128
MODALIAS=mmc:block
/sys/bus/mmc/devices/mmc0:aaaa/cid
035344534631323880281ba4f0014a00
/sys/bus/mmc/devices/mmc0:aaaa/rca
0xaaaa
/sys/bus/mmc/devices/mmc0:aaaa/ext_power
07
/sys/bus/mmc/devices/mmc0:aaaa/csd
400e0032db790003b8ab7f800a404000
/sys/bus/mmc/devices/mmc0:aaaa/manfid
0x000003
/sys/bus/mmc/devices/mmc0:aaaa/power/runtime_active_time
626219
/sys/bus/mmc/devices/mmc0:aaaa/power/runtime_status
active
/sys/bus/mmc/devices/mmc0:aaaa/power/autosuspend_delay_ms
3000
/sys/bus/mmc/devices/mmc0:aaaa/power/runtime_suspended_time
0
/sys/bus/mmc/devices/mmc0:aaaa/power/control
on
/sys/bus/mmc/devices/mmc0:aaaa/ocr
0x00300000
/sys/bus/mmc/devices/mmc0:aaaa/preferred_erase_size
4194304
/sys/bus/mmc/devices/mmc0:aaaa/ext_perf
18
/sys/bus/mmc/devices/mmc0:aaaa/type
SD
/sys/bus/mmc/devices/mmc0:aaaa/hwrev
0x8
/sys/bus/mmc/devices/mmc0:aaaa/date
10/2020
/sys/bus/mmc/devices/mmc0:aaaa/dsr
0x404
/sys/bus/mmc/devices/mmc0:aaaa/erase_size
512
/sys/bus/mmc/devices/mmc0:aaaa/removable
removable
/sys/bus/mmc/devices/mmc0:aaaa/oemid
0x5344
/sys/bus/mmc/devices/mmc0:aaaa/serial
0x281ba4f0
/sys/bus/mmc/devices/mmc0:aaaa/ssr
0000000008000000040090000f05391e000800000002fc0003000000000000000000000000000000000000000000000000000000000000000000000000000000
/sys/bus/mmc/devices/mmc0:aaaa/scr
0245848700000000
/sys/bus/mmc/devices/mmc0:aaaa/name
SF128
3:
/sys/bus/mmc/devices/mmc0:aaaa/fwrev
0x0
/sys/bus/mmc/devices/mmc0:aaaa/uevent
DRIVER=mmcblk
MMC_TYPE=SD
MMC_NAME=SN64G
MODALIAS=mmc:block
/sys/bus/mmc/devices/mmc0:aaaa/cid
035344534e36344780abf1d45c013100
/sys/bus/mmc/devices/mmc0:aaaa/rca
0xaaaa
/sys/bus/mmc/devices/mmc0:aaaa/ext_power
07
/sys/bus/mmc/devices/mmc0:aaaa/csd
400e0032db790001dbd37f800a404000
/sys/bus/mmc/devices/mmc0:aaaa/manfid
0x000003
/sys/bus/mmc/devices/mmc0:aaaa/power/runtime_active_time
3154928
/sys/bus/mmc/devices/mmc0:aaaa/power/runtime_status
active
/sys/bus/mmc/devices/mmc0:aaaa/power/autosuspend_delay_ms
3000
/sys/bus/mmc/devices/mmc0:aaaa/power/runtime_suspended_time
0
/sys/bus/mmc/devices/mmc0:aaaa/power/control
on
/sys/bus/mmc/devices/mmc0:aaaa/ocr
0x00300000
/sys/bus/mmc/devices/mmc0:aaaa/preferred_erase_size
4194304
/sys/bus/mmc/devices/mmc0:aaaa/ext_perf
18
/sys/bus/mmc/devices/mmc0:aaaa/type
SD
/sys/bus/mmc/devices/mmc0:aaaa/hwrev
0x8
/sys/bus/mmc/devices/mmc0:aaaa/date
01/2019
/sys/bus/mmc/devices/mmc0:aaaa/dsr
0x404
/sys/bus/mmc/devices/mmc0:aaaa/erase_size
512
/sys/bus/mmc/devices/mmc0:aaaa/removable
removable
/sys/bus/mmc/devices/mmc0:aaaa/oemid
0x5344
/sys/bus/mmc/devices/mmc0:aaaa/serial
0xabf1d45c
/sys/bus/mmc/devices/mmc0:aaaa/ssr
0000000008000000040090000f05391e000800000002fc0003000000000000000000000000000000000000000000000000000000000000000000000000000000
/sys/bus/mmc/devices/mmc0:aaaa/scr
0245848700000000
/sys/bus/mmc/devices/mmc0:aaaa/name
SN64G
4:
/sys/bus/mmc/devices/mmc0:59b4/fwrev
0x0
/sys/bus/mmc/devices/mmc0:59b4/uevent
DRIVER=mmcblk
MMC_TYPE=SD
MMC_NAME=LX128
MODALIAS=mmc:block
/sys/bus/mmc/devices/mmc0:59b4/cid
9c534f4c5831323810298208ca013a00
/sys/bus/mmc/devices/mmc0:59b4/rca
0x59b4
/sys/bus/mmc/devices/mmc0:59b4/ext_power
00
/sys/bus/mmc/devices/mmc0:59b4/csd
400e0032db790003b14f7f800a400000
/sys/bus/mmc/devices/mmc0:59b4/manfid
0x00009c
/sys/bus/mmc/devices/mmc0:59b4/power/runtime_active_time
2027526
/sys/bus/mmc/devices/mmc0:59b4/power/runtime_status
active
/sys/bus/mmc/devices/mmc0:59b4/power/autosuspend_delay_ms
3000
/sys/bus/mmc/devices/mmc0:59b4/power/runtime_suspended_time
0
/sys/bus/mmc/devices/mmc0:59b4/power/control
on
/sys/bus/mmc/devices/mmc0:59b4/ocr
0x00300000
/sys/bus/mmc/devices/mmc0:59b4/preferred_erase_size
8388608
/sys/bus/mmc/devices/mmc0:59b4/ext_perf
08
/sys/bus/mmc/devices/mmc0:59b4/type
SD
/sys/bus/mmc/devices/mmc0:59b4/hwrev
0x1
/sys/bus/mmc/devices/mmc0:59b4/date
10/2019
/sys/bus/mmc/devices/mmc0:59b4/dsr
0x404
/sys/bus/mmc/devices/mmc0:59b4/erase_size
512
/sys/bus/mmc/devices/mmc0:59b4/removable
removable
/sys/bus/mmc/devices/mmc0:59b4/oemid
0x534f
/sys/bus/mmc/devices/mmc0:59b4/serial
0x298208ca
/sys/bus/mmc/devices/mmc0:59b4/ssr
00000000080000000400a00100fd3a1e000800000002fc0000000000000000000000000000000000000000000000000000000000000000000000000000000000
/sys/bus/mmc/devices/mmc0:59b4/scr
0245848700000000
/sys/bus/mmc/devices/mmc0:59b4/name
LX128
The mechanism by which fstrim issues discard ops has the same potential failure mode as fsync in the apt kernel (a hang with no obvious IO error). When trimming a large filesystem with many free extents, hundreds of trim commands of approximately the preferred erase size are issued back-to-back. Does the rpi-update version fix this?
No, unfortunately pulls/6418 does not fix the problem. I found out that the checksum of /usr/bin/ls actually changes. You can try any other command instead of ls; it does not matter much.
sudo fstrim -a;ls
$ b3sum /usr/bin/ls
a31d8bd1e4b68d94d38b3a8b73c3167753c0bf2084035fd7722008623086b38b /usr/bin/ls
$ b3sum /usr/bin/ls
049086e725258375365d68b2f1fde51acfc5f9b2b89504305bb372edf61931ad /usr/bin/ls
Actually, just the first 64kB of ls are different and they contain garbage. However, the very first 1.5kB are exactly the same. It is strange.
Cards 2&3 are Sandisk models that fall within the range of dates for cards with known-bad flush behaviour. So I would expect them to be unreliable anyway.
However, I have a card that is newer than the newest card with known-bad flush behaviour (a Sandisk Extreme Pro A2 64GB) which does not have the cache flush issue, but does get hosed if fstrim is invoked.
md5sum sd.test.file && sudo fstrim -v -a && md5sum sd.test.file
7bf1afb705be6f885946144deb881bef sd.test.file
/media/pi/bootfs: 407.7 MiB (427452416 bytes) trimmed on /dev/mmcblk0p1
/media/pi/rootfs: 41.8 GiB (44896120832 bytes) trimmed on /dev/mmcblk0p2
/boot/firmware: 360.2 MiB (377747456 bytes) trimmed on /dev/nvme0n1p1
/: 443.2 GiB (475885969408 bytes) trimmed on /dev/nvme0n1p2
7bf1afb705be6f885946144deb881bef sd.test.file
sd.test.file
/media/pi/bootfs: 407.4 MiB (427190272 bytes) tri[ 144.220728] EXT4-fs error (device mmcblk0p2): ext4_validate_block_bitmap:421: comm fstrim: bg 1: bad block bitmap checksum
mmed on /dev/mmcblk0p1
[ 144.245357] EXT4-fs (mmcblk0p2): Remounting filesystem read-only
/media/pi/rootfs: 0 B (0 bytes) trimmed on /dev/mmcblk0p2
/boot/firmware: 360.2 MiB (377747456 bytes) trimmed on /dev/nvme0n1p1
/: 0 B (0 bytes) trimmed on /dev/nvme0n1p2
md5sum: sd.test.file: Input/output error
In the garbled line where the kernel splat was printed to the console, I did echo 3 | sudo tee /proc/sys/vm/drop_caches - so the first read was probably ext4 metadata.
This is a separate failure mode from the OP, so please create a new issue.
Actually this is CQE related, but due to a clumsy interaction between the discard quirk and the spec.
CMD38 is supported in CQ mode, but only with a Discard op. The quirk (applied to all Sandisk cards) demotes Discard to Erase, which presumably results in undefined behaviour. If I check for the correct option and disable CQ otherwise, the Sandisk card no longer returns bogus data.
Pull request updated; the autobuilds will complete in about 45 minutes. Re-run rpi-update (delete /boot/firmware/.firmware_revision first).
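A sketch of that re-run; pulls/6418 is the build referenced earlier in the thread, so substitute whichever build is current:

```
# Force rpi-update to re-fetch the updated pull build
sudo rm /boot/firmware/.firmware_revision
sudo rpi-update pulls/6418
```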
I've tried version 15645c8e..., which gives kernel 6.6.56-v8-16k+ #1805 SMP PREEMPT Tue Oct 15 20:06:35 BST 2024, but the problem persists.
The current state of the art is sudo rpi-update pulls/6419.
That fixes all errors on cards 2 and 3. I tested all the cases. Good job, thanks!
@P33M thanks so much for your work! When can we expect this to arrive in a regular apt upgrade?
> CMD38 is supported in CQ mode, but only with a Discard op. The quirk (applied to all Sandisk cards) demotes Discard to Erase, which presumably results in undefined behaviour. If I check for the correct option and disable CQ otherwise, the Sandisk card no longer returns bogus data.
Oh, I've realized only now that the CQ on cards 2 and 3 is actually disabled now. The performance without CQ is lower than with CQ. For example card 2 gives 3800 with CQ and only 2000 without CQ as measured by pibenchmarks. Isn't there a way to keep CQ on and instead sanitize the trim call?
> Oh, I've realized only now that the CQ on cards 2 and 3 is actually disabled now. The performance without CQ is lower than with CQ. For example card 2 gives 3800 with CQ and only 2000 without CQ as measured by pibenchmarks. Isn't there a way to keep CQ on and instead sanitize the trim call?
You're conflating two things - issuing commands to cards in the wrong state (a software bug), and cards that do not respond correctly to commands in CQ mode even if the state is correct (buggy cards). I can fix the first, but I can't fix the second. So those cards don't get CQ enabled.
Would it be possible to flush CQ after trim for those cards? Or issue a fake read of 64kB after trim? (maybe 128kB to be sure).
I know these cards have at least one more issue - the one resulting in CQE recovery. But it seems that CQE recovery returns correct data, and it is much less frequent than the read-after-trim issue, which hits every time.
Describe the bug
I ran apt update && apt upgrade today. Since then, approximately every other reboot fails (50% chance) with the following error:
I was unfortunate enough to update 3 of my Raspberry Pi 5s, and every single one experiences this issue.
Steps to reproduce the behaviour
apt update && apt upgrade
Device(s)
Raspberry Pi 5
System
Logs
No response
Additional context
No response