raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/

Silent corruption writing files to network share over cifs, from Raspberry Pi, while using certain compressors #6335

Open pronoiac opened 2 months ago

pronoiac commented 2 months ago

Describe the bug

wrong location?

First off, I got this repo from the package description for the installed kernel. Apologies if I'm not in the right spot.

Short version

I was benchmarking some compressors on Debian on a Raspberry Pi, piping to and from a network share on a NAS, and found that some consistently had issues writing to my NAS. Specifically: lzop, pigz (parallel gzip), and pbzip2 (parallel bzip2). This seems dependent on kernel version: Debian 11, bullseye, kernel 6.1.21, was ok. Debian 12, bookworm, kernel versions 6.6.20 and 6.6.31, were impacted.

Compiling and running a mainline kernel 6.1.21 on bookworm avoided the issue. I don’t think Debian patches are at fault.

There's over a year between those kernel releases. Bisecting won’t be quick, but it is doable.

Steps to reproduce the behaviour

It looks like this, on a mounted network share:

cat 1tb-rust-ext4.img.tar.gz  | \
  gzip -d | \
  lzop -1 > \
  1tb-rust-ext4.img.tar.lzop
# wait 40 minutes

cat 1tb-rust-ext4.img.tar.lzop | \
  lzop -d | \
  sha1sum
# it crashes, due to a corrupt file
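A faster check than waiting for decompression to crash is to hash the compressed stream on its way to the share and compare it with a hash of the file read back. The following is only a sketch: `write_and_verify` and the `COMPRESSOR` override are hypothetical names, not part of the original report, and the read-back may be served from the client's page cache rather than from the NAS.

```shell
#!/bin/sh
# Sketch: compress a file onto the share and check that the bytes stored
# there match the bytes the compressor produced.
# COMPRESSOR is a hypothetical override; it defaults to "lzop -1" as in the report.

write_and_verify() {
    src=$1     # input file
    dest=$2    # output file on the mounted share
    compressor=${COMPRESSOR:-lzop -1}

    # tee writes the compressed stream to the share while sha1sum hashes
    # exactly the same bytes on their way out.
    sent=$($compressor < "$src" | tee "$dest" | sha1sum | cut -d' ' -f1)

    # Hash what the share hands back. This read may come from the client's
    # page cache, so remounting the share first is a stricter test.
    stored=$(sha1sum < "$dest" | cut -d' ' -f1)

    if [ "$sent" = "$stored" ]; then
        echo "OK $sent"
    else
        echo "MISMATCH sent=$sent stored=$stored"
        return 1
    fi
}
```

For example, `write_and_verify input.img /mnt/share/input.img.lzop` (both paths are placeholders) prints `OK` with the hash when the two agree, and returns non-zero on a mismatch.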

Device(s)

Raspberry Pi 4 Mod. B

System

OS & version:

Raspberry Pi reference 2024-07-04
Generated using pi-gen, https://github.com/RPi-Distro/pi-gen, 48efb5fc5485fafdc9de8ad481eb5c09e1182656, stage4

Firmware version:

Can't open device file: /dev/vcio
Try creating a device file with: sudo mknod /dev/vcio c 100 0

(That device file didn't help)

Kernel version:

Linux pillions 6.6.31+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.31-1+rpt1 (2024-05-29) aarch64 GNU/Linux

Logs

No response

Additional context

More details

The Pi and NAS are directly connected by Gigabit Ethernet, and both sides use self-assigned IP addresses. The files in question are file systems, about 270 GB. Compression appears to complete without complaint; decompression crashes the process, usually within the first gigabyte of the compressed file, so the compressed files look corrupt. Decompressing while compression is still running gets further along than decompressing after compression finishes, which might point toward something with writes and caches. This is a Raspberry Pi 4 with 4 GiB RAM.
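If a known-good copy of a compressed file is available (for instance one written to local storage, which is an assumption and not something the report describes), the position of the first corrupted byte can be found with `cmp`. A minimal sketch, with `first_diff_offset` as a hypothetical helper name:

```shell
#!/bin/sh
# Sketch: print the 1-based offset of the first byte that differs between
# a known-good file and the copy read back from the share.

first_diff_offset() {
    # cmp -l lists every differing byte as "offset octal octal";
    # the first field of the first line is the earliest difference.
    # Nothing is printed when the files match, or when one file is
    # merely a truncated prefix of the other.
    cmp -l "$1" "$2" 2>/dev/null | head -n 1 | awk '{print $1}'
}
```

Comparing the offsets from several runs would show whether the corruption always starts at the same place or moves around.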

Wrong location, more details

I reported the issue to Debian, and they closed it:

Debian does not ship 1:6.6.31-1+rpt1 version.

My impression:

Giving a heads up to the most likely impacted people makes sense -

6by9 commented 2 months ago

I vaguely recall an issue with CIFS in mainline not so long back - a fix had been backported in mainline erroneously.

We do build more recent kernels than are packaged into apt. On Raspberry Pi OS you can use sudo rpi-update to get the latest build of the current LTS branch (6.6 at the moment), or use e.g. sudo rpi-update rpi-6.10.y to grab the 6.10 kernel. It would be useful if you could tell us whether the issue is still present on the latest 6.6 branch and on 6.10.

Please be aware that there is a low-but-non-zero risk of regressions in taking these builds, so please test on a non-critical system, or at least back up first. Having a backup copy of the /boot/firmware/kernel*.img files to restore is generally sufficient, as rpi-update does not delete the old modules.
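The backup step described above could be sketched as follows. `backup_kernels` and the backup directory are hypothetical; only the /boot/firmware/kernel*.img pattern comes from the comment.

```shell
#!/bin/sh
# Sketch: copy the kernel images out of the boot directory before updating.
# Usage: backup_kernels BOOT_DIR BACKUP_DIR

backup_kernels() {
    boot_dir=$1      # e.g. /boot/firmware
    backup_dir=$2    # any directory with enough free space (placeholder)

    mkdir -p "$backup_dir" || return 1

    count=0
    for img in "$boot_dir"/kernel*.img; do
        # Skip the literal pattern if nothing matched.
        [ -e "$img" ] || continue
        # -p keeps timestamps, which helps tell builds apart later.
        cp -p "$img" "$backup_dir/" || return 1
        count=$((count + 1))
    done

    echo "backed up $count kernel image(s) to $backup_dir"
}
```

Something like `sudo sh -c '. ./backup.sh; backup_kernels /boot/firmware /root/kernel-backup'` would run it, assuming the function is saved in a file named backup.sh; restoring means copying the images back.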

NB These CI builds are only available for 90 days after the last update on that branch, so generally it's only the LTS branch (6.6), the latest released branch (6.10), and the prepatch branch (6.11) that will be available.

pronoiac commented 2 months ago

From rpi-update:

6by9 commented 2 months ago

Interesting that it appears to be something that was broken by 6.6 and now fixed, but not backported.

If you're happy rebuilding the kernel, identifying whether the rpi-6.7.y, rpi-6.8.y, and rpi-6.9.y branches are good or not would be very useful. Unfortunately the CI build artifacts are likely to have expired for those branches, so it needs to be manual builds.

Sorry to ask you to do the investigative work, but you have a system setup that you can get to fail.

pelwell commented 2 months ago

I've forced rebuilds of rpi-6.7.y, rpi-6.8.y and rpi-6.9.y. Wait about 45 minutes then try sudo rpi-update rpi-6.7.y etc.

pelwell commented 2 months ago

(You can see the in-progress builds here: https://github.com/raspberrypi/linux/actions?query=is%3Ain_progress)

pelwell commented 2 months ago

They should be ready now.

pronoiac commented 2 months ago

My Internet connection's misbehaving today, but I will investigate when I can.

pronoiac commented 2 months ago

Possibly of note: the issue might go as far back as v6.3. Those builds are very helpful; building on my Pi takes about two hours.

pronoiac commented 2 months ago

I re-ran 6.8.12 - after the new eeprom - and it didn't work.

pronoiac commented 2 months ago

I've been looking for the fix for 6.10; I'm bisecting into its rc1.

pronoiac commented 2 months ago

Reading the rpi-update page (edit: new repo), it looks like it can pull in bleeding edge firmware, with risk of regressions. I intended to use it to pull in kernel 6.6.50, but then checking some kernels I'd built, I'm seeing breakage where it worked before.

Any suggestions?

popcornmix commented 2 months ago

Reading the rpi-update page

Check the first line of the readme.

I intended to use it to pull in kernel 6.6.50, but then checking some kernels I'd built, I'm seeing breakage where it worked before. Any suggestions?

Not based on what you've posted. If you post exactly what you did, and exactly what the breakage was it's possible there will be suggestions.

pronoiac commented 2 months ago

I updated the link, in case you were thinking that it pointed to the deprecated rpi-update repo.

What I did:

Vaguely, some options I see:

popcornmix commented 2 months ago

I'm still not following which cases are which in "I'm seeing breakage where it worked before."

Is the breakage here the "Silent corruption writing files to network share over cifs" or something else? Are you saying rpi-update kernel behaves the same or differently to your self built one?

pronoiac commented 2 months ago

The network share breakage manifests as lzop failing to decompress, and that works, or doesn't, depending on the Linux kernel version. I've attempted bisection of the Linux kernel. rpi-update appears to change something in addition to the Linux kernel version, so that a kernel I'd tested, will stop working.

popcornmix commented 2 months ago

rpi-update may update bootloader and/or firmware (start.elf). There are options to disable that.
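Since several components can change at once, it may help to record their versions next to each test result. This is a sketch only: `snapshot_versions` is a hypothetical helper, and `vcgencmd` needs a working /dev/vcio, which the original report shows was not available on the reporter's system.

```shell
#!/bin/sh
# Sketch: print the kernel, firmware and bootloader versions, one per line,
# so each test result can be tied to the exact combination it ran on.

snapshot_versions() {
    echo "kernel: $(uname -r)"
    if command -v vcgencmd >/dev/null 2>&1; then
        # Flatten the multi-line vcgencmd output onto a single line.
        echo "firmware: $(vcgencmd version 2>&1 | tr '\n' ' ')"
        echo "bootloader: $(vcgencmd bootloader_version 2>&1 | tr '\n' ' ')"
    else
        echo "firmware: unavailable (vcgencmd not found)"
        echo "bootloader: unavailable (vcgencmd not found)"
    fi
}
```

Appending the output to a log before each run, e.g. `snapshot_versions >> results.log`, makes it visible when a "same kernel" retest actually ran with different firmware or bootloader.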

pelwell commented 2 months ago

so that a kernel I'd tested, will stop working.

Stop working in what way? Try to be less vague.

pronoiac commented 2 months ago

Stop working in what way? Try to be less vague.

When I re-run the compression and decompression on a kernel where they worked before, the decompression now fails because the file is corrupted.

pelwell commented 2 months ago

What you are describing sounds a lot like a random/timing-related issue, which would make testing challenging.
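If the failure is intermittent, a single passing run says little about a kernel. One way to get a steadier verdict is to repeat the same test several times and count failures. A minimal sketch, with `repeat_test` as a hypothetical helper:

```shell
#!/bin/sh
# Sketch: run a command N times and report how many runs failed.
# Usage: repeat_test N command [args...]

repeat_test() {
    runs=$1
    shift

    failures=0
    i=1
    while [ "$i" -le "$runs" ]; do
        # Any non-zero exit status from the command counts as a failure.
        if ! "$@"; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done

    echo "$failures of $runs runs failed"
    # Succeed only if every run passed.
    [ "$failures" -eq 0 ]
}
```

For example, `repeat_test 5 ./roundtrip.sh`, where roundtrip.sh is a hypothetical script that compresses to the share, decompresses, and exits non-zero on corruption. A kernel would then only be marked good when all runs pass.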