gnubee-git / GnuBee_Docs


Reports of data corruption. #78

Closed · neilbrown closed this issue 6 years ago

neilbrown commented 6 years ago

I've seen a few reports of data corruption on the GnuBee, but not experienced it myself. If anyone does experience data corruption, please report details here. Include:

- kernel version
- CPU clock speed
- storage devices in use
- filesystem(s)
- any relevant dmesg output
- a description of the corruption

I'm unlikely to spend much time looking into problems unless they are reported against mainline (4.15 or later). However having reports against older kernels might still help complete the picture, and if there is a problem in mainline to fix, the fix will quite possibly apply to older kernels too.

neilbrown commented 6 years ago

Kernel: Linux gnubee-pc1.gnubee 4.15.17
CPU clock: 900000000
Storage: 2 different SATA devices in slots 0 and 2
Filesystem: xfs and ext4

I used dd to copy /dev/mmcblk0 (which was active as my root filesystem) to a file on sda1, then used dd to copy that file to a file on /dev/sdb1. When I run md5sum on these two 2Gig files (which should be identical) I mostly get different results - the files differ from each other, and the sums vary from run to run. When I "cmp -l" them there are assorted differences, but they always come in aligned groups of 32 bytes.

32 bytes is the cache-line size - I cannot think of any other way that 32 could be relevant. This suggests that when data is being DMAed from the device to memory, the corresponding range is not purged from the cache - at least sometimes. Elsewhere @neheb observed that the problem might be specific to MIPS. This observation seems to support that.
I'll keep looking.

xvybihal commented 6 years ago

I also did a small test, which did not result in data corruption. Anyway, I will summarize what I did, so some more data are available.

Kernel: Linux 4.15.17+ ([zcat /proc/config.gz](https://gist.github.com/xvybihal/b984e4d2ba8da753cb46ba9bb8348c62)) from @neilbrown's git branch 4.15
CPU clock: spi-mt7621 1e000b00.spi: sys_freq: 90000000

My laptop: 3aa1393f366c0faec5ec38ede16a8094 /data/iso/Fedora-Workstation-Live-x86_64-27-1.6.iso

I have an NFS server on the gnubee, and have exported /mnt.

/mnt has a btrfs filesystem mounted, created from three drives:

gnubee ~ # btrfs fi show
Label: 'DATATEST'  uuid: 80c43de4-0a3b-436a-bf89-6bf6b5c761b9
    Total devices 3 FS bytes used 32.97GiB
    devid    1 size 111.79GiB used 32.00MiB path /dev/sda
    devid    2 size 111.79GiB used 1.00GiB path /dev/sdb
    devid    3 size 298.09GiB used 35.03GiB path /dev/sdc

On Laptop: Mounted nfs share: 172.16.202.254:/mnt on /mnt/nfsjv type nfs (rw,relatime,vers=3,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=172.16.202.254,mountvers=3,mountport=37324,mountproto=udp,local_lock=none,addr=172.16.202.254)

Copied the file to the gnubee: rsync -av /data/iso/Fedora-Workstation-Live-x86_64-27-1.6.iso /mnt/nfsjv/
The copy was slow (3,817,503.53 bytes/sec); not sure why.

On GnuBee:

# md5sum /mnt/Fedora-Workstation-Live-x86_64-27-1.6.iso 
3aa1393f366c0faec5ec38ede16a8094  Fedora-Workstation-Live-x86_64-27-1.6.iso

Now I attached another drive to the USB port and mounted it: /dev/sdd1 on /data type ext4 (rw,relatime,data=ordered)

# cp /mnt/Fedora-Workstation-Live-x86_64-27-1.6.iso /data/
# md5sum /data/Fedora-Workstation-Live-x86_64-27-1.6.iso 
3aa1393f366c0faec5ec38ede16a8094  /data/Fedora-Workstation-Live-x86_64-27-1.6.iso

neheb commented 6 years ago

Kernel version - any kernel 4.9 and after. I have not tested the non-LTS kernels to pin down where the problem was introduced. Kernel 4.4 shows no errors.

Clock speed - 880 MHz or 900 MHz; it does not matter. Neither does the RAM speed.

Storage - I have tested SATA and USB. Other MIPS devices like the Archer C7 are also affected.

Filesystem - btrfs and ext4 have been tested.

Dmesg - constant btrfs spam about wrong csums, usually after about 17 hours of uptime.

Corruption - tested with torrents. After 3 days of uptime, a torrent re-verified from 100% down to 91% on ext4, and from 100% to 97% on btrfs. The difference is most likely because btrfs tries to correct for silent data corruption.

Now here's the interesting part: remember the ext4 result? It goes back to 100% after a reboot. I don't remember if this was the same with btrfs; its data-corruption mitigations may have left it at 99%.

neilbrown commented 6 years ago

Thanks for the data points. The slow writes over NFS might be explained if you are using a module for btrfs or nfsd. Modules seem to run very slowly; I don't know why yet. The improvement after reboot suggests that the problem is introduced on reads: maybe the on-disk data is correct, but when you read it, you sometimes get errors.

I've tried disabling CONFIG_HIGHMEM, but that didn't make any difference. Then I booted with nr-cpus=1 and now I don't see any corruption - it was easy to reproduce before. If people want to try this themselves, I think you need to build your own kernel. The way I did it was to edit arch/mips/boot/dts/ralink/gbpc1.dts and change the bootargs line in the chosen section to

bootargs = "console=ttyS0,57600 nr-cpus=1";

My guess is that the cache invalidation that happens before and after a DMA from the device is only invalidating the local cache - not the cache on other CPUs.

Adirelle commented 6 years ago

Just reporting everything is fine there, using Linux version 4.15.17+ (guillaume@localhost) (gcc version 7.2.0 (GCC)) #11 SMP Mon Apr 16 19:10:25 CEST 2018.

I have copied the first 2Gb of my root partition (/dev/sda) into an ext4 partition on an LVM mirrored logical volume, mounted on /srv. Then I copied the file again into the same partition. For now, sha1sum'ing them does not report any difference. The server is an NFSv4 server and also runs mysqld, transmission-daemon, and hosts a seafile data directory.

neilbrown commented 6 years ago

It's sad that we are getting as many reports of "no corruption" as of "yes, corruption" - I hate intermittent bugs.

After my first successes at getting corruption I tried some other kernels and couldn't get any corruption. I then went back to the original kernel and ... still no corruption. So now I doubt whether my tests with nr-cpus=1 were meaningful.

I think I read somewhere (cannot find it now) that someone thought it happened more on small machines, so I added mem=128M to my bootargs (mem=64M didn't work) and I saw corruption on 2 of my first 3 tests!!! But also only 2 of my first 10 :-( Then I did echo 3 > /proc/sys/vm/drop_caches and tried again ... and immediately got corruption.

So I set nr-cpus=1 and tried again, dropping the caches every time. With nr-cpus=1 I got zero corruptions in 20 tests. With nr-cpus=4 I got 7 corruptions in 20 tests. So it looks like mem=128M plus dropping caches makes this repeatable enough. Now to see if I can work out how to flush the caches on all CPUs properly...
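The copy-and-checksum loop described here is easy to script. Below is a minimal harness (function names are mine, not from the thread) that copies a file repeatedly, optionally dropping the page cache before each copy as above, and counts how many copies come back with a different MD5 than the source:

```python
# Sketch of the repro loop: N copies of a file, checksum each copy.
import hashlib
import os
import shutil

def md5(path, bufsize=1 << 20):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(bufsize), b""):
            h.update(chunk)
    return h.hexdigest()

def count_corrupt_copies(src, dst, runs=20, drop_caches=False):
    expected = md5(src)
    corrupt = 0
    for _ in range(runs):
        if drop_caches:
            try:
                # Equivalent of `echo 3 > /proc/sys/vm/drop_caches`.
                # Needs root; harmless to skip where unavailable.
                with open("/proc/sys/vm/drop_caches", "w") as f:
                    f.write("3\n")
            except OSError:
                pass
        shutil.copyfile(src, dst)
        if md5(dst) != expected:
            corrupt += 1
        os.remove(dst)
    return corrupt
```

On a healthy machine this returns 0; on an affected GnuBee with mem=128M and drop_caches enabled, the reports above suggest it would return a non-zero count within a couple of dozen runs.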

neheb commented 6 years ago

Sounds like my experience as well. Memory size was a hunch I had, but I did not do enough testing. My feeling is that some memory-mapping structure is overflowing, making it easier to reproduce with smaller memory sizes (I'm not a programmer, so this description is probably wrong).

It's interesting that with 1 CPU you get no corruption. I see it on ar71xx as well, which is all single-CPU; maybe that's a separate issue with that platform.

If it helps, the regression was introduced somewhere between kernel 4.5 and 4.9 (inclusive). I tried looking under arch/mips/mm but found nothing useful. Not that I know enough about this to figure it out, or whether that's even the right location...

neilbrown commented 6 years ago

If it helps, regression was introduced in kernel 4.5 ...

Given how hard it is to replicate, I wouldn't be surprised if the bug was present before 4.5 but for some reason didn't manifest. My current theory is broken hardware!!! The MIPS cache coherency module (CM) is supposed to propagate cache invalidation requests from one CPU core to the other, but I have fairly strong evidence that this isn't happening. Maybe it is mis-configured, but I haven't found evidence of that yet. I have an ugly hack which sends an IPI (inter-processor interrupt) to the other core to tell it to invalidate its cache too, and when that is active the corruption goes away (0 out of 200 tests). I haven't found really good documentation for the CM yet, so it's hard to be sure.

neheb commented 6 years ago

I've had 48 days uptime with 4.4 on my GnuBee with no data corruption of any kind. There's no way the hardware is broken.

There's also no way the ar71xx routers are broken. Granted, I've only tested with a single one.

Anyway, really good work on this. I would not have expected this to be a multi-CPU issue. I recall John saying that it's "weird" on mt7621. I should probably ask him what he meant...

neilbrown commented 6 years ago

Bingo - it all makes sense now. Thanks for being adamant that 4.4 really didn't have the bug. I went back and looked at that code again, now that I have a fairly good idea how it all works, and immediately saw the important difference.

The MIPS CM (coherence module) ensures that DMA operations are coherent with respect to the L2 cache, but not the per-core L1 cache (unless you have the optional IOCU, which we don't). So when the CPU initiates DMA it needs to explicitly write back and invalidate the relevant parts of the cache - on all cores. The code takes one of two different approaches:

If the memory range is less than the dcache size (32K), it tells the cache "invalidate any cache line for this address", and does that for every address in the range (well, for every 32nd address, as the cache line is 32 bytes). This invalidate is a "HIT-type" operation, and the CM "globalizes" it - propagates it to all cores.

If the memory range is 32K or more, a (supposedly) more efficient process is used: the CPU tells the cache "invalidate this line in the cache", giving the index of the cache line within the cache. It does this for every cache line (1024 of them), so the whole cache is invalidated. This is an "INDEX-type" operation, and the CM does not globalize these - that wouldn't make sense. So if you do a DMA operation of 32K or more, the caches on the other cores aren't flushed properly.

This is why I got more corruption after I used drop_caches to flush memory: with all memory unallocated, large contiguous allocations were more likely when reading the file, so larger contiguous DMA was more likely.

Before Linux 4.8 (and particularly before https://github.com/torvalds/linux/commit/c00ab4896ed5f7d89af6f90b809e2c0197c6d170) there was an extra test before using the index-based whole-cache flush: it was never used if CONFIG_MIPS_CMP or CONFIG_MIPS_CPS were defined (CPS might be an alternate name for CM - I'm not sure). Apparently all index operations were unreliable in some very early version of CPS, so they were disabled. The mentioned commit removed this test because, apparently, those early CPS systems never left the lab. That may be true, but even if index operations are reliable, they definitely don't propagate over the CM.

I've pushed out an update to linux/gnubee/v4.15 which reverts this patch. It doesn't appear to have data corruption problems. The only way I can see this bug relating to uni-core ar71xx machines is if the index-type operations don't work on the cache at all in that design, and that seems particularly unlikely.
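The mechanism described above can be illustrated with a toy model (this is my own sketch, not kernel code): two per-core L1 caches, a device DMA-writing straight to memory, and the two invalidation strategies, where HIT-type invalidates are broadcast to every core but INDEX-type invalidates wipe only the issuing core's cache:

```python
# Toy model of the MT7621 bug: stale L1 lines survive a large DMA
# because the whole-cache (INDEX-type) flush is not globalized.
LINE = 32            # cache line size in bytes
DCACHE = 32 * 1024   # per-core L1 dcache size (1024 lines)

class Core:
    def __init__(self):
        self.cache = {}                      # line number -> bytes

    def read(self, mem, addr, n):
        """Read n bytes, going through this core's L1 cache."""
        out = bytearray()
        for a in range(addr, addr + n):
            line = a // LINE
            if line not in self.cache:       # miss: fill from memory
                self.cache[line] = bytes(mem[line * LINE:(line + 1) * LINE])
            out.append(self.cache[line][a % LINE])
        return bytes(out)

def dma_write(mem, cores, issuer, addr, data):
    mem[addr:addr + len(data)] = data        # device writes straight to RAM
    if len(data) < DCACHE:
        # HIT-type: per-address invalidate, propagated by the CM to ALL cores
        for c in cores:
            for line in range(addr // LINE, (addr + len(data) - 1) // LINE + 1):
                c.cache.pop(line, None)
    else:
        # INDEX-type: flush the whole cache, but ONLY on the issuing core
        cores[issuer].cache.clear()

mem = bytearray(64 * 1024)
cores = [Core(), Core()]
cores[1].read(mem, 0, 64)                    # core 1 caches the old zeros
dma_write(mem, cores, issuer=0, addr=0, data=b"\xff" * DCACHE)
stale = cores[1].read(mem, 0, 64)            # still sees zeros: corruption
```

In this model any DMA of 32K or more leaves stale lines in the other core's cache (`stale` comes back as zeros even though memory now holds 0xff), while a sub-32K DMA, which takes the HIT-type path, is seen correctly by both cores.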

I won't close this issue until a fix has been accepted upstream.

neheb commented 6 years ago

It seems the ar71xx issue is indeed different. Something to do with mmap being out of sync. An issue that was also introduced in kernel 4.9...

In any case, the pending kernel patch seems to be working well.

neilbrown commented 6 years ago

Thanks for the confirmation. The mips patch that fixes it hasn't landed upstream yet, not even in linux-next. I've asked about the status.


neilbrown commented 6 years ago

Patch is queued for mainline here: https://git.kernel.org/pub/scm/linux/kernel/git/jhogan/mips.git/commit/?h=mips-fixes&id=95cdba856f5e2c25b776eac31d8051e127db5eb6 So it should get into mainline soon. I think that is enough to close this issue.