raspberrypi / linux

Kernel source tree for Raspberry Pi-provided kernel builds. Issues unrelated to the linux kernel should be posted on the community forum at https://forums.raspberrypi.com/
Other
11k stars 4.95k forks source link

Network driver on RPi3 B Plus causing hung tasks when working on an NFS mount #2482

Closed graysky2 closed 6 years ago

graysky2 commented 6 years ago

Platform/Distro: RPi 3B+ running Arch ARM (armv7h).
Kernel version: 4.14.31 (https://github.com/raspberrypi/linux/commit/b36f4e9e198477803d29861e02d3ea00fe5e09ab) Firmware version: latest as I write this (https://github.com/raspberrypi/firmware/commit/c14a90333c13f507ab219d583b74a998ec11a6e7)

Bug: Frequent kernel oops due to blocked tasks when writing files to NFS mount.

Details: When compiling, dmesg is full of kernel oops like the below when doing so on an NFS mount. Compiling to the micro SD card is fine. I believe that the software (disto) on the micro SD card is NOT to blame... if I put the same micro SD card into a RPi3 or RPi2, I can compile without error.

Again, I am using an NFS mounted partition (/scratch) on which to compile, so I'm hypothesizing that these problems are related to the network driver.

...
[ 2455.534291] INFO: task ld:24879 blocked for more than 120 seconds.
[ 2455.538489]       Tainted: G         C      4.14.31-1-ARCH #1
[ 2455.542688] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2455.550990] ld              D    0 24879  24804 0x00000000
[ 2455.555379] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 2455.559662] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 2455.563990] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 2455.572326] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 2455.580865] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 2455.589272] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 2455.597837] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 2455.606295] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 2455.610675] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 2455.614999] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 2547.695051] nfs: server ease not responding, still trying
[ 2548.735626] nfs: server ease not responding, still trying
[ 2548.768826] nfs: server ease OK
[ 2548.796748] nfs: server ease OK
[ 2701.296329] INFO: task ld:24879 blocked for more than 120 seconds.
[ 2701.300214]       Tainted: G         C      4.14.31-1-ARCH #1
[ 2701.304061] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2701.311642] ld              D    0 24879  24804 0x00000000
[ 2701.315536] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 2701.319458] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 2701.323355] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 2701.330878] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 2701.338447] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 2701.345916] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 2701.353469] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 2701.360953] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 2701.364740] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 2701.368593] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 2772.976750] nfs: server ease not responding, still trying
[ 2774.331264] nfs: server ease OK
[ 2947.057892] INFO: task ld:24879 blocked for more than 120 seconds.
[ 2947.061907]       Tainted: G         C      4.14.31-1-ARCH #1
[ 2947.066031] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2947.074107] ld              D    0 24879  24804 0x00000000
[ 2947.078244] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 2947.081483] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 2947.084348] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 2947.090033] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 2947.095898] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 2947.101751] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 2947.107513] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 2947.113352] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 2947.116350] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 2947.119289] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 2998.258064] nfs: server ease not responding, still trying
[ 2999.352463] nfs: server ease OK
[ 3192.819075] INFO: task ld:24879 blocked for more than 120 seconds.
[ 3192.823185]       Tainted: G         C      4.14.31-1-ARCH #1
[ 3192.827330] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3192.835447] ld              D    0 24879  24804 0x00000000
[ 3192.839604] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 3192.842832] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 3192.845750] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 3192.851476] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 3192.857318] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 3192.863126] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 3192.868837] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 3192.874594] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 3192.877558] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 3192.880466] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 3223.539141] nfs: server ease not responding, still trying
[ 3224.579687] nfs: server ease not responding, still trying
[ 3224.612015] nfs: server ease OK
[ 3224.626000] nfs: server ease OK
[ 3438.580109] INFO: task objcopy:24916 blocked for more than 120 seconds.
[ 3438.583905]       Tainted: G         C      4.14.31-1-ARCH #1
[ 3438.587697] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3438.595231] objcopy         D    0 24916  24912 0x00000000
[ 3438.599109] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 3438.603019] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 3438.606896] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 3438.614435] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 3438.622018] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 3438.629666] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 3438.637259] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 3438.644894] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 3438.648704] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 3438.652599] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 3448.820081] nfs: server ease not responding, still trying
[ 3450.148878] nfs: server ease OK
[ 3674.100906] nfs: server ease not responding, still trying
[ 3675.141506] nfs: server ease not responding, still trying
[ 3675.174279] nfs: server ease OK
[ 3675.202048] nfs: server ease OK
[ 3807.221430] INFO: task objcopy:24916 blocked for more than 120 seconds.
[ 3807.225253]       Tainted: G         C      4.14.31-1-ARCH #1
[ 3807.229007] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3807.236459] objcopy         D    0 24916  24912 0x00000000
[ 3807.240428] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 3807.244393] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 3807.248202] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 3807.255540] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 3807.263030] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 3807.270494] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 3807.277992] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 3807.285364] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 3807.289292] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 3807.293169] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 3899.381659] nfs: server ease not responding, still trying
[ 3900.422241] nfs: server ease not responding, still trying
[ 3900.461112] nfs: server ease OK
[ 3900.474540] nfs: server ease OK
[ 4011.372575] nf_conntrack: default automatic helper assignment has been turned off for security reasons and CT-based  firewall rule not found. Use the iptables CT target to attach helpers instead.
[ 4052.982250] INFO: task as:25088 blocked for more than 120 seconds.
[ 4052.986324]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4052.990389] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4052.998504] as              D    0 25088  25086 0x00000000
[ 4053.002785] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4053.006065] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4053.008960] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4053.014564] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4053.020330] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 4053.026110] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 4053.031705] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 4053.037527] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 4053.040507] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 4053.043431] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4134.902727] nfs: server ease not responding, still trying
[ 4135.997194] nfs: server ease OK
[ 4529.145918] nfs: server ease not responding, still trying
[ 4529.145923] nfs: server ease not responding, still trying
[ 4529.145940] nfs: server ease not responding, still trying
[ 4529.145978] nfs: server ease not responding, still trying
[ 4529.146011] nfs: server ease not responding, still trying
[ 4529.146028] nfs: server ease not responding, still trying
[ 4529.146044] nfs: server ease not responding, still trying
[ 4538.105971] nfs: server ease not responding, still trying
[ 4538.109131] nfs: server ease not responding, still trying
[ 4544.506128] INFO: task gcc:2854 blocked for more than 120 seconds.
[ 4544.509193]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4544.512157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4544.517957] gcc             D    0  2854   2852 0x00000000
[ 4544.520871] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4544.523830] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4544.526762] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4544.530980] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4544.534883] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 4544.538880] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 4544.542873] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 4544.546949] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 4544.549173] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 4544.551445] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4571.406855] nfs: server ease OK
[ 4571.406996] nfs: server ease OK
[ 4571.407031] nfs: server ease OK
[ 4571.407691] nfs: server ease OK
[ 4571.407701] nfs: server ease OK
[ 4571.410844] nfs: server ease OK
[ 4571.410877] nfs: server ease OK
[ 4571.411761] nfs: server ease OK
[ 4571.411810] nfs: server ease OK
[ 4790.267644] INFO: task ld:7630 blocked for more than 120 seconds.
[ 4790.270597]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4790.273588] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4790.279563] ld              D    0  7630   7628 0x00000000
[ 4790.282558] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4790.285531] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4790.288488] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4790.294136] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4790.299855] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 4790.305556] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 4790.311366] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 4790.317112] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 4790.320380] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 4790.323699] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4790.330500] INFO: task ld:7636 blocked for more than 120 seconds.
[ 4790.334181]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4790.338097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4790.346223] ld              D    0  7636   7633 0x00000000
[ 4790.350304] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4790.354463] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4790.358593] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4790.366494] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4790.374744] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 4790.383021] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 4790.391236] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 4790.399371] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 4790.403607] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 4790.407831] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
graysky2 commented 6 years ago

An easy way to trigger this bug (if you don't want to try compiling the kernel package) is to simply use dd to write out from /dev/zero to the NFS mount. For example on my RPi3 B+:

# mount ease:/scratch /scratch-nfs
% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress
964689920 bytes (965 MB, 920 MiB) copied, 149 s, 6.5 MB/s

<<< it froze up after about 965 MB written >>>
<<< In dmesg I get another server not responding error >>>

[ 5112.824818] nfs: server ease not responding, still trying
[ 5149.707808] nfs: server ease OK

Now, if I swap out the micro SD and boot into a RPi 2 I have lying around, same network cable, same power supply, and repeat the commands, everything works as expected. I think that helps to rule out the NFS server, network hardware etc. as potentially to blame.

# mount ease:/scratch /scratch-nfs
% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 346 s, 12.1 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 357.595 s, 11.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress  0.00s user 24.47s system 5% cpu 8:03.99 total
pelwell commented 6 years ago

Does disabling Energy Efficient Ethernet make a difference? Add dtparam=eee=off to config.txt and reboot.

But before trying that you can confirm whether EEE is active using ethtool --show-eee eth0.

graysky2 commented 6 years ago

Great suggestion, @pelwell. I got some very encouraging results using the dd test which floods the I/O with a steady stream of data. It "passed" meaning no timeouts writing and no server not responding messages via dmesg. I am compiling the same package that consistently gives the errors now and will post back with those results.

Before:

# ethtool --show-eee eth0
EEE Settings for eth0:
    EEE status: enabled - active
...

After:

# ethtool --show-eee eth0
EEE Settings for eth0:
    EEE status: disabled
...

The test with dd:

# mount ease:/scratch /scratch-nfs

% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress && rm fill
4169138176 bytes (4.2 GB, 3.9 GiB) copied, 97 s, 42.9 MB/s 
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 100.665 s, 41.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress  0.00s user 13.79s system 13% cpu 1:40.68 total

% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress && rm fill
8380219392 bytes (8.4 GB, 7.8 GiB) copied, 198 s, 42.3 MB/s
2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 201.245 s, 41.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress  0.00s user 27.98s system 13% cpu 3:21.25 total

% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress && rm fill
8380219392 bytes (8.4 GB, 7.8 GiB) copied, 198 s, 42.3 MB/s
2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 201.052 s, 41.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress  0.00s user 28.23s system 13% cpu 3:22.19 total

Unfortunately, when compiling which as you can reconize, writes out data must less frequently than dd does, I am experiencing the same errors:

[ 3315.685473] INFO: task gzip:29769 blocked for more than 120 seconds.
[ 3315.685636]       Tainted: G         C      4.14.31-1-ARCH #1
[ 3315.685767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3315.685955] gzip            D    0 29769  29767 0x00000000
[ 3315.686127] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 3315.686299] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 3315.686473] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 3315.686663] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 3315.686875] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 3315.687121] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 3315.687349] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 3315.687529] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 3315.691478] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 3315.695540] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 3402.725251] nfs: server ease not responding, still trying
[ 3403.765783] nfs: server ease not responding, still trying
[ 3404.089089] nfs: server ease OK
[ 3404.089297] nfs: server ease OK
[ 3899.364008] nfs: server ease not responding, still trying
[ 3899.364013] nfs: server ease not responding, still trying
[ 3899.364028] nfs: server ease not responding, still trying
[ 3899.364060] nfs: server ease not responding, still trying
[ 3899.364071] nfs: server ease not responding, still trying
[ 3899.364076] nfs: server ease not responding, still trying
[ 3930.084023] INFO: task ld:13616 blocked for more than 120 seconds.
[ 3930.087086]       Tainted: G         C      4.14.31-1-ARCH #1
[ 3930.090229] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3930.096312] ld              D    0 13616  13612 0x00000000
[ 3930.099422] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 3930.102523] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 3930.105566] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 3930.111351] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 3930.117264] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 3930.123049] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 3930.129036] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 3930.135044] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 3930.138283] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 3930.141618] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 3941.625186] nfs: server ease OK
[ 3941.625295] nfs: server ease OK
[ 3941.625441] nfs: server ease OK
[ 3941.625829] nfs: server ease OK
[ 3941.635332] nfs: server ease OK
[ 3941.635549] nfs: server ease OK
[ 4170.727338] nfs: server ease not responding, still trying
[ 4170.727343] nfs: server ease not responding, still trying
[ 4170.727356] nfs: server ease not responding, still trying
[ 4170.727395] nfs: server ease not responding, still trying
[ 4170.727413] nfs: server ease not responding, still trying
[ 4170.727428] nfs: server ease not responding, still trying
[ 4170.727441] nfs: server ease not responding, still trying
[ 4170.727455] nfs: server ease not responding, still trying
[ 4170.727461] nfs: server ease not responding, still trying
[ 4170.727467] nfs: server ease not responding, still trying
[ 4175.847588] INFO: task gzip:22430 blocked for more than 120 seconds.
[ 4175.849590]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4175.851594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4175.855516] gzip            D    0 22430  22391 0x00000000
[ 4175.857549] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4175.859576] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4175.861543] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4175.865280] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4175.869533] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4175.874352] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4175.879764] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4175.885618] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4175.892223] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4175.899511] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4175.907605] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4175.916359] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4175.925462] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4175.934598] INFO: task cp:22444 blocked for more than 120 seconds.
[ 4175.939378]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4175.944179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4175.953867] cp              D    0 22444  22422 0x00000000
[ 4175.958731] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4175.963485] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4175.968194] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4175.977469] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4175.986837] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 4175.996147] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 4176.005282] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 4176.014199] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 4176.018723] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 4176.023212] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.031736] INFO: task gzip:22446 blocked for more than 120 seconds.
[ 4176.036150]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4176.040488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.048923] gzip            D    0 22446  22413 0x00000000
[ 4176.053120] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.057319] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4176.061492] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4176.069486] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4176.077824] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4176.086158] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4176.094434] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4176.102722] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4176.111286] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4176.119906] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4176.128677] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4176.137604] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4176.146598] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.155652] INFO: task gzip:22448 blocked for more than 120 seconds.
[ 4176.160374]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4176.165034] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.174526] gzip            D    0 22448  22399 0x00000000
[ 4176.179330] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.183995] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4176.188703] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4176.197969] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4176.207204] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4176.216304] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4176.225376] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4176.234297] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4176.243314] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4176.252319] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4176.261418] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4176.270386] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4176.279426] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.288484] INFO: task gzip:22449 blocked for more than 120 seconds.
[ 4176.293202]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4176.297913] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.307420] gzip            D    0 22449  22402 0x00000000
[ 4176.312234] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.316976] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4176.321712] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4176.330930] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4176.340209] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4176.349341] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4176.358392] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4176.367284] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4176.376299] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4176.385274] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4176.394337] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4176.403276] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4176.412262] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.421295] INFO: task gzip:22452 blocked for more than 120 seconds.
[ 4176.425993]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4176.430693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.440149] gzip            D    0 22452  22418 0x00000000
[ 4176.444909] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.449585] [<80a88018>] (schedule) from [<80a8b3d4>] (rwsem_down_write_failed+0x12c/0x278)
[ 4176.458924] [<80a8b3d4>] (rwsem_down_write_failed) from [<80a8a6f0>] (down_write+0x58/0x60)
[ 4176.468288] [<80a8a6f0>] (down_write) from [<802afc48>] (path_openat+0x3b0/0x1150)
[ 4176.477766] [<802afc48>] (path_openat) from [<802b1954>] (do_filp_open+0x6c/0xdc)
[ 4176.487122] [<802b1954>] (do_filp_open) from [<8029edc4>] (do_sys_open+0x168/0x20c)
[ 4176.496594] [<8029edc4>] (do_sys_open) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.505990] INFO: task mkdir:22457 blocked for more than 120 seconds.
[ 4176.510857]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4176.515599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.525022] mkdir           D    0 22457  22453 0x00000000
[ 4176.529774] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.534358] [<80a88018>] (schedule) from [<80a8b3d4>] (rwsem_down_write_failed+0x12c/0x278)
[ 4176.543475] [<80a8b3d4>] (rwsem_down_write_failed) from [<80a8a6f0>] (down_write+0x58/0x60)
[ 4176.552568] [<80a8a6f0>] (down_write) from [<802b1118>] (filename_create+0x70/0x14c)
[ 4176.561851] [<802b1118>] (filename_create) from [<802b1d30>] (SyS_mkdirat+0x4c/0xec)
[ 4176.571295] [<802b1d30>] (SyS_mkdirat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.580876] INFO: task mkdir:22458 blocked for more than 120 seconds.
[ 4176.585741]       Tainted: G         C      4.14.31-1-ARCH #1
[ 4176.590628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.600118] mkdir           D    0 22458  22450 0x00000000
[ 4176.604882] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.609664] [<80a88018>] (schedule) from [<80a8b3d4>] (rwsem_down_write_failed+0x12c/0x278)
[ 4176.618855] [<80a8b3d4>] (rwsem_down_write_failed) from [<80a8a6f0>] (down_write+0x58/0x60)
[ 4176.628053] [<80a8a6f0>] (down_write) from [<802b1118>] (filename_create+0x70/0x14c)
[ 4176.637238] [<802b1118>] (filename_create) from [<802b1d30>] (SyS_mkdirat+0x4c/0xec)
[ 4176.646634] [<802b1d30>] (SyS_mkdirat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4211.688544] nfs: server ease not responding, still trying
[ 4212.989190] nfs: server ease OK
[ 4212.989336] nfs: server ease OK
[ 4212.989372] nfs: server ease OK
[ 4212.992652] nfs: server ease OK
[ 4213.002084] nfs: server ease OK
[ 4213.002311] nfs: server ease OK
[ 4213.002416] nfs: server ease OK
[ 4213.017966] nfs: server ease OK
[ 4213.018012] nfs: server ease OK
[ 4213.018632] nfs: server ease OK
[ 4213.020006] nfs: server ease OK
[ 4401.133010] nfs: server ease not responding, still trying
[ 4401.133014] nfs: server ease not responding, still trying
[ 4401.133030] nfs: server ease not responding, still trying
[ 4401.133067] nfs: server ease not responding, still trying
[ 4401.133110] nfs: server ease not responding, still trying
[ 4401.133120] nfs: server ease not responding, still trying
[ 4401.133124] nfs: server ease not responding, still trying
[ 4401.133139] nfs: server ease not responding, still trying
[ 4401.133156] nfs: server ease not responding, still trying
[ 4401.133171] nfs: server ease not responding, still trying
[ 4401.133187] nfs: server ease not responding, still trying
[ 4401.133202] nfs: server ease not responding, still trying
[ 4401.133233] nfs: server ease not responding, still trying
[ 4401.133245] nfs: server ease not responding, still trying
[ 4401.133251] nfs: server ease not responding, still trying
[ 4443.397196] nfs: server ease OK
[ 4443.397213] nfs: server ease OK
[ 4443.397291] nfs: server ease OK
[ 4443.397316] nfs: server ease OK
[ 4443.397343] nfs: server ease OK
[ 4443.397410] nfs: server ease OK
[ 4443.397505] nfs: server ease OK
[ 4443.397580] nfs: server ease OK
[ 4443.397605] nfs: server ease OK
[ 4443.397714] nfs: server ease OK
[ 4443.399097] nfs: server ease OK
[ 4443.405096] nfs: server ease OK
[ 4443.405772] nfs: server ease OK
[ 4443.406117] nfs: server ease OK
[ 4443.406398] nfs: server ease OK
[ 4667.377155] nfs: server ease not responding, still trying
[ 4668.417708] nfs: server ease not responding, still trying
[ 4668.700017] nfs: server ease OK
[ 4668.700524] nfs: server ease OK
[ 4856.819062] nfs: server ease not responding, still trying
[ 4856.819067] nfs: server ease not responding, still trying
[ 4856.819082] nfs: server ease not responding, still trying
[ 4856.819130] nfs: server ease not responding, still trying
[ 4856.819135] nfs: server ease not responding, still trying
[ 4856.819142] nfs: server ease not responding, still trying
[ 4856.819154] nfs: server ease not responding, still trying
[ 4856.819174] nfs: server ease not responding, still trying
[ 4856.819188] nfs: server ease not responding, still trying
[ 4856.819209] nfs: server ease not responding, still trying
[ 4856.819216] nfs: server ease not responding, still trying
[ 4893.959982] nfs: server ease OK
[ 4893.960172] nfs: server ease OK
[ 4893.960210] nfs: server ease OK
[ 4893.960311] nfs: server ease OK
[ 4893.960640] nfs: server ease OK
[ 4893.960770] nfs: server ease OK
[ 4893.960780] nfs: server ease OK
[ 4893.961280] nfs: server ease OK
[ 4893.966452] nfs: server ease OK
[ 4893.967131] nfs: server ease OK
[ 4893.969369] nfs: server ease OK
[ 5123.060914] nfs: server ease not responding, still trying
[ 5124.101425] nfs: server ease not responding, still trying
[ 5124.376882] nfs: server ease OK
[ 5124.381100] nfs: server ease OK
[ 5353.461931] nfs: server ease not responding, still trying
[ 5354.784753] nfs: server ease OK
[ 5588.982673] nfs: server ease not responding, still trying
[ 5590.077559] nfs: server ease OK
[ 5814.263180] nfs: server ease not responding, still trying
[ 5815.303698] nfs: server ease not responding, still trying
[ 5815.334003] nfs: server ease OK
[ 5815.360538] nfs: server ease OK
[ 6044.663615] nfs: server ease not responding, still trying
[ 6045.721789] nfs: server ease OK
[ 6285.305546] nfs: server ease not responding, still trying
[ 6286.346054] nfs: server ease not responding, still trying
[ 6286.376999] nfs: server ease OK
[ 6286.403761] nfs: server ease OK
[ 6510.587277] nfs: server ease not responding, still trying
[ 6511.627865] nfs: server ease not responding, still trying
[ 6511.674761] nfs: server ease OK
[ 6511.686188] nfs: server ease OK
[ 6735.868562] nfs: server ease not responding, still trying
[ 6736.909076] nfs: server ease not responding, still trying
[ 6736.940771] nfs: server ease OK
[ 6736.967038] nfs: server ease OK
[ 6940.669438] nfs: server ease not responding, still trying
[ 6977.551872] nfs: server ease OK
graysky2 commented 6 years ago

I combined a few replies into one (above) and tried to make it it bit more concise. TL;DR version is that disabling EEE does not help.

pelwell commented 6 years ago

If possible, and if it isn't already on, can you enable flow control on the switch port connected to the Pi?

graysky2 commented 6 years ago

@pelwell - All the wired connections go through an unmanaged switch. No settings to tweak :/

mkreisl commented 6 years ago

Bug: Frequent kernel oops due to blocked tasks when writing files to NFS mount.

I had similar issues on SAMBA mount. But currently I can not run tests again, because I sent back my Pi3B+.

IMO current revision of Pi3B+ has serious hardware issues and I don't believe that they can be solved via software (Finally, I never was able to play a video longer than 15mins without a Kodi crash, kernel Oops, or freeze)

@pelwell and co: Which Pi3B+ (revision) are you currently using? Parts of 0-series or parts from current production line, which customers are using now.

I still can't believe that you guys never had such issues before

JamesH65 commented 6 years ago

Current production I believe. Although I don't think there have been many/any changes since the prototypes.

It appears that the issues are erratic, and depend on the capabilities of the network the device is attached to. We are trying to figure out the exact circumstances. I suspect there are a number of issues being seen, as often happens when a previously working driver suddenly because used by 250k extra people over a weekend, in all sorts of new and unpredictable ways. I have high hopes there is a software solution to this, we've always been able to find them in the past.

On 31 March 2018 at 19:29, Manfred Kreisl notifications@github.com wrote:

Bug: Frequent kernel oops due to blocked tasks when writing files to NFS mount.

I had similar issues on SAMBA mount. But currently I can not run tests again, because I sent back my Pi3B+.

IMO current revision of Pi3B+ has serious hardware issues and I don't believe that they can be solved via software (Finally, I never was able to play a video longer than 15mins without a Kodi crash, kernel Oops, or freeze)

@pelwell https://github.com/pelwell and co: Which Pi3B+ (revision) are you currently using? Parts of 0-series or parts from current production line, which customers are using now.

I still can't believe that you guys never had such issues before

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/raspberrypi/linux/issues/2482#issuecomment-377713477, or mute the thread https://github.com/notifications/unsubscribe-auth/ADqrHV52f7xwHg74CUm3wYPAGFDuyvYZks5tj8sFgaJpZM4TB_lw .

-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd

Knoppix1 commented 6 years ago

Perso I back on pi 2b ... (He boot faster with the same SD card...)

@mkreisl when your pi is back if work normally I send my pi too

zmartell commented 6 years ago

+1 I am noticing this issue as well when reading off samba mount. Brand new RPI 3B+.

graysky2 commented 6 years ago

@pelwell - From @popcornmix's advice in https://github.com/raspberrypi/linux/issues/2442, I built:

I automated that dd test I described above in a simple script that repeats the writing out of 1G worth of zero filled file over an NFS share 32 times. I then used histogram.py to compute the stats.

With the dtparam=eee=off parameter set in /boot/config.txt I got some consistent results:

% histogram.py -p < results_no_eee.csv
# NumSamples = 32; Min = 25.46; Max = 25.75
# Mean = 25.687864; Variance = 0.002705; SD = 0.052009; Median 25.693114

When I removed that line (reverting to the default state of it being on, 1 of the 32 runs was really long:

% histogram.py -p < results.csv
# NumSamples = 36; Min = 25.34; Max = 139.44
# Mean = 28.763650; Variance = 350.005030; SD = 18.708421; Median 25.599488

Since using dd is going to max out the bus, I will try compiling the kernel which is much more gentle to the network IO and much more prone to errors in my experience. Thoughts?

graysky2 commented 6 years ago

OK... still experiencing the timeouts when compiling to the NFS share with eee enabled despite the successful replicates of using dd above. I am currently building https://github.com/raspberrypi/linux/commit/c2eb30683b43b13b931bd9cfef6a2a09ac7b7c1e and will test it by compiling the kernel to NFS with eee enabled and with it disabled.

For reference, here is the script to automate the replicate compile jobs.

graysky2 commented 6 years ago

@pelwell - I am still getting network timeouts... below is with dtparam=eee=off set booted into the latest kernel.

[11786.758187] nfs: server ease not responding, still trying
[11786.758192] nfs: server ease not responding, still trying
[11786.758206] nfs: server ease not responding, still trying
[11786.758225] nfs: server ease not responding, still trying
[11794.438353] INFO: task ld:25967 blocked for more than 120 seconds.
[11794.441599]       Tainted: G         C      4.14.32-2-ARCH #1
[11794.444867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11794.451496] ld              D    0 25967  25966 0x00000000
[11794.454918] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[11794.458408] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[11794.461670] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[11794.468043] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[11794.474533] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[11794.481165] [<80233190>] (filemap_write_and_wait_range) from [<803db254>] (nfs_file_fsync+0x30/0x280)
[11794.487571] [<803db254>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[11794.494001] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[11794.497339] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[11794.500698] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[11831.579322] nfs: server ease OK
[11831.579326] nfs: server ease OK
[11831.583067] nfs: server ease OK
[11831.583118] nfs: server ease OK
[12040.199240] INFO: task ld:27693 blocked for more than 120 seconds.
[12040.202836]       Tainted: G         C      4.14.32-2-ARCH #1
[12040.206449] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12040.213627] ld              D    0 27693  27692 0x00000000
[12040.217311] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[12040.220971] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[12040.223568] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[12040.228677] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[12040.233740] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[12040.238903] [<80233190>] (filemap_write_and_wait_range) from [<803db254>] (nfs_file_fsync+0x30/0x280)
[12040.244189] [<803db254>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[12040.249445] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[12040.252070] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[12040.254713] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[12101.639302] nfs: server ease not responding, still trying
[12101.639311] nfs: server ease not responding, still trying
[12101.639328] nfs: server ease not responding, still trying
[12142.599536] nfs: server ease not responding, still trying
[12143.639966] nfs: server ease not responding, still trying
[12143.900616] nfs: server ease OK
[12143.900633] nfs: server ease OK
[12143.909707] nfs: server ease OK
[12143.917548] nfs: server ease OK
[12143.917848] nfs: server ease OK
[12408.840196] nfs: server ease not responding, still trying
[12408.840200] nfs: server ease not responding, still trying
[12408.840228] nfs: server ease not responding, still trying
[12408.840248] nfs: server ease not responding, still trying
[12408.840274] nfs: server ease not responding, still trying
[12408.840412] INFO: task ld:29538 blocked for more than 120 seconds.
[12408.840421]       Tainted: G         C      4.14.32-2-ARCH #1
[12408.840424] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12408.840430] ld              D    0 29538  29537 0x00000000
[12408.840493] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[12408.840514] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[12408.840541] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[12408.840559] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[12408.840574] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[12408.840596] [<80233190>] (filemap_write_and_wait_range) from [<803db254>] (nfs_file_fsync+0x30/0x280)
[12408.840618] [<803db254>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[12408.840635] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[12408.840648] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[12408.840665] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
mkreisl commented 6 years ago

Same here, nothing changed. Still absolutely unstable, unreliable and completely unusable, the Pi3B+

JamesH65 commented 6 years ago

Odd, got one on my desk that is working fine. I think you forget to add "In the circumstances I am using it".

Anyway, issues still being looked at both here and at Microchip. There was a patch on the linux netdev list today for this chips driver (lan78xxx) for EEE which may well help, that will need to be tried. It's not like we are just sitting here twiddling our thumbs.

mkreisl commented 6 years ago

Anyway, issues still being looked at both here and at Microchip. There was a patch on the linux netdev list today for this chips driver (lan78xxx) for EEE which may well help, that will need to be tried. It's not like we are just sitting here twiddling our thumbs.

Seems you're getting fire under your a.. now :smile:

IMO you're looking at the wrong place. LAN issues are only the top of the iceberg

I was already reporting, that system is still unstable after that dump microchip is powered off and all traffic is going over wlan device. System still freezing randomly. So, before I'm better informed, I would say the hole Pi3B+ design is a huge issue

pelwell commented 6 years ago

Some users who reported problems (and there honestly haven't been that many, but they are shouting loudly) have had success with adding sdram_freq=450 to config.txt. I would recommend anybody with stability problems (anything not obviously network related) to do the same.

mkreisl commented 6 years ago

Some users who reported problems (and there honestly haven't been that many, but they are shouting loudly) have had success with adding sdram_freq=450 to config.txt. I would recommend anybody with stability problems (anything not obviously network related) to do the same.

What's the default for Pi3B+. Cant find it here

pelwell commented 6 years ago

500 turbo, 400 normal

graysky2 commented 6 years ago

For reference, here is the script to automate the replicate compile jobs.

@pelwell - I have some hard data now. I ran the make benchmark writing out to the NFS share under 2 conditions, once with eee disabled and once with it enabled. There is a clear trend: eee is causing problems.

Running make zImage

Here are 9 or 10 replicates running make zImage with all times reported in minutes.

% histogram.py < eee_on_zimage 
# NumSamples = 9; Min = 9.77; Max = 29.07
# Mean = 18.573025; Variance = 86.102905; SD = 9.279165; Median 10.764777

vs

% histogram.py < eee_off_zimage
# NumSamples = 10; Min = 9.91; Max = 10.87
# Mean = 10.178291; Variance = 0.067048; SD = 0.258936; Median 10.166035

Several trends from these data:

Running make modules

Here are 9 or 10 replicates running make modules with all times reported in minutes.

% histogram.py < eee_on_modules 
# NumSamples = 9; Min = 25.21; Max = 67.19
# Mean = 51.765753; Variance = 218.212739; SD = 14.772026; Median 46.494882

vs

% histogram.py < eee_off_modules
# NumSamples = 9; Min = 26.33; Max = 49.60
# Mean = 33.328529; Variance = 42.429103; SD = 6.513763; Median 32.126122

The same trends from these data:

I am happy to test future patches/firmware, whatever to help optimize this. I think the make zImage benchmark will be sufficient for this since it's way faster than make modules and gives similar results. Just let me know.

EDIT: I see @popcornmix pushed https://github.com/raspberrypi/firmware/commit/3aa806091dc57b757790b026c01754883cee2abc a few hours ago... time to retest?

graysky2 commented 6 years ago

@mkreisl - Please keep this issue on task... it's scoped for network writes not for general stability. Open a new task for that.

mkreisl commented 6 years ago

@graysky2 Oops, sorry for tainting your thread

graysky2 commented 6 years ago

A potential work-around: don't totally disable EEE, but set dtparam=tx_lpi_timer=10000 in /boot/config.txt which I did and found nearly identical results in the make zImage benchmark to having EEE totally disabled.

Again, values reported are compile times in minutes.

dtparam=tx_lpi_timer=10000

# NumSamples = 12; Min = 9.90; Max = 10.19
# Mean = 10.089245; Variance = 0.007412; SD = 0.086094; Median 10.119596

dtparam=eee=off

# NumSamples = 10; Min = 9.91; Max = 10.87
# Mean = 10.178291; Variance = 0.067048; SD = 0.258936; Median 10.166035

EDIT: see https://github.com/raspberrypi/linux/issues/2482#issuecomment-379551481 which demonstrates that the problem is still present.

mkreisl commented 6 years ago

All those EEE settings doesn't help for me, because my router/switch does not support EEE (most router with integrated switch does not support it) and I'm still getting nfs timeouts even if EEE is completely disabled, or I'm getting

Apr  6 16:11:14 kmxbilr2 kernel: [  837.345227] CIFS VFS: sends on sock aa2921c0 stuck for 15 seconds
Apr  6 16:11:14 kmxbilr2 kernel: [  837.345261] CIFS VFS: Error -11 sending data on socket to server
Apr  6 16:11:30 kmxbilr2 kernel: [  852.705497] CIFS VFS: sends on sock aa2921c0 stuck for 15 seconds
Apr  6 16:11:30 kmxbilr2 kernel: [  852.705532] CIFS VFS: Error -11 sending data on socket to server
Apr  6 16:11:30 kmxbilr2 kernel: [  852.833704] CIFS VFS: Free previous auth_key.response = 99685c00
Apr  6 16:11:55 kmxbilr2 kernel: [  878.305932] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr  6 16:11:55 kmxbilr2 kernel: [  878.305972] CIFS VFS: Error -11 sending data on socket to server
Apr  6 16:12:11 kmxbilr2 kernel: [  893.666123] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr  6 16:12:11 kmxbilr2 kernel: [  893.666156] CIFS VFS: Error -11 sending data on socket to server
Apr  6 16:12:26 kmxbilr2 kernel: [  909.026351] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr  6 16:12:26 kmxbilr2 kernel: [  909.026382] CIFS VFS: Error -11 sending data on socket to server
Apr  6 16:12:41 kmxbilr2 kernel: [  924.386541] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr  6 16:12:41 kmxbilr2 kernel: [  924.386573] CIFS VFS: Error -11 sending data on socket to server
Apr  6 16:12:41 kmxbilr2 kernel: [  924.484318] CIFS VFS: Free previous auth_key.response = a7910f00

if using SAMBA mount instead of NFS mount and after some time process that writes to share stucks and becomes uninterruptable 'D' state forever

graysky2 commented 6 years ago

@mkreisl - Are you booted into the same kernel and are you using the same firmware commit that I am? Kernel: https://github.com/raspberrypi/linux/commit/c2eb30683b43b13b931bd9cfef6a2a09ac7b7c1e Firmware: https://github.com/raspberrypi/firmware/commit/0dff9ecc00d8114164bb9a73799793dad24d8593

mkreisl commented 6 years ago

@graysky2 Kernel: yes (XBian built based on bcm2709_defconfig) Firmware: yes, excatly the same version

graysky2 commented 6 years ago

@mkreisl - not sure what to say then.... perhaps you have a different issue. As a control, have you tried the same stuff with another older RPi? Like a 2 or 3?

mkreisl commented 6 years ago

@graysky2 Sure, I'm running same procedure on Pi1, 2 and 3 (without +) since years without any problem.

mkreisl commented 6 years ago

@graysky2 In short words, I can explain what it does

1) mount network share (sshfs, nfs or samba) 2) create image on this share, big enough to backup data from root/boot fs into it 3) create partition in image (vfat for boot, btrfs for root) 4) copy boot partition into mounted image (loop device) 5) copy all subvolumes into mounted image (using btrfs send/receive or tar, both tested) 6) close everything and umount share

From 1 to 4 it always works, and within 5 it stucks always, but not on the same subvolume And, it does not matter if source fs is on sd, usb disk or iSCSI target

graysky2 commented 6 years ago

What is your NFS or samba server? How are these configured? I can complete these steps on mine without error except I don't do much with BTRFS. Can you complete the steps with ext4? I can:

# mount nfs share
mount -t nfs4 ease:/scratch /opt/armv7h/scratch
mount | grep opt
ease:/scratch on /opt/armv7/scratch type nfs4 
(rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.9.8.108,local_lock=none,addr=10.9.8.128)

# make image and format it
cd /opt/armv7h/scratch
truncate -s 5G image.img
mkfs.ext4 -F image.img
mkdir mp

# mount it
mount -o loop image.img mp
chown facade:users mp

# add 1.0 GB of data to image
dd if=/dev/zero of=/opt/armv7/scratch/mp/zerofill bs=4M count=250 status=progress

ls -l mp
total 1024020
drwx------ 2 facade users      16384 Apr  7 12:30 lost+found
-rw-r--r-- 1 facade users 1048576000 Apr  7 12:32 zerofill

umount mp
mkreisl commented 6 years ago

What is your NFS or samba server? How are these configured? I can complete these steps on mine without error except I don't do much with BTRFS. Can you complete the steps with ext4?

Running Debian Stretch as NFS and SAMBA server. Nothing special configured AFAIR Currently it's not possible to run tests with ext4, don't have any configuation based on ext4. Will do more testing next days, actually I'm testing wlan only configuration.

pelwell commented 6 years ago

You haven't said if you have changed sdram_freq or arm_freq yet.

mkreisl commented 6 years ago

@pelwell

You haven't said if you have changed sdram_freq or arm_freq yet.

Ehm @graysky2 told me that I should not post such topics here :disappointed:

Yes, changing sdram clock to 450 did it, setting arm clock to 1200 I tested 2 weeks ago without success

But, reducing sdram clock did not solve ethernet network issues discussed here

graysky2 commented 6 years ago

Ehm @graysky2 told me that I should not post such topics here disappointed

No, just reminded you that this task is about network write issues, if one of the developers asks you to try something, do it, just don't add a bunch of unrelated issues on top of this one.

pelwell commented 6 years ago

Exactly. There appear to be two classes of errors - network problems and memory corruption. I'm satisfied this is purely a networking problem.

EEE does seem to worsen network throughout, enabling flow control improves it, and we haven't ruled out further driver problems.

graysky2 commented 6 years ago

Based on @mkreisl's observation about samba shares, I did the same make zImage benchmark on a samba share (samba version 4.7.6) and found inconsistent performance (nothing in dmesg) though. This is with dtparam=tx_lpi_timer=10000 set:

# NumSamples = 4; Min = 462.65; Max = 1251.20
# Mean = 895.113710; Variance = 124647.496919; SD = 353.054524; Median 933.305226

1,zImage,631.770020346
2,zImage,462.645222405
3,zImage,1251.199164634
4,zImage,1234.840431102

After I got this result, I switched over the NFS and found highly consistent runs as before. Not sure what to make of it.

@mkreisl - You game to try the benchmark on your system? You just need a kernel source that contains some .config ready to go. See here.

mkreisl commented 6 years ago

@graysky2 modified your script as follows and let it run

#!/bin/bash

cd /srv/backup
mount | grep /srv/backup

truncate -s 8G image.img
mkfs.ext4 -F image.img
mkdir -p img

mount -o loop image.img img

to='--warning=none --acls --xattrs'
tar c --one-file-system $to -C / . | tar x $to -v -C /srv/backup/img

ls -l img

umount img && rmdir img && rm -f image.img

Destination is ext4, but source is still BTRFS

It ends up always with nfs: server kmcubie not responding, timed out

[ 1692.649145] INFO: task loop0:9479 blocked for more than 120 seconds.
[ 1692.649159]       Tainted: G         C      4.14.32+ #1
[ 1692.649167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.649175] loop0           D    0  9479      2 0x00000000
[ 1692.649219] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1692.649240] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1692.649262] [<8014be98>] (io_schedule) from [<80221fb8>] (wait_on_page_bit+0x108/0x128)
[ 1692.649279] [<80221fb8>] (wait_on_page_bit) from [<802220bc>] (__filemap_fdatawait_range+0xe4/0x114)
[ 1692.649297] [<802220bc>] (__filemap_fdatawait_range) from [<80224614>] (filemap_write_and_wait_range+0x68/0xa4)
[ 1692.649316] [<80224614>] (filemap_write_and_wait_range) from [<803a0184>] (nfs_file_fsync+0x3c/0x308)
[ 1692.649335] [<803a0184>] (nfs_file_fsync) from [<802c7644>] (vfs_fsync_range+0x68/0xbc)
[ 1692.649356] [<802c7644>] (vfs_fsync_range) from [<802c76cc>] (vfs_fsync+0x34/0x3c)
[ 1692.649383] [<802c76cc>] (vfs_fsync) from [<7f007e80>] (loop_queue_work+0x574/0xb98 [loop])
[ 1692.649438] [<7f007e80>] (loop_queue_work [loop]) from [<8013dff4>] (kthread_worker_fn+0xbc/0x1e0)
[ 1692.649459] [<8013dff4>] (kthread_worker_fn) from [<7f005738>] (loop_kthread_worker_fn+0x34/0x194 [loop])
[ 1692.649480] [<7f005738>] (loop_kthread_worker_fn [loop]) from [<8013dec4>] (kthread+0x13c/0x16c)
[ 1692.649502] [<8013dec4>] (kthread) from [<8010834c>] (ret_from_fork+0x14/0x28)
[ 1784.809482] nfs: server kmcubie not responding, timed out
[ 1784.809540] nfs: server kmcubie not responding, timed out
[ 1784.809563] nfs: server kmcubie not responding, timed out
[ 1813.481779] INFO: task kworker/u8:2:7543 blocked for more than 120 seconds.
[ 1813.481794]       Tainted: G         C      4.14.32+ #1
[ 1813.481801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.481810] kworker/u8:2    D    0  7543      2 0x00000000
[ 1813.481844] Workqueue: writeback wb_workfn (flush-7:0)
[ 1813.481885] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1813.481903] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1813.481923] [<8014be98>] (io_schedule) from [<805688c8>] (blk_mq_get_tag+0x1a4/0x2fc)
[ 1813.481944] [<805688c8>] (blk_mq_get_tag) from [<805634e8>] (blk_mq_get_request+0x134/0x43c)
[ 1813.481961] [<805634e8>] (blk_mq_get_request) from [<805661a8>] (blk_mq_make_request+0xec/0x758)
[ 1813.481980] [<805661a8>] (blk_mq_make_request) from [<80558d50>] (generic_make_request+0xe8/0x260)
[ 1813.481999] [<80558d50>] (generic_make_request) from [<80558f80>] (submit_bio+0xb8/0x1a8)
[ 1813.482018] [<80558f80>] (submit_bio) from [<8035dec8>] (ext4_io_submit+0x58/0x64)
[ 1813.482038] [<8035dec8>] (ext4_io_submit) from [<80342b6c>] (ext4_writepages+0x718/0x998)
[ 1813.482056] [<80342b6c>] (ext4_writepages) from [<80234668>] (do_writepages+0x30/0x9c)
[ 1813.482074] [<80234668>] (do_writepages) from [<802c20d8>] (__writeback_single_inode+0x44/0x554)
[ 1813.482092] [<802c20d8>] (__writeback_single_inode) from [<802c2b74>] (writeback_sb_inodes+0x238/0x4c0)
[ 1813.482111] [<802c2b74>] (writeback_sb_inodes) from [<802c2e8c>] (__writeback_inodes_wb+0x90/0xd0)
[ 1813.482128] [<802c2e8c>] (__writeback_inodes_wb) from [<802c31b8>] (wb_writeback+0x2ec/0x3bc)
[ 1813.482144] [<802c31b8>] (wb_writeback) from [<802c3c24>] (wb_workfn+0x1d4/0x570)
[ 1813.482167] [<802c3c24>] (wb_workfn) from [<80137b18>] (process_one_work+0x140/0x4c4)
[ 1813.482187] [<80137b18>] (process_one_work) from [<80137ef8>] (worker_thread+0x5c/0x568)
[ 1813.482206] [<80137ef8>] (worker_thread) from [<8013dec4>] (kthread+0x13c/0x16c)
[ 1813.482226] [<8013dec4>] (kthread) from [<8010834c>] (ret_from_fork+0x14/0x28)
[ 1813.482258] INFO: task jbd2/loop0-8:9482 blocked for more than 120 seconds.
[ 1813.482266]       Tainted: G         C      4.14.32+ #1
[ 1813.482272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.482278] jbd2/loop0-8    D    0  9482      2 0x00000000
[ 1813.482303] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1813.482318] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1813.482333] [<8014be98>] (io_schedule) from [<808c525c>] (bit_wait_io+0x1c/0x70)
[ 1813.482350] [<808c525c>] (bit_wait_io) from [<808c4ee8>] (__wait_on_bit+0x94/0xd4)
[ 1813.482368] [<808c4ee8>] (__wait_on_bit) from [<808c4fb0>] (out_of_line_wait_on_bit+0x88/0x90)
[ 1813.482385] [<808c4fb0>] (out_of_line_wait_on_bit) from [<802c988c>] (__wait_on_buffer+0x34/0x38)
[ 1813.482404] [<802c988c>] (__wait_on_buffer) from [<803838a0>] (jbd2_journal_commit_transaction+0x145c/0x191c)
[ 1813.482428] [<803838a0>] (jbd2_journal_commit_transaction) from [<80388e40>] (kjournald2+0x110/0x2a0)
[ 1813.482446] [<80388e40>] (kjournald2) from [<8013dec4>] (kthread+0x13c/0x16c)
[ 1813.482464] [<8013dec4>] (kthread) from [<8010834c>] (ret_from_fork+0x14/0x28)
[ 1813.482477] INFO: task tar:9486 blocked for more than 120 seconds.
[ 1813.482487]       Tainted: G         C      4.14.32+ #1
[ 1813.482493] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.482500] tar             D    0  9486   9454 0x00000000
[ 1813.482523] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1813.482538] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1813.482553] [<8014be98>] (io_schedule) from [<805688c8>] (blk_mq_get_tag+0x1a4/0x2fc)
[ 1813.482570] [<805688c8>] (blk_mq_get_tag) from [<805634e8>] (blk_mq_get_request+0x134/0x43c)
[ 1813.482586] [<805634e8>] (blk_mq_get_request) from [<805661a8>] (blk_mq_make_request+0xec/0x758)
[ 1813.482602] [<805661a8>] (blk_mq_make_request) from [<80558d50>] (generic_make_request+0xe8/0x260)
[ 1813.482619] [<80558d50>] (generic_make_request) from [<80558f80>] (submit_bio+0xb8/0x1a8)
[ 1813.482635] [<80558f80>] (submit_bio) from [<802cca70>] (submit_bh_wbc+0x190/0x1c0)
[ 1813.482653] [<802cca70>] (submit_bh_wbc) from [<802cd268>] (submit_bh+0x24/0x2c)
[ 1813.482672] [<802cd268>] (submit_bh) from [<8031fe1c>] (ext4_read_block_bitmap_nowait+0x354/0x80c)
[ 1813.482691] [<8031fe1c>] (ext4_read_block_bitmap_nowait) from [<8034b544>] (ext4_mb_init_cache+0x128/0x750)
[ 1813.482709] [<8034b544>] (ext4_mb_init_cache) from [<8034bd2c>] (ext4_mb_init_group+0x1c0/0x2b0)
[ 1813.482725] [<8034bd2c>] (ext4_mb_init_group) from [<8034bfb4>] (ext4_mb_good_group+0x198/0x1a8)
[ 1813.482744] [<8034bfb4>] (ext4_mb_good_group) from [<8034ee20>] (ext4_mb_regular_allocator+0x2b8/0x408)
[ 1813.482761] [<8034ee20>] (ext4_mb_regular_allocator) from [<80351124>] (ext4_mb_new_blocks+0x854/0xe8c)
[ 1813.482781] [<80351124>] (ext4_mb_new_blocks) from [<803299b0>] (ext4_ext_map_blocks+0xccc/0x13b8)
[ 1813.482799] [<803299b0>] (ext4_ext_map_blocks) from [<8033e0e0>] (ext4_map_blocks+0x114/0x600)
[ 1813.482814] [<8033e0e0>] (ext4_map_blocks) from [<8033f0e8>] (ext4_getblk+0x4c/0x184)
[ 1813.482829] [<8033f0e8>] (ext4_getblk) from [<8033f23c>] (ext4_bread+0x1c/0xac)
[ 1813.482845] [<8033f23c>] (ext4_bread) from [<80356cf4>] (ext4_append+0x68/0x10c)
[ 1813.482863] [<80356cf4>] (ext4_append) from [<8035b050>] (ext4_mkdir+0x210/0x414)
[ 1813.482883] [<8035b050>] (ext4_mkdir) from [<8029e14c>] (vfs_mkdir+0x100/0x154)
[ 1813.482908] [<8029e14c>] (vfs_mkdir) from [<802a261c>] (SyS_mkdirat+0x74/0xc0)
[ 1813.482927] [<802a261c>] (SyS_mkdirat) from [<801082a0>] (ret_fast_syscall+0x0/0x28)

But it does not always block everything infinitely bc mounted loop device will randomly become ro (after a looong time doing nothing) and tar does not continue copying, just throws a lot of error messages to the screen or tar continues copying until nfs is blocked again or third case is blocking infinitely

root@kmxbilr2 ~/bin # ps ax | grep tar\ 
13829 pts/3    S+     0:18 tar c --one-file-system --warning=none --acls --xattrs -C / .
13830 pts/3    D+     0:20 tar x --warning=none --acls --xattrs -C /srv/backup/img
15679 pts/1    S+     0:00 grep --color=auto tar 
root@kmxbilr2 ~/bin # 

This is my eee setting:

root@kmxbilr2 ~/bin # ethtool --show-eee eth0
EEE Settings for eth0:
        EEE status: disabled
        Tx LPI: disabled
        Supported EEE link modes:  100baseT/Full 
                                   1000baseT/Full 
        Advertised EEE link modes:  Not reported
        Link partner advertised EEE link modes:  Not reported

root@kmxbilr2 ~/bin # 
mkreisl commented 6 years ago

So, I run above test three times:

1st run: loop device becomes r/o 2nd run: test finished 3rd run: test blocked infinitely

but getting always messages like this: NFO: task loop0:9479 blocked for more than 120 seconds.

Ah, forgot, destination folder is mouted as follows

kmcubie:/media/system/backup/kmxbilr2 on /srv/backup type nfs4 (rw,nosuid,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.24,local_lock=none,addr=192.168.1.6)

Mounted via autofs

EDIT:

After 45min of blocking loop fs becomes r/o in 3rd run and script finished

graysky2 commented 6 years ago

Destination is ext4, but source is still BTRFS It ends up always with nfs: server kmcubie not responding, timed out

I don't use BTRFS nor do I have access to your specific filesystem, so let's come up with a standard script anyone can use provided that they have an NFS server and can mount to it.

Here is a such a script.

Experiment

Run this script on the RPi3B+ and using the identical micro SD card, on known-good board which in my case is a RPi2.

Details

Conclusion

@pelwell

Details: RPi3B+

% mount | grep nfs
ease:/scratch on /scratch type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.9.8.108,local_lock=none,addr=10.9.8.128)

Output of the script:

# /root/bin/nfstest 
Writting round 1
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 605 s, 6.9 MB/s 
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 605.129 s, 6.9 MB/s
total 4096020
drwx------ 2 root root      16384 Apr  8 12:54 lost+found
-rw-r--r-- 1 root root 4194304000 Apr  8 13:04 rnd.1
Round 1 done
Writting round 2
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 930 s, 4.5 MB/s 
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 929.514 s, 4.5 MB/s
total 4096020
drwx------ 2 root root      16384 Apr  8 13:05 lost+found
-rw-r--r-- 1 root root 4194304000 Apr  8 13:20 rnd.2
Round 2 done
Writting round 3
2990538752 bytes (3.0 GB, 2.8 GiB) copied, 679 s, 4.4 MB/s 

It is currently stuck on round 3. System is still up but not very responsive. From dmsg:

[   90.991309] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[  699.983054] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[  858.079796] INFO: task jbd2/loop0-8:633 blocked for more than 120 seconds.
[  858.079980]       Tainted: G         C      4.14.32-2-ARCH #1
[  858.080112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  858.080290] jbd2/loop0-8    D    0   633      2 0x00000000
[  858.080451] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[  858.080627] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[  858.080801] [<8015c138>] (io_schedule) from [<80a88d28>] (bit_wait_io+0xc/0x60)
[  858.080975] [<80a88d28>] (bit_wait_io) from [<80a8899c>] (__wait_on_bit+0x60/0xb8)
[  858.081158] [<80a8899c>] (__wait_on_bit) from [<80a88a68>] (out_of_line_wait_on_bit+0x74/0xa0)
[  858.085451] [<80a88a68>] (out_of_line_wait_on_bit) from [<803be5b4>] (jbd2_journal_commit_transaction+0xf6c/0x17d8)
[  858.094108] [<803be5b4>] (jbd2_journal_commit_transaction) from [<803c3ea4>] (kjournald2+0xe0/0x2e4)
[  858.102800] [<803c3ea4>] (kjournald2) from [<8014e708>] (kthread+0x138/0x150)
[  858.107318] [<8014e708>] (kthread) from [<80107db8>] (ret_from_fork+0x14/0x3c)
[ 1633.827852] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)

Details: RPi2

Output of the script:

# /root/bin/nfstest
Writting round 1
4185915392 bytes (4.2 GB, 3.9 GiB) copied, 373 s, 11.2 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 373.4 s, 11.2 MB/s
total 4096020
drwx------ 2 root root      16384 Apr  8 09:44 lost+found
-rw-r--r-- 1 root root 4194304000 Apr  8 09:51 rnd.1
Round 1 done
Writting round 2
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 373 s, 11.3 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 372.777 s, 11.3 MB/s
total 4096020
drwx------ 2 root root      16384 Apr  8 09:51 lost+found
-rw-r--r-- 1 root root 4194304000 Apr  8 09:57 rnd.2
Round 2 done
Writting round 3
4190109696 bytes (4.2 GB, 3.9 GiB) copied, 369 s, 11.3 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 369.738 s, 11.3 MB/s
total 4096020
drwx------ 2 root root      16384 Apr  8 09:57 lost+found
-rw-r--r-- 1 root root 4194304000 Apr  8 10:03 rnd.3
Round 3 done
Writting round 4
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 368 s, 11.4 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 367.684 s, 11.4 MB/s
total 4096020
drwx------ 2 root root      16384 Apr  8 10:04 lost+found
-rw-r--r-- 1 root root 4194304000 Apr  8 10:10 rnd.4
Round 4 done
nfstest  0.64s user 157.45s system 10% cpu 25:36.18 total

From dmesg:

[   26.240546] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[  411.226606] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[  759.730166] kworker/0:1H: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC), nodemask=(null)
[  759.739788] kworker/0:1H cpuset=/ mems_allowed=0
[  759.744512] CPU: 0 PID: 196 Comm: kworker/0:1H Tainted: G         C      4.14.32-2-ARCH #1
[  759.752917] Hardware name: BCM2835
[  759.756391] Workqueue: xprtiod xs_tcp_data_receive_workfn
[  759.761905] [<8010ef14>] (unwind_backtrace) from [<8010bbc0>] (show_stack+0x10/0x14)
[  759.769790] [<8010bbc0>] (show_stack) from [<80a71e3c>] (dump_stack+0x9c/0xc8)
[  759.777144] [<80a71e3c>] (dump_stack) from [<8023bea4>] (warn_alloc+0xd0/0x194)
[  759.784584] [<8023bea4>] (warn_alloc) from [<8023cf98>] (__alloc_pages_nodemask+0xf98/0x10ec)
[  759.817701] [<8023cf98>] (__alloc_pages_nodemask) from [<8023d2f0>] (page_frag_alloc+0x168/0x198)
[  759.877932] [<8023d2f0>] (page_frag_alloc) from [<8096695c>] (__netdev_alloc_skb+0x90/0x10c)
[  759.938928] [<8096695c>] (__netdev_alloc_skb) from [<8080d36c>] (rx_submit+0x34/0x214)
[  760.001476] [<8080d36c>] (rx_submit) from [<8080d8fc>] (rx_complete+0x138/0x220)
[  760.065467] [<8080d8fc>] (rx_complete) from [<80819888>] (__usb_hcd_giveback_urb+0x98/0x154)
[  760.131215] [<80819888>] (__usb_hcd_giveback_urb) from [<808424f0>] (completion_tasklet_func+0x78/0xbc)
[  760.198174] [<808424f0>] (completion_tasklet_func) from [<80134f44>] (tasklet_hi_action+0x64/0xf8)
[  760.264776] [<80134f44>] (tasklet_hi_action) from [<80101574>] (__do_softirq+0xc4/0x328)
[  760.331823] [<80101574>] (__do_softirq) from [<801349e8>] (do_softirq.part.4+0x48/0x50)
[  760.400468] [<801349e8>] (do_softirq.part.4) from [<80134aa8>] (__local_bh_enable_ip+0xb8/0xd8)
[  760.471671] [<80134aa8>] (__local_bh_enable_ip) from [<80a47024>] (xs_tcp_data_receive_workfn+0x88/0x194)
[  760.543912] [<80a47024>] (xs_tcp_data_receive_workfn) from [<801487f0>] (process_one_work+0x210/0x4fc)
[  760.617849] [<801487f0>] (process_one_work) from [<80149718>] (worker_thread+0x34/0x5d0)
[  760.691146] [<80149718>] (worker_thread) from [<8014e708>] (kthread+0x138/0x150)
[  760.763687] [<8014e708>] (kthread) from [<80107db8>] (ret_from_fork+0x14/0x3c)
[  760.803499] Mem-Info:
[  760.838097] active_anon:2032 inactive_anon:2748 isolated_anon:0
                active_file:8207 inactive_file:204480 isolated_file:0
                unevictable:0 dirty:8517 writeback:28663 unstable:2
                slab_reclaimable:6216 slab_unreclaimable:3548
                mapped:4648 shmem:56 pagetables:200 bounce:0
                free:4508 free_pcp:242 free_cma:1299
[  761.056841] Node 0 active_anon:8128kB inactive_anon:10992kB active_file:32828kB inactive_file:817920kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:18592kB dirty:34068kB writeback:114652kB shmem:224kB writeback_tmp:0kB unstable:8kB all_unreclaimable? no
[  761.198936] Normal free:18032kB min:3884kB low:4852kB high:5820kB active_anon:8128kB inactive_anon:10992kB active_file:32828kB inactive_file:817948kB unevictable:0kB writepending:148628kB present:983040kB managed:958392kB mlocked:0kB kernel_stack:896kB pagetables:800kB bounce:0kB free_pcp:968kB local_pcp:108kB free_cma:5196kB
[  761.376256] lowmem_reserve[]: 0 0
[  761.408452] Normal: 95*4kB (UMEC) 155*8kB (UMEHC) 152*16kB (UME) 218*32kB (MEH) 15*64kB (UMC) 2*128kB (MH) 0*256kB 1*512kB (H) 1*1024kB (C) 0*2048kB 1*4096kB (C) = 17876kB
[  761.509861] 212750 total pagecache pages
[  761.541672] 0 pages in swap cache
[  761.572101] Swap cache stats: add 0, delete 0, find 0/0
[  761.603884] Free swap  = 1191420kB
[  761.633321] Total swap = 1191420kB
[  761.661755] 245760 pages RAM
[  761.688868] 0 pages HighMem/MovableOnly
[  761.716659] 6162 pages reserved
[  761.743400] 2048 pages cma reserved
[  796.738005] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 1180.359713] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
mkreisl commented 6 years ago

@graysky2

When I run it on my 3B+, I see that the original bug I for which I opened this ticket is STILL presen

Can confirm this ...

4194304000 bytes (4.2 GB, 3.9 GiB) copied, 605 s, 6.9 MB/s

... and your network performance is ridiculous slow

btw, I'm getting better results if I turn off flow control completely

ethtool -A eth0 tx off rx off autoneg off && ethtool -r eth0

Infinite locks are gone, but of course, those annoying locks are still there

graysky2 commented 6 years ago

Well, the important thing is that you are independently able to reproduce this bug.

... and your network performance is ridiculous slow

Just on the 3B+ and I believe it is due to the locks ... the network is all cat6 going through an SG100-16. Here is an x86_64 box doing the same 4G write:

% dd if=/dev/zero of=/mnt/media/test/fill bs=4M count=1000 status=progress
4177526784 bytes (4.2 GB, 3.9 GiB) copied, 31 s, 135 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 35.8983 s, 117 MB/s
lategoodbye commented 6 years ago

@graysky2 In order to identify a locking issue you could try to recompile your kernel with the following options:

Kernel hacking  --->
  Lock Debugging (spinlocks, mutexes, etc...)  --->
     Lock debugging: prove locking correctness (CONFIG_PROVE_LOCKING=y)
     Sleep inside atomic section checking (CONFIG_DEBUG_ATOMIC_SLEEP=y)
graysky2 commented 6 years ago

You bet. Building now. For reference, the config I am using.

graysky2 commented 6 years ago

@lategoodbye @pelwell - I never thought to see what what going on the nfs server's side of things. My NFS server is running on a i7-4790K machine. The dmesg log there gets spammed with hundreds of lines like this the below that coincide with freeze ups on the RPi3BPlus:

[ 3624.508074] nfsd4_validate_stateid: 119 callbacks suppressed
[ 3624.508074] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.508938] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.509754] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.510577] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.511396] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.512229] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.513059] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.513872] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.514688] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.515514] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3767.535787] nfsd4_validate_stateid: 119 callbacks suppressed
[ 3767.535788] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.423284] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.423887] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.424487] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.425385] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.425979] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.426585] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.427501] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.428322] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.429125] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4218.046249] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4218.047093] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
graysky2 commented 6 years ago

@lategoodbye - I don't think the kernel has what you need... I booted into it but get this on lock up:

[  532.399675] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[  858.099808] INFO: task jbd2/loop0-8:859 blocked for more than 120 seconds.
[  858.100037]       Tainted: G        WC      4.14.32-3-ARCH #1
[  858.100215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  858.100453] jbd2/loop0-8    D    0   859      2 0x00000000
[  858.100665] [<80baebf0>] (__schedule) from [<80baf72c>] (schedule+0x50/0xc0)
[  858.100892] [<80baf72c>] (schedule) from [<80168230>] (io_schedule+0x14/0x3c)
[  858.101124] [<80168230>] (io_schedule) from [<80bb00f4>] (bit_wait_io+0xc/0x60)
[  858.101354] [<80bb00f4>] (bit_wait_io) from [<80bafd68>] (__wait_on_bit+0x60/0xb8)
[  858.101594] [<80bafd68>] (__wait_on_bit) from [<80bafe34>] (out_of_line_wait_on_bit+0x74/0xa0)
[  858.101868] [<80bafe34>] (out_of_line_wait_on_bit) from [<80420fe0>] (jbd2_journal_commit_transaction+0x1470/0x1ef8)
[  858.102191] [<80420fe0>] (jbd2_journal_commit_transaction) from [<80426e70>] (kjournald2+0xdc/0x2d0)
[  858.102472] [<80426e70>] (kjournald2) from [<80157430>] (kthread+0x148/0x164)
[  858.102697] [<80157430>] (kthread) from [<8010824c>] (ret_from_fork+0x14/0x28)
[  858.102923] INFO: lockdep is turned off.
lategoodbye commented 6 years ago

@graysky2 This means you build the kernel with these options, but it didn't trigger any warning except of the hung? I tought it would be worth to try it.

graysky2 commented 6 years ago

@lategoodbye -

[ 858.102923] INFO: lockdep is turned off.

Perhaps I didn't compile with enough of these options enabled? Here is the nconfig... more?

                .config - Linux/arm 4.14.32-3 Kernel Configuration
 ┌── Lock Debugging (spinlocks, mutexes, etc...) ────────────────────────────────┐
 │                                                                               │
 │         -*- RT Mutex debugging, deadlock detection                            │
 │         -*- Spinlock and rw-lock debugging: basic checks                      │
 │         -*- Mutex debugging: basic checks                                     │
 │         [ ] Wait/wound mutex debugging: Slowpath testing                      │
 │         -*- Lock debugging: detect incorrect freeing of live locks            │
 │         [*] Lock debugging: prove locking correctness                         │
 │         [ ] Lock usage statistics                                             │
 │         [ ] Lock dependency engine debugging (NEW)                            │
 │         [*] Sleep inside atomic section checking                              │
 │         [ ] Locking API boot-time self-tests                                  │
 │         < > torture tests for locking                                         │
 │         < > Wait/wound mutex selftests                                        │
 │                                                                               │
lategoodbye commented 6 years ago

@graysky2 If you want to doublecheck it, search for "Lock dependency validator" within dmesg. The default RPI kernel doesn't have it.

Edit: The config above looks good. Edit2: lockdep is disabled because of the hung.

graysky2 commented 6 years ago

If you want to doublecheck it, search for "Lock dependency validator" within dmesg. The default RPI kernel doesn't have it.

It's there:

% dmesg | grep -i validator
[    0.004391] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar

Edit: The config above looks good. Edit2: lockdep is disabled because of the hung.

OK... so no need to recompile... and we are back to square 1 :(

lategoodbye commented 6 years ago

@graysky2 I plan to submit RPi 3+ support to Linux Mainline soon. Any chance for you to give it a try?