Closed graysky2 closed 6 years ago
An easy way to trigger this bug (if you don't want to try compiling the kernel package) is to simply use dd
to write out from /dev/zero
to the NFS mount. For example on my RPi3 B+:
# mount ease:/scratch /scratch-nfs
% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress
964689920 bytes (965 MB, 920 MiB) copied, 149 s, 6.5 MB/s
<<< it froze up after about 965 MB written >>>
<<< In dmesg I get another server not responding error >>>
[ 5112.824818] nfs: server ease not responding, still trying
[ 5149.707808] nfs: server ease OK
Now, if I swap out the micro SD and boot into a RPi 2 I have lying around, same network cable, same power supply, and repeat the commands, everything works as expected. I think that helps to rule out the NFS server, network hardware etc. as potentially to blame.
# mount ease:/scratch /scratch-nfs
% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 346 s, 12.1 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 357.595 s, 11.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress 0.00s user 24.47s system 5% cpu 8:03.99 total
Does disabling Energy Efficient Ethernet make a difference? Add dtparam=eee=off
to config.txt and reboot.
But before trying that you can confirm whether EEE is active using ethtool --show-eee eth0
.
Great suggestion, @pelwell. I got some very encouraging results using the dd
test which floods the I/O with a steady stream of data. It "passed" meaning no timeouts writing and no server not responding messages via dmesg. I am compiling the same package that consistently gives the errors now and will post back with those results.
Before:
# ethtool --show-eee eth0
EEE Settings for eth0:
EEE status: enabled - active
...
After:
# ethtool --show-eee eth0
EEE Settings for eth0:
EEE status: disabled
...
The test with dd:
# mount ease:/scratch /scratch-nfs
% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress && rm fill
4169138176 bytes (4.2 GB, 3.9 GiB) copied, 97 s, 42.9 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 100.665 s, 41.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=1000 status=progress 0.00s user 13.79s system 13% cpu 1:40.68 total
% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress && rm fill
8380219392 bytes (8.4 GB, 7.8 GiB) copied, 198 s, 42.3 MB/s
2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 201.245 s, 41.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress 0.00s user 27.98s system 13% cpu 3:21.25 total
% dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress && rm fill
8380219392 bytes (8.4 GB, 7.8 GiB) copied, 198 s, 42.3 MB/s
2000+0 records in
2000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 201.052 s, 41.7 MB/s
dd if=/dev/zero of=/scratch-nfs/fill bs=4M count=2000 status=progress 0.00s user 28.23s system 13% cpu 3:22.19 total
Unfortunately, when compiling which as you can reconize, writes out data must less frequently than dd
does, I am experiencing the same errors:
[ 3315.685473] INFO: task gzip:29769 blocked for more than 120 seconds.
[ 3315.685636] Tainted: G C 4.14.31-1-ARCH #1
[ 3315.685767] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3315.685955] gzip D 0 29769 29767 0x00000000
[ 3315.686127] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 3315.686299] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 3315.686473] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 3315.686663] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 3315.686875] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 3315.687121] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 3315.687349] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 3315.687529] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 3315.691478] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 3315.695540] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 3402.725251] nfs: server ease not responding, still trying
[ 3403.765783] nfs: server ease not responding, still trying
[ 3404.089089] nfs: server ease OK
[ 3404.089297] nfs: server ease OK
[ 3899.364008] nfs: server ease not responding, still trying
[ 3899.364013] nfs: server ease not responding, still trying
[ 3899.364028] nfs: server ease not responding, still trying
[ 3899.364060] nfs: server ease not responding, still trying
[ 3899.364071] nfs: server ease not responding, still trying
[ 3899.364076] nfs: server ease not responding, still trying
[ 3930.084023] INFO: task ld:13616 blocked for more than 120 seconds.
[ 3930.087086] Tainted: G C 4.14.31-1-ARCH #1
[ 3930.090229] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 3930.096312] ld D 0 13616 13612 0x00000000
[ 3930.099422] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 3930.102523] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 3930.105566] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 3930.111351] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 3930.117264] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 3930.123049] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 3930.129036] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 3930.135044] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 3930.138283] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 3930.141618] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 3941.625186] nfs: server ease OK
[ 3941.625295] nfs: server ease OK
[ 3941.625441] nfs: server ease OK
[ 3941.625829] nfs: server ease OK
[ 3941.635332] nfs: server ease OK
[ 3941.635549] nfs: server ease OK
[ 4170.727338] nfs: server ease not responding, still trying
[ 4170.727343] nfs: server ease not responding, still trying
[ 4170.727356] nfs: server ease not responding, still trying
[ 4170.727395] nfs: server ease not responding, still trying
[ 4170.727413] nfs: server ease not responding, still trying
[ 4170.727428] nfs: server ease not responding, still trying
[ 4170.727441] nfs: server ease not responding, still trying
[ 4170.727455] nfs: server ease not responding, still trying
[ 4170.727461] nfs: server ease not responding, still trying
[ 4170.727467] nfs: server ease not responding, still trying
[ 4175.847588] INFO: task gzip:22430 blocked for more than 120 seconds.
[ 4175.849590] Tainted: G C 4.14.31-1-ARCH #1
[ 4175.851594] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4175.855516] gzip D 0 22430 22391 0x00000000
[ 4175.857549] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4175.859576] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4175.861543] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4175.865280] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4175.869533] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4175.874352] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4175.879764] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4175.885618] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4175.892223] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4175.899511] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4175.907605] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4175.916359] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4175.925462] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4175.934598] INFO: task cp:22444 blocked for more than 120 seconds.
[ 4175.939378] Tainted: G C 4.14.31-1-ARCH #1
[ 4175.944179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4175.953867] cp D 0 22444 22422 0x00000000
[ 4175.958731] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4175.963485] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4175.968194] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4175.977469] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4175.986837] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[ 4175.996147] [<80233190>] (filemap_write_and_wait_range) from [<803db1c4>] (nfs_file_fsync+0x30/0x280)
[ 4176.005282] [<803db1c4>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[ 4176.014199] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[ 4176.018723] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[ 4176.023212] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.031736] INFO: task gzip:22446 blocked for more than 120 seconds.
[ 4176.036150] Tainted: G C 4.14.31-1-ARCH #1
[ 4176.040488] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.048923] gzip D 0 22446 22413 0x00000000
[ 4176.053120] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.057319] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4176.061492] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4176.069486] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4176.077824] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4176.086158] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4176.094434] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4176.102722] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4176.111286] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4176.119906] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4176.128677] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4176.137604] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4176.146598] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.155652] INFO: task gzip:22448 blocked for more than 120 seconds.
[ 4176.160374] Tainted: G C 4.14.31-1-ARCH #1
[ 4176.165034] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.174526] gzip D 0 22448 22399 0x00000000
[ 4176.179330] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.183995] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4176.188703] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4176.197969] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4176.207204] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4176.216304] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4176.225376] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4176.234297] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4176.243314] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4176.252319] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4176.261418] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4176.270386] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4176.279426] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.288484] INFO: task gzip:22449 blocked for more than 120 seconds.
[ 4176.293202] Tainted: G C 4.14.31-1-ARCH #1
[ 4176.297913] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.307420] gzip D 0 22449 22402 0x00000000
[ 4176.312234] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.316976] [<80a88018>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 4176.321712] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[ 4176.330930] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[ 4176.340209] [<80230d0c>] (__filemap_fdatawait_range) from [<80230d70>] (filemap_fdatawait_range+0x18/0x28)
[ 4176.349341] [<80230d70>] (filemap_fdatawait_range) from [<802330f4>] (filemap_write_and_wait+0x58/0x7c)
[ 4176.358392] [<802330f4>] (filemap_write_and_wait) from [<803ea028>] (nfs_wb_all+0x14/0x15c)
[ 4176.367284] [<803ea028>] (nfs_wb_all) from [<803dd96c>] (nfs_setattr+0x280/0x2a4)
[ 4176.376299] [<803dd96c>] (nfs_setattr) from [<802bf8d4>] (notify_change+0x17c/0x410)
[ 4176.385274] [<802bf8d4>] (notify_change) from [<802d62fc>] (utimes_common+0xbc/0x188)
[ 4176.394337] [<802d62fc>] (utimes_common) from [<802d64c8>] (do_utimes+0x100/0x144)
[ 4176.403276] [<802d64c8>] (do_utimes) from [<802d6548>] (SyS_utimensat+0x3c/0xb0)
[ 4176.412262] [<802d6548>] (SyS_utimensat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.421295] INFO: task gzip:22452 blocked for more than 120 seconds.
[ 4176.425993] Tainted: G C 4.14.31-1-ARCH #1
[ 4176.430693] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.440149] gzip D 0 22452 22418 0x00000000
[ 4176.444909] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.449585] [<80a88018>] (schedule) from [<80a8b3d4>] (rwsem_down_write_failed+0x12c/0x278)
[ 4176.458924] [<80a8b3d4>] (rwsem_down_write_failed) from [<80a8a6f0>] (down_write+0x58/0x60)
[ 4176.468288] [<80a8a6f0>] (down_write) from [<802afc48>] (path_openat+0x3b0/0x1150)
[ 4176.477766] [<802afc48>] (path_openat) from [<802b1954>] (do_filp_open+0x6c/0xdc)
[ 4176.487122] [<802b1954>] (do_filp_open) from [<8029edc4>] (do_sys_open+0x168/0x20c)
[ 4176.496594] [<8029edc4>] (do_sys_open) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.505990] INFO: task mkdir:22457 blocked for more than 120 seconds.
[ 4176.510857] Tainted: G C 4.14.31-1-ARCH #1
[ 4176.515599] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.525022] mkdir D 0 22457 22453 0x00000000
[ 4176.529774] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.534358] [<80a88018>] (schedule) from [<80a8b3d4>] (rwsem_down_write_failed+0x12c/0x278)
[ 4176.543475] [<80a8b3d4>] (rwsem_down_write_failed) from [<80a8a6f0>] (down_write+0x58/0x60)
[ 4176.552568] [<80a8a6f0>] (down_write) from [<802b1118>] (filename_create+0x70/0x14c)
[ 4176.561851] [<802b1118>] (filename_create) from [<802b1d30>] (SyS_mkdirat+0x4c/0xec)
[ 4176.571295] [<802b1d30>] (SyS_mkdirat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4176.580876] INFO: task mkdir:22458 blocked for more than 120 seconds.
[ 4176.585741] Tainted: G C 4.14.31-1-ARCH #1
[ 4176.590628] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 4176.600118] mkdir D 0 22458 22450 0x00000000
[ 4176.604882] [<80a87848>] (__schedule) from [<80a88018>] (schedule+0x3c/0xa0)
[ 4176.609664] [<80a88018>] (schedule) from [<80a8b3d4>] (rwsem_down_write_failed+0x12c/0x278)
[ 4176.618855] [<80a8b3d4>] (rwsem_down_write_failed) from [<80a8a6f0>] (down_write+0x58/0x60)
[ 4176.628053] [<80a8a6f0>] (down_write) from [<802b1118>] (filename_create+0x70/0x14c)
[ 4176.637238] [<802b1118>] (filename_create) from [<802b1d30>] (SyS_mkdirat+0x4c/0xec)
[ 4176.646634] [<802b1d30>] (SyS_mkdirat) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[ 4211.688544] nfs: server ease not responding, still trying
[ 4212.989190] nfs: server ease OK
[ 4212.989336] nfs: server ease OK
[ 4212.989372] nfs: server ease OK
[ 4212.992652] nfs: server ease OK
[ 4213.002084] nfs: server ease OK
[ 4213.002311] nfs: server ease OK
[ 4213.002416] nfs: server ease OK
[ 4213.017966] nfs: server ease OK
[ 4213.018012] nfs: server ease OK
[ 4213.018632] nfs: server ease OK
[ 4213.020006] nfs: server ease OK
[ 4401.133010] nfs: server ease not responding, still trying
[ 4401.133014] nfs: server ease not responding, still trying
[ 4401.133030] nfs: server ease not responding, still trying
[ 4401.133067] nfs: server ease not responding, still trying
[ 4401.133110] nfs: server ease not responding, still trying
[ 4401.133120] nfs: server ease not responding, still trying
[ 4401.133124] nfs: server ease not responding, still trying
[ 4401.133139] nfs: server ease not responding, still trying
[ 4401.133156] nfs: server ease not responding, still trying
[ 4401.133171] nfs: server ease not responding, still trying
[ 4401.133187] nfs: server ease not responding, still trying
[ 4401.133202] nfs: server ease not responding, still trying
[ 4401.133233] nfs: server ease not responding, still trying
[ 4401.133245] nfs: server ease not responding, still trying
[ 4401.133251] nfs: server ease not responding, still trying
[ 4443.397196] nfs: server ease OK
[ 4443.397213] nfs: server ease OK
[ 4443.397291] nfs: server ease OK
[ 4443.397316] nfs: server ease OK
[ 4443.397343] nfs: server ease OK
[ 4443.397410] nfs: server ease OK
[ 4443.397505] nfs: server ease OK
[ 4443.397580] nfs: server ease OK
[ 4443.397605] nfs: server ease OK
[ 4443.397714] nfs: server ease OK
[ 4443.399097] nfs: server ease OK
[ 4443.405096] nfs: server ease OK
[ 4443.405772] nfs: server ease OK
[ 4443.406117] nfs: server ease OK
[ 4443.406398] nfs: server ease OK
[ 4667.377155] nfs: server ease not responding, still trying
[ 4668.417708] nfs: server ease not responding, still trying
[ 4668.700017] nfs: server ease OK
[ 4668.700524] nfs: server ease OK
[ 4856.819062] nfs: server ease not responding, still trying
[ 4856.819067] nfs: server ease not responding, still trying
[ 4856.819082] nfs: server ease not responding, still trying
[ 4856.819130] nfs: server ease not responding, still trying
[ 4856.819135] nfs: server ease not responding, still trying
[ 4856.819142] nfs: server ease not responding, still trying
[ 4856.819154] nfs: server ease not responding, still trying
[ 4856.819174] nfs: server ease not responding, still trying
[ 4856.819188] nfs: server ease not responding, still trying
[ 4856.819209] nfs: server ease not responding, still trying
[ 4856.819216] nfs: server ease not responding, still trying
[ 4893.959982] nfs: server ease OK
[ 4893.960172] nfs: server ease OK
[ 4893.960210] nfs: server ease OK
[ 4893.960311] nfs: server ease OK
[ 4893.960640] nfs: server ease OK
[ 4893.960770] nfs: server ease OK
[ 4893.960780] nfs: server ease OK
[ 4893.961280] nfs: server ease OK
[ 4893.966452] nfs: server ease OK
[ 4893.967131] nfs: server ease OK
[ 4893.969369] nfs: server ease OK
[ 5123.060914] nfs: server ease not responding, still trying
[ 5124.101425] nfs: server ease not responding, still trying
[ 5124.376882] nfs: server ease OK
[ 5124.381100] nfs: server ease OK
[ 5353.461931] nfs: server ease not responding, still trying
[ 5354.784753] nfs: server ease OK
[ 5588.982673] nfs: server ease not responding, still trying
[ 5590.077559] nfs: server ease OK
[ 5814.263180] nfs: server ease not responding, still trying
[ 5815.303698] nfs: server ease not responding, still trying
[ 5815.334003] nfs: server ease OK
[ 5815.360538] nfs: server ease OK
[ 6044.663615] nfs: server ease not responding, still trying
[ 6045.721789] nfs: server ease OK
[ 6285.305546] nfs: server ease not responding, still trying
[ 6286.346054] nfs: server ease not responding, still trying
[ 6286.376999] nfs: server ease OK
[ 6286.403761] nfs: server ease OK
[ 6510.587277] nfs: server ease not responding, still trying
[ 6511.627865] nfs: server ease not responding, still trying
[ 6511.674761] nfs: server ease OK
[ 6511.686188] nfs: server ease OK
[ 6735.868562] nfs: server ease not responding, still trying
[ 6736.909076] nfs: server ease not responding, still trying
[ 6736.940771] nfs: server ease OK
[ 6736.967038] nfs: server ease OK
[ 6940.669438] nfs: server ease not responding, still trying
[ 6977.551872] nfs: server ease OK
I combined a few replies into one (above) and tried to make it it bit more concise. TL;DR version is that disabling EEE does not help.
If possible, and if it isn't already on, can you enable flow control on the switch port connected to the Pi?
@pelwell - All the wired connections go through an unmanaged switch. No settings to tweak :/
Bug: Frequent kernel oops due to blocked tasks when writing files to NFS mount.
I had similar issues on SAMBA mount. But currently I can not run tests again, because I sent back my Pi3B+.
IMO current revision of Pi3B+ has serious hardware issues and I don't believe that they can be solved via software (Finally, I never was able to play a video longer than 15mins without a Kodi crash, kernel Oops, or freeze)
@pelwell and co: Which Pi3B+ (revision) are you currently using? Parts of 0-series or parts from current production line, which customers are using now.
I still can't believe that you guys never had such issues before
Current production I believe. Although I don't think there have been many/any changes since the prototypes.
It appears that the issues are erratic, and depend on the capabilities of the network the device is attached to. We are trying to figure out the exact circumstances. I suspect there are a number of issues being seen, as often happens when a previously working driver suddenly because used by 250k extra people over a weekend, in all sorts of new and unpredictable ways. I have high hopes there is a software solution to this, we've always been able to find them in the past.
On 31 March 2018 at 19:29, Manfred Kreisl notifications@github.com wrote:
Bug: Frequent kernel oops due to blocked tasks when writing files to NFS mount.
I had similar issues on SAMBA mount. But currently I can not run tests again, because I sent back my Pi3B+.
IMO current revision of Pi3B+ has serious hardware issues and I don't believe that they can be solved via software (Finally, I never was able to play a video longer than 15mins without a Kodi crash, kernel Oops, or freeze)
@pelwell https://github.com/pelwell and co: Which Pi3B+ (revision) are you currently using? Parts of 0-series or parts from current production line, which customers are using now.
I still can't believe that you guys never had such issues before
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/raspberrypi/linux/issues/2482#issuecomment-377713477, or mute the thread https://github.com/notifications/unsubscribe-auth/ADqrHV52f7xwHg74CUm3wYPAGFDuyvYZks5tj8sFgaJpZM4TB_lw .
-- James Hughes Principal Software Engineer, Raspberry Pi (Trading) Ltd
Perso I back on pi 2b ... (He boot faster with the same SD card...)
@mkreisl when your pi is back if work normally I send my pi too
+1 I am noticing this issue as well when reading off samba mount. Brand new RPI 3B+.
@pelwell - From @popcornmix's advice in https://github.com/raspberrypi/linux/issues/2442, I built:
I automated that dd test I described above in a simple script that repeats the writing out of 1G worth of zero filled file over an NFS share 32 times. I then used histogram.py to compute the stats.
With the dtparam=eee=off
parameter set in /boot/config.txt
I got some consistent results:
% histogram.py -p < results_no_eee.csv
# NumSamples = 32; Min = 25.46; Max = 25.75
# Mean = 25.687864; Variance = 0.002705; SD = 0.052009; Median 25.693114
When I removed that line (reverting to the default state of it being on, 1 of the 32 runs was really long:
% histogram.py -p < results.csv
# NumSamples = 36; Min = 25.34; Max = 139.44
# Mean = 28.763650; Variance = 350.005030; SD = 18.708421; Median 25.599488
Since using dd is going to max out the bus, I will try compiling the kernel which is much more gentle to the network IO and much more prone to errors in my experience. Thoughts?
OK... still experiencing the timeouts when compiling to the NFS share with eee enabled despite the successful replicates of using dd above. I am currently building https://github.com/raspberrypi/linux/commit/c2eb30683b43b13b931bd9cfef6a2a09ac7b7c1e and will test it by compiling the kernel to NFS with eee enabled and with it disabled.
For reference, here is the script to automate the replicate compile jobs.
@pelwell - I am still getting network timeouts... below is with dtparam=eee=off
set booted into the latest kernel.
[11786.758187] nfs: server ease not responding, still trying
[11786.758192] nfs: server ease not responding, still trying
[11786.758206] nfs: server ease not responding, still trying
[11786.758225] nfs: server ease not responding, still trying
[11794.438353] INFO: task ld:25967 blocked for more than 120 seconds.
[11794.441599] Tainted: G C 4.14.32-2-ARCH #1
[11794.444867] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11794.451496] ld D 0 25967 25966 0x00000000
[11794.454918] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[11794.458408] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[11794.461670] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[11794.468043] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[11794.474533] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[11794.481165] [<80233190>] (filemap_write_and_wait_range) from [<803db254>] (nfs_file_fsync+0x30/0x280)
[11794.487571] [<803db254>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[11794.494001] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[11794.497339] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[11794.500698] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[11831.579322] nfs: server ease OK
[11831.579326] nfs: server ease OK
[11831.583067] nfs: server ease OK
[11831.583118] nfs: server ease OK
[12040.199240] INFO: task ld:27693 blocked for more than 120 seconds.
[12040.202836] Tainted: G C 4.14.32-2-ARCH #1
[12040.206449] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12040.213627] ld D 0 27693 27692 0x00000000
[12040.217311] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[12040.220971] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[12040.223568] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[12040.228677] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[12040.233740] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[12040.238903] [<80233190>] (filemap_write_and_wait_range) from [<803db254>] (nfs_file_fsync+0x30/0x280)
[12040.244189] [<803db254>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[12040.249445] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[12040.252070] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[12040.254713] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
[12101.639302] nfs: server ease not responding, still trying
[12101.639311] nfs: server ease not responding, still trying
[12101.639328] nfs: server ease not responding, still trying
[12142.599536] nfs: server ease not responding, still trying
[12143.639966] nfs: server ease not responding, still trying
[12143.900616] nfs: server ease OK
[12143.900633] nfs: server ease OK
[12143.909707] nfs: server ease OK
[12143.917548] nfs: server ease OK
[12143.917848] nfs: server ease OK
[12408.840196] nfs: server ease not responding, still trying
[12408.840200] nfs: server ease not responding, still trying
[12408.840228] nfs: server ease not responding, still trying
[12408.840248] nfs: server ease not responding, still trying
[12408.840274] nfs: server ease not responding, still trying
[12408.840412] INFO: task ld:29538 blocked for more than 120 seconds.
[12408.840421] Tainted: G C 4.14.32-2-ARCH #1
[12408.840424] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[12408.840430] ld D 0 29538 29537 0x00000000
[12408.840493] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[12408.840514] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[12408.840541] [<8015c138>] (io_schedule) from [<80230be4>] (wait_on_page_bit+0x110/0x15c)
[12408.840559] [<80230be4>] (wait_on_page_bit) from [<80230d0c>] (__filemap_fdatawait_range+0xdc/0x128)
[12408.840574] [<80230d0c>] (__filemap_fdatawait_range) from [<80233190>] (filemap_write_and_wait_range+0x54/0x88)
[12408.840596] [<80233190>] (filemap_write_and_wait_range) from [<803db254>] (nfs_file_fsync+0x30/0x280)
[12408.840618] [<803db254>] (nfs_file_fsync) from [<802d5dcc>] (vfs_fsync+0x24/0x2c)
[12408.840635] [<802d5dcc>] (vfs_fsync) from [<8029d758>] (filp_close+0x2c/0x80)
[12408.840648] [<8029d758>] (filp_close) from [<8029d7cc>] (SyS_close+0x20/0x48)
[12408.840665] [<8029d7cc>] (SyS_close) from [<80107ce0>] (ret_fast_syscall+0x0/0x4c)
Same here, nothing changed. Still absolutely unstable, unreliable and completely unusable, the Pi3B+
Odd, got one on my desk that is working fine. I think you forget to add "In the circumstances I am using it".
Anyway, issues still being looked at both here and at Microchip. There was a patch on the linux netdev list today for this chips driver (lan78xxx) for EEE which may well help, that will need to be tried. It's not like we are just sitting here twiddling our thumbs.
Anyway, issues still being looked at both here and at Microchip. There was a patch on the linux netdev list today for this chips driver (lan78xxx) for EEE which may well help, that will need to be tried. It's not like we are just sitting here twiddling our thumbs.
Seems you're getting fire under your a.. now :smile:
IMO you're looking at the wrong place. LAN issues are only the top of the iceberg
I was already reporting, that system is still unstable after that dump microchip is powered off and all traffic is going over wlan device. System still freezing randomly. So, before I'm better informed, I would say the hole Pi3B+ design is a huge issue
Some users who reported problems (and there honestly haven't been that many, but they are shouting loudly) have had success with adding sdram_freq=450
to config.txt. I would recommend anybody with stability problems (anything not obviously network related) to do the same.
Some users who reported problems (and there honestly haven't been that many, but they are shouting loudly) have had success with adding sdram_freq=450 to config.txt. I would recommend anybody with stability problems (anything not obviously network related) to do the same.
What's the default for Pi3B+. Cant find it here
500 turbo, 400 normal
For reference, here is the script to automate the replicate compile jobs.
@pelwell - I have some hard data now. I ran the make benchmark writing out to the NFS share under 2 conditions, once with eee disabled and once with it enabled. There is a clear trend: eee is causing problems.
make zImage
Here are 9 or 10 replicates running make zImage
with all times reported in minutes.
% histogram.py < eee_on_zimage
# NumSamples = 9; Min = 9.77; Max = 29.07
# Mean = 18.573025; Variance = 86.102905; SD = 9.279165; Median 10.764777
vs
% histogram.py < eee_off_zimage
# NumSamples = 10; Min = 9.91; Max = 10.87
# Mean = 10.178291; Variance = 0.067048; SD = 0.258936; Median 10.166035
Several trends from these data:
make modules
Here are 9 or 10 replicates running make modules
with all times reported in minutes.
% histogram.py < eee_on_modules
# NumSamples = 9; Min = 25.21; Max = 67.19
# Mean = 51.765753; Variance = 218.212739; SD = 14.772026; Median 46.494882
vs
% histogram.py < eee_off_modules
# NumSamples = 9; Min = 26.33; Max = 49.60
# Mean = 33.328529; Variance = 42.429103; SD = 6.513763; Median 32.126122
The same trends from these data:
I am happy to test future patches/firmware, whatever to help optimize this. I think the make zImage
benchmark will be sufficient for this since it's way faster than make modules
and gives similar results. Just let me know.
EDIT: I see @popcornmix pushed https://github.com/raspberrypi/firmware/commit/3aa806091dc57b757790b026c01754883cee2abc a few hours ago... time to retest?
@mkreisl - Please keep this issue on task... it's scoped for network writes not for general stability. Open a new task for that.
@graysky2 Oops, sorry for tainting your thread
A potential work-around: don't totally disable EEE, but set dtparam=tx_lpi_timer=10000
in /boot/config.txt
which I did and found nearly identical results in the make zImage
benchmark to having EEE totally disabled.
Again, values reported are compile times in minutes.
dtparam=tx_lpi_timer=10000
# NumSamples = 12; Min = 9.90; Max = 10.19
# Mean = 10.089245; Variance = 0.007412; SD = 0.086094; Median 10.119596
dtparam=eee=off
# NumSamples = 10; Min = 9.91; Max = 10.87
# Mean = 10.178291; Variance = 0.067048; SD = 0.258936; Median 10.166035
EDIT: see https://github.com/raspberrypi/linux/issues/2482#issuecomment-379551481 which demonstrates that the problem is still present.
All those EEE settings doesn't help for me, because my router/switch does not support EEE (most router with integrated switch does not support it) and I'm still getting nfs timeouts even if EEE is completely disabled, or I'm getting
Apr 6 16:11:14 kmxbilr2 kernel: [ 837.345227] CIFS VFS: sends on sock aa2921c0 stuck for 15 seconds
Apr 6 16:11:14 kmxbilr2 kernel: [ 837.345261] CIFS VFS: Error -11 sending data on socket to server
Apr 6 16:11:30 kmxbilr2 kernel: [ 852.705497] CIFS VFS: sends on sock aa2921c0 stuck for 15 seconds
Apr 6 16:11:30 kmxbilr2 kernel: [ 852.705532] CIFS VFS: Error -11 sending data on socket to server
Apr 6 16:11:30 kmxbilr2 kernel: [ 852.833704] CIFS VFS: Free previous auth_key.response = 99685c00
Apr 6 16:11:55 kmxbilr2 kernel: [ 878.305932] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr 6 16:11:55 kmxbilr2 kernel: [ 878.305972] CIFS VFS: Error -11 sending data on socket to server
Apr 6 16:12:11 kmxbilr2 kernel: [ 893.666123] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr 6 16:12:11 kmxbilr2 kernel: [ 893.666156] CIFS VFS: Error -11 sending data on socket to server
Apr 6 16:12:26 kmxbilr2 kernel: [ 909.026351] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr 6 16:12:26 kmxbilr2 kernel: [ 909.026382] CIFS VFS: Error -11 sending data on socket to server
Apr 6 16:12:41 kmxbilr2 kernel: [ 924.386541] CIFS VFS: sends on sock aa29c380 stuck for 15 seconds
Apr 6 16:12:41 kmxbilr2 kernel: [ 924.386573] CIFS VFS: Error -11 sending data on socket to server
Apr 6 16:12:41 kmxbilr2 kernel: [ 924.484318] CIFS VFS: Free previous auth_key.response = a7910f00
if using SAMBA mount instead of NFS mount and after some time process that writes to share stucks and becomes uninterruptable 'D' state forever
@mkreisl - Are you booted into the same kernel and are you using the same firmware commit that I am? Kernel: https://github.com/raspberrypi/linux/commit/c2eb30683b43b13b931bd9cfef6a2a09ac7b7c1e Firmware: https://github.com/raspberrypi/firmware/commit/0dff9ecc00d8114164bb9a73799793dad24d8593
@graysky2 Kernel: yes (XBian built based on bcm2709_defconfig) Firmware: yes, excatly the same version
@mkreisl - not sure what to say then.... perhaps you have a different issue. As a control, have you tried the same stuff with another older RPi? Like a 2 or 3?
@graysky2 Sure, I'm running same procedure on Pi1, 2 and 3 (without +) since years without any problem.
@graysky2 In short words, I can explain what it does
1) mount network share (sshfs, nfs or samba) 2) create image on this share, big enough to backup data from root/boot fs into it 3) create partition in image (vfat for boot, btrfs for root) 4) copy boot partition into mounted image (loop device) 5) copy all subvolumes into mounted image (using btrfs send/receive or tar, both tested) 6) close everything and umount share
From 1 to 4 it always works, and within 5 it stucks always, but not on the same subvolume And, it does not matter if source fs is on sd, usb disk or iSCSI target
What is your NFS or samba server? How are these configured? I can complete these steps on mine without error except I don't do much with BTRFS. Can you complete the steps with ext4? I can:
# mount nfs share
mount -t nfs4 ease:/scratch /opt/armv7h/scratch
mount | grep opt
ease:/scratch on /opt/armv7/scratch type nfs4
(rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.9.8.108,local_lock=none,addr=10.9.8.128)
# make image and format it
cd /opt/armv7h/scratch
truncate -s 5G image.img
mkfs.ext4 -F image.img
mkdir mp
# mount it
mount -o loop image.img mp
chown facade:users mp
# add 1.0 GB of data to image
dd if=/dev/zero of=/opt/armv7/scratch/mp/zerofill bs=4M count=250 status=progress
ls -l mp
total 1024020
drwx------ 2 facade users 16384 Apr 7 12:30 lost+found
-rw-r--r-- 1 facade users 1048576000 Apr 7 12:32 zerofill
umount mp
What is your NFS or samba server? How are these configured? I can complete these steps on mine without error except I don't do much with BTRFS. Can you complete the steps with ext4?
Running Debian Stretch as NFS and SAMBA server. Nothing special configured AFAIR Currently it's not possible to run tests with ext4, don't have any configuation based on ext4. Will do more testing next days, actually I'm testing wlan only configuration.
You haven't said if you have changed sdram_freq or arm_freq yet.
@pelwell
You haven't said if you have changed sdram_freq or arm_freq yet.
Ehm @graysky2 told me that I should not post such topics here :disappointed:
Yes, changing sdram clock to 450 did it, setting arm clock to 1200 I tested 2 weeks ago without success
But, reducing sdram clock did not solve ethernet network issues discussed here
Ehm @graysky2 told me that I should not post such topics here disappointed
No, just reminded you that this task is about network write issues, if one of the developers asks you to try something, do it, just don't add a bunch of unrelated issues on top of this one.
Exactly. There appear to be two classes of errors - network problems and memory corruption. I'm satisfied this is purely a networking problem.
EEE does seem to worsen network throughout, enabling flow control improves it, and we haven't ruled out further driver problems.
Based on @mkreisl's observation about samba shares, I did the same make zImage
benchmark on a samba share (samba version 4.7.6) and found inconsistent performance (nothing in dmesg) though. This is with dtparam=tx_lpi_timer=10000
set:
# NumSamples = 4; Min = 462.65; Max = 1251.20
# Mean = 895.113710; Variance = 124647.496919; SD = 353.054524; Median 933.305226
1,zImage,631.770020346
2,zImage,462.645222405
3,zImage,1251.199164634
4,zImage,1234.840431102
After I got this result, I switched over the NFS and found highly consistent runs as before. Not sure what to make of it.
@mkreisl - You game to try the benchmark on your system? You just need a kernel source that contains some .config ready to go. See here.
@graysky2 modified your script as follows and let it run
#!/bin/bash
cd /srv/backup
mount | grep /srv/backup
truncate -s 8G image.img
mkfs.ext4 -F image.img
mkdir -p img
mount -o loop image.img img
to='--warning=none --acls --xattrs'
tar c --one-file-system $to -C / . | tar x $to -v -C /srv/backup/img
ls -l img
umount img && rmdir img && rm -f image.img
Destination is ext4, but source is still BTRFS
It ends up always with nfs: server kmcubie not responding, timed out
[ 1692.649145] INFO: task loop0:9479 blocked for more than 120 seconds.
[ 1692.649159] Tainted: G C 4.14.32+ #1
[ 1692.649167] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1692.649175] loop0 D 0 9479 2 0x00000000
[ 1692.649219] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1692.649240] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1692.649262] [<8014be98>] (io_schedule) from [<80221fb8>] (wait_on_page_bit+0x108/0x128)
[ 1692.649279] [<80221fb8>] (wait_on_page_bit) from [<802220bc>] (__filemap_fdatawait_range+0xe4/0x114)
[ 1692.649297] [<802220bc>] (__filemap_fdatawait_range) from [<80224614>] (filemap_write_and_wait_range+0x68/0xa4)
[ 1692.649316] [<80224614>] (filemap_write_and_wait_range) from [<803a0184>] (nfs_file_fsync+0x3c/0x308)
[ 1692.649335] [<803a0184>] (nfs_file_fsync) from [<802c7644>] (vfs_fsync_range+0x68/0xbc)
[ 1692.649356] [<802c7644>] (vfs_fsync_range) from [<802c76cc>] (vfs_fsync+0x34/0x3c)
[ 1692.649383] [<802c76cc>] (vfs_fsync) from [<7f007e80>] (loop_queue_work+0x574/0xb98 [loop])
[ 1692.649438] [<7f007e80>] (loop_queue_work [loop]) from [<8013dff4>] (kthread_worker_fn+0xbc/0x1e0)
[ 1692.649459] [<8013dff4>] (kthread_worker_fn) from [<7f005738>] (loop_kthread_worker_fn+0x34/0x194 [loop])
[ 1692.649480] [<7f005738>] (loop_kthread_worker_fn [loop]) from [<8013dec4>] (kthread+0x13c/0x16c)
[ 1692.649502] [<8013dec4>] (kthread) from [<8010834c>] (ret_from_fork+0x14/0x28)
[ 1784.809482] nfs: server kmcubie not responding, timed out
[ 1784.809540] nfs: server kmcubie not responding, timed out
[ 1784.809563] nfs: server kmcubie not responding, timed out
[ 1813.481779] INFO: task kworker/u8:2:7543 blocked for more than 120 seconds.
[ 1813.481794] Tainted: G C 4.14.32+ #1
[ 1813.481801] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.481810] kworker/u8:2 D 0 7543 2 0x00000000
[ 1813.481844] Workqueue: writeback wb_workfn (flush-7:0)
[ 1813.481885] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1813.481903] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1813.481923] [<8014be98>] (io_schedule) from [<805688c8>] (blk_mq_get_tag+0x1a4/0x2fc)
[ 1813.481944] [<805688c8>] (blk_mq_get_tag) from [<805634e8>] (blk_mq_get_request+0x134/0x43c)
[ 1813.481961] [<805634e8>] (blk_mq_get_request) from [<805661a8>] (blk_mq_make_request+0xec/0x758)
[ 1813.481980] [<805661a8>] (blk_mq_make_request) from [<80558d50>] (generic_make_request+0xe8/0x260)
[ 1813.481999] [<80558d50>] (generic_make_request) from [<80558f80>] (submit_bio+0xb8/0x1a8)
[ 1813.482018] [<80558f80>] (submit_bio) from [<8035dec8>] (ext4_io_submit+0x58/0x64)
[ 1813.482038] [<8035dec8>] (ext4_io_submit) from [<80342b6c>] (ext4_writepages+0x718/0x998)
[ 1813.482056] [<80342b6c>] (ext4_writepages) from [<80234668>] (do_writepages+0x30/0x9c)
[ 1813.482074] [<80234668>] (do_writepages) from [<802c20d8>] (__writeback_single_inode+0x44/0x554)
[ 1813.482092] [<802c20d8>] (__writeback_single_inode) from [<802c2b74>] (writeback_sb_inodes+0x238/0x4c0)
[ 1813.482111] [<802c2b74>] (writeback_sb_inodes) from [<802c2e8c>] (__writeback_inodes_wb+0x90/0xd0)
[ 1813.482128] [<802c2e8c>] (__writeback_inodes_wb) from [<802c31b8>] (wb_writeback+0x2ec/0x3bc)
[ 1813.482144] [<802c31b8>] (wb_writeback) from [<802c3c24>] (wb_workfn+0x1d4/0x570)
[ 1813.482167] [<802c3c24>] (wb_workfn) from [<80137b18>] (process_one_work+0x140/0x4c4)
[ 1813.482187] [<80137b18>] (process_one_work) from [<80137ef8>] (worker_thread+0x5c/0x568)
[ 1813.482206] [<80137ef8>] (worker_thread) from [<8013dec4>] (kthread+0x13c/0x16c)
[ 1813.482226] [<8013dec4>] (kthread) from [<8010834c>] (ret_from_fork+0x14/0x28)
[ 1813.482258] INFO: task jbd2/loop0-8:9482 blocked for more than 120 seconds.
[ 1813.482266] Tainted: G C 4.14.32+ #1
[ 1813.482272] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.482278] jbd2/loop0-8 D 0 9482 2 0x00000000
[ 1813.482303] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1813.482318] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1813.482333] [<8014be98>] (io_schedule) from [<808c525c>] (bit_wait_io+0x1c/0x70)
[ 1813.482350] [<808c525c>] (bit_wait_io) from [<808c4ee8>] (__wait_on_bit+0x94/0xd4)
[ 1813.482368] [<808c4ee8>] (__wait_on_bit) from [<808c4fb0>] (out_of_line_wait_on_bit+0x88/0x90)
[ 1813.482385] [<808c4fb0>] (out_of_line_wait_on_bit) from [<802c988c>] (__wait_on_buffer+0x34/0x38)
[ 1813.482404] [<802c988c>] (__wait_on_buffer) from [<803838a0>] (jbd2_journal_commit_transaction+0x145c/0x191c)
[ 1813.482428] [<803838a0>] (jbd2_journal_commit_transaction) from [<80388e40>] (kjournald2+0x110/0x2a0)
[ 1813.482446] [<80388e40>] (kjournald2) from [<8013dec4>] (kthread+0x13c/0x16c)
[ 1813.482464] [<8013dec4>] (kthread) from [<8010834c>] (ret_from_fork+0x14/0x28)
[ 1813.482477] INFO: task tar:9486 blocked for more than 120 seconds.
[ 1813.482487] Tainted: G C 4.14.32+ #1
[ 1813.482493] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1813.482500] tar D 0 9486 9454 0x00000000
[ 1813.482523] [<808c4124>] (__schedule) from [<808c47d4>] (schedule+0x58/0xb8)
[ 1813.482538] [<808c47d4>] (schedule) from [<8014be98>] (io_schedule+0x20/0x40)
[ 1813.482553] [<8014be98>] (io_schedule) from [<805688c8>] (blk_mq_get_tag+0x1a4/0x2fc)
[ 1813.482570] [<805688c8>] (blk_mq_get_tag) from [<805634e8>] (blk_mq_get_request+0x134/0x43c)
[ 1813.482586] [<805634e8>] (blk_mq_get_request) from [<805661a8>] (blk_mq_make_request+0xec/0x758)
[ 1813.482602] [<805661a8>] (blk_mq_make_request) from [<80558d50>] (generic_make_request+0xe8/0x260)
[ 1813.482619] [<80558d50>] (generic_make_request) from [<80558f80>] (submit_bio+0xb8/0x1a8)
[ 1813.482635] [<80558f80>] (submit_bio) from [<802cca70>] (submit_bh_wbc+0x190/0x1c0)
[ 1813.482653] [<802cca70>] (submit_bh_wbc) from [<802cd268>] (submit_bh+0x24/0x2c)
[ 1813.482672] [<802cd268>] (submit_bh) from [<8031fe1c>] (ext4_read_block_bitmap_nowait+0x354/0x80c)
[ 1813.482691] [<8031fe1c>] (ext4_read_block_bitmap_nowait) from [<8034b544>] (ext4_mb_init_cache+0x128/0x750)
[ 1813.482709] [<8034b544>] (ext4_mb_init_cache) from [<8034bd2c>] (ext4_mb_init_group+0x1c0/0x2b0)
[ 1813.482725] [<8034bd2c>] (ext4_mb_init_group) from [<8034bfb4>] (ext4_mb_good_group+0x198/0x1a8)
[ 1813.482744] [<8034bfb4>] (ext4_mb_good_group) from [<8034ee20>] (ext4_mb_regular_allocator+0x2b8/0x408)
[ 1813.482761] [<8034ee20>] (ext4_mb_regular_allocator) from [<80351124>] (ext4_mb_new_blocks+0x854/0xe8c)
[ 1813.482781] [<80351124>] (ext4_mb_new_blocks) from [<803299b0>] (ext4_ext_map_blocks+0xccc/0x13b8)
[ 1813.482799] [<803299b0>] (ext4_ext_map_blocks) from [<8033e0e0>] (ext4_map_blocks+0x114/0x600)
[ 1813.482814] [<8033e0e0>] (ext4_map_blocks) from [<8033f0e8>] (ext4_getblk+0x4c/0x184)
[ 1813.482829] [<8033f0e8>] (ext4_getblk) from [<8033f23c>] (ext4_bread+0x1c/0xac)
[ 1813.482845] [<8033f23c>] (ext4_bread) from [<80356cf4>] (ext4_append+0x68/0x10c)
[ 1813.482863] [<80356cf4>] (ext4_append) from [<8035b050>] (ext4_mkdir+0x210/0x414)
[ 1813.482883] [<8035b050>] (ext4_mkdir) from [<8029e14c>] (vfs_mkdir+0x100/0x154)
[ 1813.482908] [<8029e14c>] (vfs_mkdir) from [<802a261c>] (SyS_mkdirat+0x74/0xc0)
[ 1813.482927] [<802a261c>] (SyS_mkdirat) from [<801082a0>] (ret_fast_syscall+0x0/0x28)
But it does not always block everything infinitely bc mounted loop device will randomly become ro (after a looong time doing nothing) and tar does not continue copying, just throws a lot of error messages to the screen or tar continues copying until nfs is blocked again or third case is blocking infinitely
root@kmxbilr2 ~/bin # ps ax | grep tar\
13829 pts/3 S+ 0:18 tar c --one-file-system --warning=none --acls --xattrs -C / .
13830 pts/3 D+ 0:20 tar x --warning=none --acls --xattrs -C /srv/backup/img
15679 pts/1 S+ 0:00 grep --color=auto tar
root@kmxbilr2 ~/bin #
This is my eee setting:
root@kmxbilr2 ~/bin # ethtool --show-eee eth0
EEE Settings for eth0:
EEE status: disabled
Tx LPI: disabled
Supported EEE link modes: 100baseT/Full
1000baseT/Full
Advertised EEE link modes: Not reported
Link partner advertised EEE link modes: Not reported
root@kmxbilr2 ~/bin #
So, I run above test three times:
1st run: loop device becomes r/o 2nd run: test finished 3rd run: test blocked infinitely
but getting always messages like this: NFO: task loop0:9479 blocked for more than 120 seconds.
Ah, forgot, destination folder is mouted as follows
kmcubie:/media/system/backup/kmxbilr2 on /srv/backup type nfs4 (rw,nosuid,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,soft,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.24,local_lock=none,addr=192.168.1.6)
Mounted via autofs
EDIT:
After 45min of blocking loop fs becomes r/o in 3rd run and script finished
Destination is ext4, but source is still BTRFS It ends up always with nfs: server kmcubie not responding, timed out
I don't use BTRFS nor do I have access to your specific filesystem, so let's come up with a standard script anyone can use provided that they have an NFS server and can mount to it.
Here is a such a script.
Run this script on the RPi3B+ and using the identical micro SD card, on known-good board which in my case is a RPi2.
dtparam=eee=off
set in /etc/boot.txt
@pelwell
% mount | grep nfs
ease:/scratch on /scratch type nfs4 (rw,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.9.8.108,local_lock=none,addr=10.9.8.128)
Output of the script:
# /root/bin/nfstest
Writting round 1
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 605 s, 6.9 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 605.129 s, 6.9 MB/s
total 4096020
drwx------ 2 root root 16384 Apr 8 12:54 lost+found
-rw-r--r-- 1 root root 4194304000 Apr 8 13:04 rnd.1
Round 1 done
Writting round 2
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 930 s, 4.5 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 929.514 s, 4.5 MB/s
total 4096020
drwx------ 2 root root 16384 Apr 8 13:05 lost+found
-rw-r--r-- 1 root root 4194304000 Apr 8 13:20 rnd.2
Round 2 done
Writting round 3
2990538752 bytes (3.0 GB, 2.8 GiB) copied, 679 s, 4.4 MB/s
It is currently stuck on round 3. System is still up but not very responsive. From dmsg:
[ 90.991309] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 699.983054] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 858.079796] INFO: task jbd2/loop0-8:633 blocked for more than 120 seconds.
[ 858.079980] Tainted: G C 4.14.32-2-ARCH #1
[ 858.080112] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 858.080290] jbd2/loop0-8 D 0 633 2 0x00000000
[ 858.080451] [<80a87c48>] (__schedule) from [<80a88418>] (schedule+0x3c/0xa0)
[ 858.080627] [<80a88418>] (schedule) from [<8015c138>] (io_schedule+0x14/0x3c)
[ 858.080801] [<8015c138>] (io_schedule) from [<80a88d28>] (bit_wait_io+0xc/0x60)
[ 858.080975] [<80a88d28>] (bit_wait_io) from [<80a8899c>] (__wait_on_bit+0x60/0xb8)
[ 858.081158] [<80a8899c>] (__wait_on_bit) from [<80a88a68>] (out_of_line_wait_on_bit+0x74/0xa0)
[ 858.085451] [<80a88a68>] (out_of_line_wait_on_bit) from [<803be5b4>] (jbd2_journal_commit_transaction+0xf6c/0x17d8)
[ 858.094108] [<803be5b4>] (jbd2_journal_commit_transaction) from [<803c3ea4>] (kjournald2+0xe0/0x2e4)
[ 858.102800] [<803c3ea4>] (kjournald2) from [<8014e708>] (kthread+0x138/0x150)
[ 858.107318] [<8014e708>] (kthread) from [<80107db8>] (ret_from_fork+0x14/0x3c)
[ 1633.827852] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
Output of the script:
# /root/bin/nfstest
Writting round 1
4185915392 bytes (4.2 GB, 3.9 GiB) copied, 373 s, 11.2 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 373.4 s, 11.2 MB/s
total 4096020
drwx------ 2 root root 16384 Apr 8 09:44 lost+found
-rw-r--r-- 1 root root 4194304000 Apr 8 09:51 rnd.1
Round 1 done
Writting round 2
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 373 s, 11.3 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 372.777 s, 11.3 MB/s
total 4096020
drwx------ 2 root root 16384 Apr 8 09:51 lost+found
-rw-r--r-- 1 root root 4194304000 Apr 8 09:57 rnd.2
Round 2 done
Writting round 3
4190109696 bytes (4.2 GB, 3.9 GiB) copied, 369 s, 11.3 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 369.738 s, 11.3 MB/s
total 4096020
drwx------ 2 root root 16384 Apr 8 09:57 lost+found
-rw-r--r-- 1 root root 4194304000 Apr 8 10:03 rnd.3
Round 3 done
Writting round 4
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 368 s, 11.4 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 367.684 s, 11.4 MB/s
total 4096020
drwx------ 2 root root 16384 Apr 8 10:04 lost+found
-rw-r--r-- 1 root root 4194304000 Apr 8 10:10 rnd.4
Round 4 done
nfstest 0.64s user 157.45s system 10% cpu 25:36.18 total
From dmesg:
[ 26.240546] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 411.226606] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 759.730166] kworker/0:1H: page allocation failure: order:0, mode:0x1080020(GFP_ATOMIC), nodemask=(null)
[ 759.739788] kworker/0:1H cpuset=/ mems_allowed=0
[ 759.744512] CPU: 0 PID: 196 Comm: kworker/0:1H Tainted: G C 4.14.32-2-ARCH #1
[ 759.752917] Hardware name: BCM2835
[ 759.756391] Workqueue: xprtiod xs_tcp_data_receive_workfn
[ 759.761905] [<8010ef14>] (unwind_backtrace) from [<8010bbc0>] (show_stack+0x10/0x14)
[ 759.769790] [<8010bbc0>] (show_stack) from [<80a71e3c>] (dump_stack+0x9c/0xc8)
[ 759.777144] [<80a71e3c>] (dump_stack) from [<8023bea4>] (warn_alloc+0xd0/0x194)
[ 759.784584] [<8023bea4>] (warn_alloc) from [<8023cf98>] (__alloc_pages_nodemask+0xf98/0x10ec)
[ 759.817701] [<8023cf98>] (__alloc_pages_nodemask) from [<8023d2f0>] (page_frag_alloc+0x168/0x198)
[ 759.877932] [<8023d2f0>] (page_frag_alloc) from [<8096695c>] (__netdev_alloc_skb+0x90/0x10c)
[ 759.938928] [<8096695c>] (__netdev_alloc_skb) from [<8080d36c>] (rx_submit+0x34/0x214)
[ 760.001476] [<8080d36c>] (rx_submit) from [<8080d8fc>] (rx_complete+0x138/0x220)
[ 760.065467] [<8080d8fc>] (rx_complete) from [<80819888>] (__usb_hcd_giveback_urb+0x98/0x154)
[ 760.131215] [<80819888>] (__usb_hcd_giveback_urb) from [<808424f0>] (completion_tasklet_func+0x78/0xbc)
[ 760.198174] [<808424f0>] (completion_tasklet_func) from [<80134f44>] (tasklet_hi_action+0x64/0xf8)
[ 760.264776] [<80134f44>] (tasklet_hi_action) from [<80101574>] (__do_softirq+0xc4/0x328)
[ 760.331823] [<80101574>] (__do_softirq) from [<801349e8>] (do_softirq.part.4+0x48/0x50)
[ 760.400468] [<801349e8>] (do_softirq.part.4) from [<80134aa8>] (__local_bh_enable_ip+0xb8/0xd8)
[ 760.471671] [<80134aa8>] (__local_bh_enable_ip) from [<80a47024>] (xs_tcp_data_receive_workfn+0x88/0x194)
[ 760.543912] [<80a47024>] (xs_tcp_data_receive_workfn) from [<801487f0>] (process_one_work+0x210/0x4fc)
[ 760.617849] [<801487f0>] (process_one_work) from [<80149718>] (worker_thread+0x34/0x5d0)
[ 760.691146] [<80149718>] (worker_thread) from [<8014e708>] (kthread+0x138/0x150)
[ 760.763687] [<8014e708>] (kthread) from [<80107db8>] (ret_from_fork+0x14/0x3c)
[ 760.803499] Mem-Info:
[ 760.838097] active_anon:2032 inactive_anon:2748 isolated_anon:0
active_file:8207 inactive_file:204480 isolated_file:0
unevictable:0 dirty:8517 writeback:28663 unstable:2
slab_reclaimable:6216 slab_unreclaimable:3548
mapped:4648 shmem:56 pagetables:200 bounce:0
free:4508 free_pcp:242 free_cma:1299
[ 761.056841] Node 0 active_anon:8128kB inactive_anon:10992kB active_file:32828kB inactive_file:817920kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:18592kB dirty:34068kB writeback:114652kB shmem:224kB writeback_tmp:0kB unstable:8kB all_unreclaimable? no
[ 761.198936] Normal free:18032kB min:3884kB low:4852kB high:5820kB active_anon:8128kB inactive_anon:10992kB active_file:32828kB inactive_file:817948kB unevictable:0kB writepending:148628kB present:983040kB managed:958392kB mlocked:0kB kernel_stack:896kB pagetables:800kB bounce:0kB free_pcp:968kB local_pcp:108kB free_cma:5196kB
[ 761.376256] lowmem_reserve[]: 0 0
[ 761.408452] Normal: 95*4kB (UMEC) 155*8kB (UMEHC) 152*16kB (UME) 218*32kB (MEH) 15*64kB (UMC) 2*128kB (MH) 0*256kB 1*512kB (H) 1*1024kB (C) 0*2048kB 1*4096kB (C) = 17876kB
[ 761.509861] 212750 total pagecache pages
[ 761.541672] 0 pages in swap cache
[ 761.572101] Swap cache stats: add 0, delete 0, find 0/0
[ 761.603884] Free swap = 1191420kB
[ 761.633321] Total swap = 1191420kB
[ 761.661755] 245760 pages RAM
[ 761.688868] 0 pages HighMem/MovableOnly
[ 761.716659] 6162 pages reserved
[ 761.743400] 2048 pages cma reserved
[ 796.738005] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 1180.359713] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
@graysky2
When I run it on my 3B+, I see that the original bug I for which I opened this ticket is STILL presen
Can confirm this ...
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 605 s, 6.9 MB/s
... and your network performance is ridiculous slow
btw, I'm getting better results if I turn off flow control completely
ethtool -A eth0 tx off rx off autoneg off && ethtool -r eth0
Infinite locks are gone, but of course, those annoying locks are still there
Well, the important thing is that you are independently able to reproduce this bug.
... and your network performance is ridiculous slow
Just on the 3B+ and I believe it is due to the locks ... the network is all cat6 going through an SG100-16. Here is an x86_64 box doing the same 4G write:
% dd if=/dev/zero of=/mnt/media/test/fill bs=4M count=1000 status=progress
4177526784 bytes (4.2 GB, 3.9 GiB) copied, 31 s, 135 MB/s
1000+0 records in
1000+0 records out
4194304000 bytes (4.2 GB, 3.9 GiB) copied, 35.8983 s, 117 MB/s
@graysky2 In order to identify a locking issue you could try to recompile your kernel with the following options:
Kernel hacking --->
Lock Debugging (spinlocks, mutexes, etc...) --->
Lock debugging: prove locking correctness (CONFIG_PROVE_LOCKING=y)
Sleep inside atomic section checking (CONFIG_DEBUG_ATOMIC_SLEEP=y)
@lategoodbye @pelwell - I never thought to see what what going on the nfs server's side of things. My NFS server is running on a i7-4790K machine. The dmesg log there gets spammed with hundreds of lines like this the below that coincide with freeze ups on the RPi3BPlus:
[ 3624.508074] nfsd4_validate_stateid: 119 callbacks suppressed
[ 3624.508074] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.508938] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.509754] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.510577] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.511396] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.512229] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.513059] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.513872] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.514688] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3624.515514] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 3767.535787] nfsd4_validate_stateid: 119 callbacks suppressed
[ 3767.535788] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.423284] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.423887] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.424487] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.425385] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.425979] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.426585] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.427501] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.428322] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4105.429125] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4218.046249] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
[ 4218.047093] NFSD: client 10.9.8.108 testing state ID with incorrect client ID
@lategoodbye - I don't think the kernel has what you need... I booted into it but get this on lock up:
[ 532.399675] EXT4-fs (loop0): mounted filesystem with ordered data mode. Opts: (null)
[ 858.099808] INFO: task jbd2/loop0-8:859 blocked for more than 120 seconds.
[ 858.100037] Tainted: G WC 4.14.32-3-ARCH #1
[ 858.100215] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 858.100453] jbd2/loop0-8 D 0 859 2 0x00000000
[ 858.100665] [<80baebf0>] (__schedule) from [<80baf72c>] (schedule+0x50/0xc0)
[ 858.100892] [<80baf72c>] (schedule) from [<80168230>] (io_schedule+0x14/0x3c)
[ 858.101124] [<80168230>] (io_schedule) from [<80bb00f4>] (bit_wait_io+0xc/0x60)
[ 858.101354] [<80bb00f4>] (bit_wait_io) from [<80bafd68>] (__wait_on_bit+0x60/0xb8)
[ 858.101594] [<80bafd68>] (__wait_on_bit) from [<80bafe34>] (out_of_line_wait_on_bit+0x74/0xa0)
[ 858.101868] [<80bafe34>] (out_of_line_wait_on_bit) from [<80420fe0>] (jbd2_journal_commit_transaction+0x1470/0x1ef8)
[ 858.102191] [<80420fe0>] (jbd2_journal_commit_transaction) from [<80426e70>] (kjournald2+0xdc/0x2d0)
[ 858.102472] [<80426e70>] (kjournald2) from [<80157430>] (kthread+0x148/0x164)
[ 858.102697] [<80157430>] (kthread) from [<8010824c>] (ret_from_fork+0x14/0x28)
[ 858.102923] INFO: lockdep is turned off.
@graysky2 This means you build the kernel with these options, but it didn't trigger any warning except of the hung? I tought it would be worth to try it.
@lategoodbye -
[ 858.102923] INFO: lockdep is turned off.
Perhaps I didn't compile with enough of these options enabled? Here is the nconfig... more?
.config - Linux/arm 4.14.32-3 Kernel Configuration
┌── Lock Debugging (spinlocks, mutexes, etc...) ────────────────────────────────┐
│ │
│ -*- RT Mutex debugging, deadlock detection │
│ -*- Spinlock and rw-lock debugging: basic checks │
│ -*- Mutex debugging: basic checks │
│ [ ] Wait/wound mutex debugging: Slowpath testing │
│ -*- Lock debugging: detect incorrect freeing of live locks │
│ [*] Lock debugging: prove locking correctness │
│ [ ] Lock usage statistics │
│ [ ] Lock dependency engine debugging (NEW) │
│ [*] Sleep inside atomic section checking │
│ [ ] Locking API boot-time self-tests │
│ < > torture tests for locking │
│ < > Wait/wound mutex selftests │
│ │
@graysky2 If you want to doublecheck it, search for "Lock dependency validator" within dmesg. The default RPI kernel doesn't have it.
Edit: The config above looks good. Edit2: lockdep is disabled because of the hung.
If you want to doublecheck it, search for "Lock dependency validator" within dmesg. The default RPI kernel doesn't have it.
It's there:
% dmesg | grep -i validator
[ 0.004391] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
Edit: The config above looks good. Edit2: lockdep is disabled because of the hung.
OK... so no need to recompile... and we are back to square 1 :(
@graysky2 I plan to submit RPi 3+ support to Linux Mainline soon. Any chance for you to give it a try?
Platform/Distro: RPi 3B+ running Arch ARM (armv7h).
Kernel version: 4.14.31 (https://github.com/raspberrypi/linux/commit/b36f4e9e198477803d29861e02d3ea00fe5e09ab) Firmware version: latest as I write this (https://github.com/raspberrypi/firmware/commit/c14a90333c13f507ab219d583b74a998ec11a6e7)
Bug: Frequent kernel oops due to blocked tasks when writing files to NFS mount.
Details: When compiling, dmesg is full of kernel oops like the below when doing so on an NFS mount. Compiling to the micro SD card is fine. I believe that the software (disto) on the micro SD card is NOT to blame... if I put the same micro SD card into a RPi3 or RPi2, I can compile without error.
Again, I am using an NFS mounted partition (/scratch) on which to compile, so I'm hypothesizing that these problems are related to the network driver.