openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs
Other
10.52k stars 1.74k forks source link

Pool err=28 flags=0x4000 bookmark=515 - task txg_sync:3412555 blocked for more than 120 seconds. Call Trace #13959

Open slrslr opened 2 years ago

slrslr commented 2 years ago

System information

Distribution Name: Linux Debian 11 up to date zfs version: zfs-2.1.5-1~bpo11+1; zfs-kmod-2.1.5-1~bpo11+1 uname -r: 5.10.0-18-amd64

Hello, sdc, sdd drives = 1 ZFS pool, sde is another ZFS pool. I am syncing to sde pool ("poolname") using syncoid tool (sudo zfs send 'srcpool/data'@'syncoid_pc_2021-06-07' | mbuffer -q -s 128k -m 16M 2>/dev/null | pv -s 6451598964144 | sudo zfs receive -s -F 'poolname/data).

During last hours when "syncoid -r sourcepool destpool" was run, I have seen a few journal lines of this kind: kernel: blk_update_request: I/O error, dev sdd, sector 22920562456 op 0x1:(WRITE) flags 0x4700 phys_seg 30 prio class 0

Then like 6 such lines in one minute and then drive started producing repeated deep clicking noise each several seconds, like the heads full speed stop.

ksysguardd[4181945]: Disk device disappeared (this may be about different drive)

repeated messages:

kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel:       Tainted: P        W  OE     5.10.0-18-amd64 #1 Debian 5.10.140-1
kernel: INFO: task txg_sync:3412555 blocked for more than 120 seconds.

(similar messages also below..)

tons of these messages:

zed[230017]: eid=6278 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:21706:0:4844
zed[230020]: eid=6279 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:21706:0:4845

...

zed[232943]: eid=7007 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:21706:1:1
zed[232947]: eid=7008 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:21705:1:8
zed[232952]: eid=7009 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:21706:0:4981
zed[232956]: eid=7010 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:21706:0:5262
zed[232959]: eid=7011 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:21705:2:0
zed[232963]: eid=7012 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:0:0:678
zed[232968]: eid=7013 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:0:1:0
zed[232970]: eid=7014 class=data pool='poolname' priority=3 err=28 flags=0x4000 bookmark=515:0:-1:0
zed[232971]: eid=7015 class=io_failure pool='poolname'
kernel: scsi 8:0:0:0: Direct-Access     ST16000N--redacted--     SN02 PQ: 0 ANSI: 5
kernel: sd 8:0:0:0: Attached scsi generic sg3 type 0
kernel: sd 8:0:0:0: [sde] Very big device. Trying to use READ CAPACITY(16).
kernel: sd 8:0:0:0: [sde] 31251759104 512-byte logical blocks: (16.0 TB/14.6 TiB)
kernel: sd 8:0:0:0: [sde] Write Protect is off
kernel: sd 8:0:0:0: [sde] Mode Sense: 28 00 00 00
kernel: sd 8:0:0:0: [sde] No Caching mode page found
kernel: sd 8:0:0:0: [sde] Assuming drive cache: write through
ksysguardd[4181945]: Disk device disappeared
kernel:  sde: sde1 sde9
ksysguardd[4181945]: Disk device disappeared
org.kde.ksystemstats[4181941]: sensor info answer was empty, ( false ) or sensors does not exist to us
org.kde.ksystemstats[4181941]: sensor info answer was empty, ( false ) or sensors does not exist to us
org.kde.ksystemstats[4181941]: sensor info answer was empty, ( false ) or sensors does not exist to us
org.kde.ksystemstats[4181941]: sensor info answer was empty, ( false ) or sensors does not exist to us
ksysguardd[4181945]: Disk device disappeared
ksysguardd[4181945]: Disk device disappeared
kernel: sd 8:0:0:0: [sde] Attached SCSI disk
udisksd[4550]: Error performing initial housekeeping for drive /org/freedesktop/UDisks2/drives/ST16000NM001G_--redacted--: Error updating SMART data: sk_disk_smart_read_data: Operation not supported (udisks-error-quark, 0)
dbus-daemon[4535]: [system] Activating service name='org.kde.kded.smart' requested by ':1.41' (uid=1000 pid=5873 comm="/usr/bin/kded5 ") (using servicehelper)
dbus-daemon[4535]: [system] Successfully activated service 'org.kde.kded.smart'
ksysguardd[4181945]: Disk device disappeared
ksysguardd[4181945]: Disk device disappeared
ksysguardd[4181945]: Disk device disappeared
ksysguardd[4181945]: Disk device disappeared

...

kernel: INFO: task txg_sync:3412555 blocked for more than 120 seconds.
kernel:       Tainted: P        W  OE     5.10.0-18-amd64 #1 Debian 5.10.140-1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:txg_sync        state:D stack:    0 pid:3412555 ppid:     2 flags:0x00004000
kernel: Call Trace:
kernel:  __schedule+0x282/0x880
kernel:  schedule+0x46/0xb0
kernel:  schedule_timeout+0x8b/0x150
kernel:  ? zio_nowait+0xb4/0x1a0 [zfs]
kernel:  ? __next_timer_interrupt+0x110/0x110
kernel:  io_schedule_timeout+0x4c/0x80
kernel:  __cv_timedwait_common+0x12f/0x170 [spl]
kernel:  ? add_wait_queue_exclusive+0x70/0x70
kernel:  __cv_timedwait_io+0x15/0x20 [spl]
kernel:  zio_wait+0x129/0x2b0 [zfs]
kernel:  dsl_pool_sync+0xfe/0x520 [zfs]
kernel:  ? spa_suspend_async_destroy+0x50/0x50 [zfs]
kernel:  spa_sync+0x542/0xfa0 [zfs]
kernel:  ? mutex_lock+0xe/0x30
kernel:  ? spa_txg_history_init_io+0x105/0x110 [zfs]
kernel:  txg_sync_thread+0x2e0/0x4a0 [zfs]
kernel:  ? txg_fini+0x260/0x260 [zfs]
kernel:  thread_generic_wrapper+0x6f/0x80 [spl]
kernel:  ? __thread_exit+0x20/0x20 [spl]
kernel:  kthread+0x11b/0x140
kernel:  ? __kthread_bind_mask+0x60/0x60
kernel:  ret_from_fork+0x22/0x30
kernel: INFO: task receive_writer:3414673 blocked for more than 120 seconds.
kernel:       Tainted: P        W  OE     5.10.0-18-amd64 #1 Debian 5.10.140-1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: task:receive_writer  state:D stack:    0 pid:3414673 ppid:     2 flags:0x00004000
kernel: Call Trace:
kernel:  __schedule+0x282/0x880
kernel:  ? __schedule+0x28a/0x880
kernel:  schedule+0x46/0xb0
kernel:  io_schedule+0x42/0x70
kernel:  cv_wait_common+0xac/0x130 [spl]
kernel:  ? add_wait_queue_exclusive+0x70/0x70
kernel:  txg_wait_synced_impl+0xcd/0x120 [zfs]
kernel:  txg_wait_synced+0xc/0x40 [zfs]
kernel:  dmu_tx_wait+0x380/0x390 [zfs]
kernel:  dmu_tx_assign+0x16d/0x480 [zfs]
kernel:  receive_object+0x296/0xca0 [zfs]
kernel:  ? dnode_rele_and_unlock+0x5c/0xd0 [zfs]
kernel:  ? flush_write_batch+0x1e2/0x560 [zfs]
kernel:  receive_writer_thread+0x1cc/0xad0 [zfs]
kernel:  ? thread_generic_wrapper+0x62/0x80 [spl]
kernel:  ? kfree+0x410/0x490
kernel:  ? receive_process_write_record+0x1a0/0x1a0 [zfs]
kernel:  ? thread_generic_wrapper+0x6f/0x80 [spl]
kernel:  thread_generic_wrapper+0x6f/0x80 [spl]
kernel:  ? __thread_exit+0x20/0x20 [spl]
kernel:  kthread+0x11b/0x140
kernel:  ? __kthread_bind_mask+0x60/0x60
kernel:  ret_from_fork+0x22/0x30

tried to mount the pool, got "pool I/O is currently suspended" After turning off and then on the drive, SMART short self test found no error, PASSED.

sde poolname was SUSPENDED, "One or more devices are faulted in response to IO failures." STATE: DEGRADED, too many errors after running "sudo zpool clear poolname" -> poolname is online ("One or more devices has experienced an error resulting in data corruption. Applications may be affected.")

After trying to run syncoid again, it seems that it skipped to next dataset (previous was not completed most likely according to progress that i was monitoring).

After some minutes i heard unusualy sound from drive and see in journal:

kernel: sd 8:0:0:0: [sde] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
kernel: sd 8:0:0:0: [sde] tag#0 Sense Key : Medium Error [current] 
kernel: sd 8:0:0:0: [sde] tag#0 Add. Sense: Peripheral device write fault
kernel: sd 8:0:0:0: [sde] tag#0 CDB: Write(16) 8a 00 00 00 00 04 8c 8d 38 48 00 00 00 f0 00 00
kernel: print_req_error: 46 callbacks suppressed
kernel: blk_update_request: I/O error, dev sde, sector 19537934408 op 0x1:(WRITE) flags 0x4700 phys_seg 24 prio class 0
kernel: zio pool=poolname vdev=/dev/disk/by-id/ata-ST16000N--redacted---part1 error=5 type=2 offset=10003420631040 size=1048576 flags=40080c80

drive status now:

status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-9P
  scan: resilvered 11.5M in 00:00:34 with 0 errors on datehere

Thank you ♥️

rincebrain commented 2 years ago

Your disks appear to be dying.

There's not really much ZFS can do if a non-redundant disk backing a pool goes away other than hope it comes back and doesn't immediately go away if it asks again.

jimsalterjrs commented 2 years ago

kernel: blk_update_request: I/O error, dev sde, sector 19537934408 op 0x1:(WRITE) flags 0x4700 phys_seg 24 prio class 0

This is a hardware I/O error. Hardware errors exist at a level well below ZFS or even the kernel itself. If you need assistance figuring out how to deal with a direct hardware error, you're looking for either a general Linux support community, or a trustworthy local PC repair shop, depending on how much hand holding you need in dealing with it.

In short, you're experiencing hardware failure in one of the following, ranked in order of decreasing likelihood:

  1. the SATA or SAS cable leading to the disk at /dev/sde
  2. The disk at /dev/sde itself
  3. The SATA or SAS controller /dev/sde is connected to
  4. something even further north: eg RAM, CPU, motherboard, or power supply
stale[bot] commented 11 months ago

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.