Closed by darrenfreeman 3 years ago
Thanks for reporting. Can you show the smartctl --all output?
Here are some more details, the WD drives may also have a firmware bug:
https://blocksandfiles.com/2020/04/15/shingled-drives-have-non-shingled-zones-for-caching-writes/
In particular, the last part is important for people who want to resilver:
I was able to force the WD40EFAX to resilver – by switching off write caching and lookahead. This dropped the drive’s write speed to less than 6MB/sec and the resilvering took 8 days instead of the more usual 24 hours. More worryingly, once added, a ZFS SCRUB (RAID integrity check) has yet to successfully complete without that drive producing checksum errors, even after 5 days of trying.
That last entry is mine.
This is definitely a WD firmware bug. 3 different drives showing the same errors at more or less the same points in the resilvering process is pretty damning.
I strongly suggest that people DO NOT attempt to replicate my attempts to force a drive into the array to see what happens next. I'm still cleaning up the mess. ZFS kicked it out for good reason.
WD are (of course) continuing to deny there's a problem or that they have firmware issues.
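For reference, the write-cache and look-ahead switches mentioned in the quoted resilver attempt are exposed through hdparm. This is only a sketch: the device name is a placeholder, not from the original report, and the commands must run as root.

```shell
# Sketch: disable the drive's write cache and read look-ahead with hdparm,
# as in the quoted forced-resilver attempt. /dev/sdX is a placeholder.
DEV=${DEV:-/dev/sdX}

disable_caches() {
    hdparm -W 0 "$DEV"   # -W 0: turn off the drive's write cache
    hdparm -A 0 "$DEV"   # -A 0: turn off read look-ahead
}

# disable_caches   # uncomment on the real device; expect a large write-speed drop
```

Note the function is only defined here, not called; on a DM-SMR drive, expect throughput to collapse as described above.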
If you wish to see the drive's internal error log, use smartctl --xall (or smartctl -x).
The extended GP log will not show up with the --all (-a) flag.
smartctl --xall /dev/sdj
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-4.19.0-8-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: WDC WD40EFAX-68JH4N0
Serial Number: WD-...
LU WWN Device Id: ...
Firmware Version: 82.00A82
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Apr 17 12:44:20 2020 AEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is: Unavailable
APM feature is: Unavailable
Rd look-ahead is: Enabled
Write cache is: Enabled
DSN feature is: Unavailable
ATA Security is: Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (21404) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 525) minutes.
Conveyance self-test routine
recommended polling time: ( 3) minutes.
SCT capabilities: (0x3039) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAGS VALUE WORST THRESH FAIL RAW_VALUE
1 Raw_Read_Error_Rate POSR-K 200 200 051 - 0
3 Spin_Up_Time POS--K 100 253 021 - 0
4 Start_Stop_Count -O--CK 100 100 000 - 1
5 Reallocated_Sector_Ct PO--CK 200 200 140 - 0
7 Seek_Error_Rate -OSR-K 200 200 000 - 0
9 Power_On_Hours -O--CK 100 100 000 - 540
10 Spin_Retry_Count -O--CK 100 253 000 - 0
11 Calibration_Retry_Count -O--CK 100 253 000 - 0
12 Power_Cycle_Count -O--CK 100 100 000 - 1
192 Power-Off_Retract_Count -O--CK 200 200 000 - 0
193 Load_Cycle_Count -O--CK 200 200 000 - 7
194 Temperature_Celsius -O---K 126 113 000 - 21
196 Reallocated_Event_Count -O--CK 200 200 000 - 0
197 Current_Pending_Sector -O--CK 200 200 000 - 0
198 Offline_Uncorrectable ----CK 100 253 000 - 0
199 UDMA_CRC_Error_Count -O--CK 200 200 000 - 0
200 Multi_Zone_Error_Rate ---R-- 200 200 000 - 0
||||||_ K auto-keep
|||||__ C event count
||||___ R error rate
|||____ S speed/performance
||_____ O updated online
|______ P prefailure warning
General Purpose Log Directory Version 1
SMART Log Directory Version 1 [multi-sector log support]
Address Access R/W Size Description
0x00 GPL,SL R/O 1 Log Directory
0x01 SL R/O 1 Summary SMART error log
0x02 SL R/O 5 Comprehensive SMART error log
0x03 GPL R/O 6 Ext. Comprehensive SMART error log
0x04 GPL R/O 256 Device Statistics log
0x04 SL R/O 8 Device Statistics log
0x06 SL R/O 1 SMART self-test log
0x07 GPL R/O 1 Extended self-test log
0x09 SL R/W 1 Selective self-test log
0x0c GPL R/O 2048 Pending Defects log
0x10 GPL R/O 1 NCQ Command Error log
0x11 GPL R/O 1 SATA Phy Event Counters log
0x24 GPL R/O 294 Current Device Internal Status Data log
0x30 GPL,SL R/O 9 IDENTIFY DEVICE data log
0x80-0x9f GPL,SL R/W 16 Host vendor specific log
0xa0-0xa7 GPL,SL VS 16 Device vendor specific log
0xa8-0xb6 GPL,SL VS 1 Device vendor specific log
0xb7 GPL,SL VS 78 Device vendor specific log
0xb9 GPL,SL VS 4 Device vendor specific log
0xbd GPL,SL VS 1 Device vendor specific log
0xc0 GPL,SL VS 1 Device vendor specific log
0xc1 GPL VS 93 Device vendor specific log
0xe0 GPL,SL R/W 1 SCT Command/Status
0xe1 GPL,SL R/W 1 SCT Data Transfer
SMART Extended Comprehensive Error Log Version: 1 (6 sectors)
Device Error Count: 7487 (device log contains only the most recent 24 errors)
CR = Command Register
FEATR = Features Register
COUNT = Count (was: Sector Count) Register
LBA_48 = Upper bytes of LBA High/Mid/Low Registers ] ATA-8
LH = LBA High (was: Cylinder High) Register ] LBA
LM = LBA Mid (was: Cylinder Low) Register ] Register
LL = LBA Low (was: Sector Number) Register ]
DV = Device (was: Device/Head) Register
DC = Device Control Register
ER = Error register
ST = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 7487 [22] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 d1 c0 bc 10 40 00 Error: IDNF at LBA = 0x1d1c0bc10 = 7814036496
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 38 00 20 00 01 a8 8f ca 78 40 00 22d+12:23:01.073 WRITE FPDMA QUEUED
61 00 18 00 18 00 01 a8 8f ca b8 40 00 22d+12:23:01.073 WRITE FPDMA QUEUED
61 00 10 00 10 00 00 02 32 ba 10 40 00 22d+12:23:01.073 WRITE FPDMA QUEUED
61 00 10 00 08 00 01 d1 c0 bc 10 40 00 22d+12:23:01.073 WRITE FPDMA QUEUED
61 00 10 00 00 00 01 d1 c0 ba 10 40 00 22d+12:23:01.073 WRITE FPDMA QUEUED
Error 7486 [21] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 a8 8f c9 78 40 00 Error: IDNF at LBA = 0x1a88fc978 = 7122962808
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 58 00 00 00 01 a8 8f c9 98 40 00 22d+12:22:53.109 WRITE FPDMA QUEUED
61 00 18 00 20 00 01 a5 64 0d e8 40 00 22d+12:22:52.710 WRITE FPDMA QUEUED
61 00 10 00 18 00 01 a5 64 0d d8 40 00 22d+12:22:52.383 WRITE FPDMA QUEUED
61 00 10 00 10 00 01 a5 64 0d c8 40 00 22d+12:22:52.368 WRITE FPDMA QUEUED
61 00 18 00 08 00 01 a8 8f c9 78 40 00 22d+12:22:52.368 WRITE FPDMA QUEUED
Error 7485 [20] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 a8 8e f9 78 40 00 Error: IDNF at LBA = 0x1a88ef978 = 7122909560
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 38 00 00 00 01 a8 8e f9 98 40 00 22d+12:22:15.406 WRITE FPDMA QUEUED
61 00 40 00 38 00 01 a5 64 07 e8 40 00 22d+12:22:15.152 WRITE FPDMA QUEUED
61 00 40 00 30 00 01 a5 64 07 a8 40 00 22d+12:22:14.895 WRITE FPDMA QUEUED
61 00 20 00 28 00 01 a8 fc 7f 68 40 00 22d+12:22:14.895 WRITE FPDMA QUEUED
61 00 28 00 20 00 01 a8 fc 7f 40 40 00 22d+12:22:14.895 WRITE FPDMA QUEUED
Error 7484 [19] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 80 96 22 e8 40 00 Error: IDNF at LBA = 0x1809622e8 = 6452290280
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 10 00 18 00 01 d1 c0 bc 10 40 00 22d+12:21:34.814 WRITE FPDMA QUEUED
61 00 10 00 10 00 01 d1 c0 ba 10 40 00 22d+12:21:34.603 WRITE FPDMA QUEUED
61 00 10 00 08 00 00 02 32 ba 10 40 00 22d+12:21:34.406 WRITE FPDMA QUEUED
61 00 08 00 00 00 01 80 96 27 10 40 00 22d+12:21:34.362 WRITE FPDMA QUEUED
61 00 08 00 20 00 01 80 96 22 e8 40 00 22d+12:21:33.737 WRITE FPDMA QUEUED
Error 7483 [18] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 a9 39 f2 e0 40 00 Error: IDNF at LBA = 0x1a939f2e0 = 7134114528
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 08 00 00 00 01 a9 3d 99 60 40 00 22d+12:21:26.189 WRITE FPDMA QUEUED
61 00 08 00 08 00 01 a9 39 f2 e0 40 00 22d+12:21:26.189 WRITE FPDMA QUEUED
61 00 08 00 00 00 01 a9 38 d8 10 40 00 22d+12:21:26.189 WRITE FPDMA QUEUED
61 00 08 00 08 00 01 a9 38 81 68 40 00 22d+12:21:26.189 WRITE FPDMA QUEUED
61 00 08 00 00 00 01 a8 d2 c5 30 40 00 22d+12:21:26.188 WRITE FPDMA QUEUED
Error 7482 [17] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 a8 8d 5f d8 40 00 Error: IDNF at LBA = 0x1a88d5fd8 = 7122804696
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 98 00 20 00 01 a8 8d 5f d8 40 00 22d+12:20:49.161 WRITE FPDMA QUEUED
60 00 10 00 18 00 01 d1 c0 bc 10 40 00 22d+12:20:49.159 READ FPDMA QUEUED
60 00 10 00 10 00 01 d1 c0 ba 10 40 00 22d+12:20:49.159 READ FPDMA QUEUED
60 00 10 00 08 00 00 02 32 ba 10 40 00 22d+12:20:49.159 READ FPDMA QUEUED
61 00 08 00 00 00 01 a8 8c 90 e8 40 00 22d+12:20:49.159 WRITE FPDMA QUEUED
Error 7481 [16] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 a8 8d 5f 98 40 00 Error: IDNF at LBA = 0x1a88d5f98 = 7122804632
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 10 00 18 00 01 a5 64 05 00 40 00 22d+12:20:46.550 WRITE FPDMA QUEUED
61 00 18 00 10 00 01 a8 fa 57 60 40 00 22d+12:20:46.534 WRITE FPDMA QUEUED
61 00 18 00 08 00 01 a8 8d 5f b8 40 00 22d+12:20:42.224 WRITE FPDMA QUEUED
61 00 18 00 00 00 01 a8 8d 5f 98 40 00 22d+12:20:42.223 WRITE FPDMA QUEUED
61 00 18 00 08 00 01 a8 8d 5f 78 40 00 22d+12:20:42.222 WRITE FPDMA QUEUED
Error 7480 [15] occurred at disk power-on lifetime: 540 hours (22 days + 12 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER -- ST COUNT LBA_48 LH LM LL DV DC
-- -- -- == -- == == == -- -- -- -- --
10 -- 51 00 00 00 01 d1 c0 bc 10 40 00 Error: IDNF at LBA = 0x1d1c0bc10 = 7814036496
Commands leading to the command that caused the error were:
CR FEATR COUNT LBA_48 LH LM LL DV DC Powered_Up_Time Command/Feature_Name
-- == -- == -- == == == -- -- -- -- -- --------------- --------------------
61 00 18 00 20 00 01 a5 64 03 58 40 00 22d+12:19:13.755 WRITE FPDMA QUEUED
61 00 28 00 18 00 01 a8 f8 79 58 40 00 22d+12:19:13.745 WRITE FPDMA QUEUED
61 00 20 00 00 00 01 a8 f8 7a d0 40 00 22d+12:19:13.743 WRITE FPDMA QUEUED
61 00 10 00 10 00 00 02 32 ba 10 40 00 22d+12:19:08.985 WRITE FPDMA QUEUED
61 00 10 00 08 00 01 d1 c0 bc 10 40 00 22d+12:19:08.985 WRITE FPDMA QUEUED
SMART Extended Self-test Log Version: 1 (1 sectors)
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 0 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
SCT Status Version: 3
SCT Version (vendor specific): 258 (0x0102)
SCT Support Level: 1
Device State: Active (0)
Current Temperature: 21 Celsius
Power Cycle Min/Max Temperature: 16/31 Celsius
Lifetime Min/Max Temperature: 16/31 Celsius
Under/Over Temperature Limit Count: 0/0
Vendor specific:
01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
SCT Temperature History Version: 2
Temperature Sampling Period: 1 minute
Temperature Logging Interval: 1 minute
Min/Max recommended Temperature: 0/65 Celsius
Min/Max Temperature Limit: -41/85 Celsius
Temperature History Size (Index): 478 (46)
Index Estimated Time Temperature Celsius
47 2020-04-17 04:47 26 *******
48 2020-04-17 04:48 26 *******
49 2020-04-17 04:49 26 *******
50 2020-04-17 04:50 27 ********
51 2020-04-17 04:51 26 *******
... ..( 6 skipped). .. *******
58 2020-04-17 04:58 26 *******
59 2020-04-17 04:59 25 ******
... ..( 9 skipped). .. ******
69 2020-04-17 05:09 25 ******
70 2020-04-17 05:10 24 *****
... ..(115 skipped). .. *****
186 2020-04-17 07:06 24 *****
187 2020-04-17 07:07 25 ******
... ..( 12 skipped). .. ******
200 2020-04-17 07:20 25 ******
201 2020-04-17 07:21 26 *******
... ..( 7 skipped). .. *******
209 2020-04-17 07:29 26 *******
210 2020-04-17 07:30 27 ********
... ..( 77 skipped). .. ********
288 2020-04-17 08:48 27 ********
289 2020-04-17 08:49 26 *******
... ..( 11 skipped). .. *******
301 2020-04-17 09:01 26 *******
302 2020-04-17 09:02 25 ******
... ..( 11 skipped). .. ******
314 2020-04-17 09:14 25 ******
315 2020-04-17 09:15 24 *****
... ..( 19 skipped). .. *****
335 2020-04-17 09:35 24 *****
336 2020-04-17 09:36 23 ****
... ..( 19 skipped). .. ****
356 2020-04-17 09:56 23 ****
357 2020-04-17 09:57 24 *****
... ..( 11 skipped). .. *****
369 2020-04-17 10:09 24 *****
370 2020-04-17 10:10 25 ******
371 2020-04-17 10:11 24 *****
... ..( 4 skipped). .. *****
376 2020-04-17 10:16 24 *****
377 2020-04-17 10:17 23 ****
... ..( 9 skipped). .. ****
387 2020-04-17 10:27 23 ****
388 2020-04-17 10:28 24 *****
... ..( 15 skipped). .. *****
404 2020-04-17 10:44 24 *****
405 2020-04-17 10:45 25 ******
406 2020-04-17 10:46 24 *****
... ..( 13 skipped). .. *****
420 2020-04-17 11:00 24 *****
421 2020-04-17 11:01 25 ******
422 2020-04-17 11:02 24 *****
... ..( 16 skipped). .. *****
439 2020-04-17 11:19 24 *****
440 2020-04-17 11:20 23 ****
... ..( 5 skipped). .. ****
446 2020-04-17 11:26 23 ****
447 2020-04-17 11:27 22 ***
448 2020-04-17 11:28 23 ****
449 2020-04-17 11:29 23 ****
450 2020-04-17 11:30 23 ****
451 2020-04-17 11:31 22 ***
452 2020-04-17 11:32 23 ****
453 2020-04-17 11:33 23 ****
454 2020-04-17 11:34 23 ****
455 2020-04-17 11:35 24 *****
456 2020-04-17 11:36 23 ****
457 2020-04-17 11:37 23 ****
458 2020-04-17 11:38 24 *****
459 2020-04-17 11:39 24 *****
460 2020-04-17 11:40 24 *****
461 2020-04-17 11:41 23 ****
462 2020-04-17 11:42 24 *****
... ..( 17 skipped). .. *****
2 2020-04-17 12:00 24 *****
3 2020-04-17 12:01 23 ****
... ..( 12 skipped). .. ****
16 2020-04-17 12:14 23 ****
17 2020-04-17 12:15 22 ***
18 2020-04-17 12:16 23 ****
... ..( 5 skipped). .. ****
24 2020-04-17 12:22 23 ****
25 2020-04-17 12:23 24 *****
26 2020-04-17 12:24 24 *****
27 2020-04-17 12:25 23 ****
28 2020-04-17 12:26 19 -
29 2020-04-17 12:27 19 -
30 2020-04-17 12:28 21 **
31 2020-04-17 12:29 21 **
32 2020-04-17 12:30 22 ***
... ..( 9 skipped). .. ***
42 2020-04-17 12:40 22 ***
43 2020-04-17 12:41 23 ****
44 2020-04-17 12:42 23 ****
45 2020-04-17 12:43 23 ****
46 2020-04-17 12:44 21 **
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
Device Statistics (GP/SMART Log 0x04) not supported
Pending Defects log (GP Log 0x0c) supported [please try: '-l defects']
SATA Phy Event Counters (GP Log 0x11)
ID Size Value Description
0x0001 2 0 Command failed due to ICRC error
0x0002 2 0 R_ERR response for data FIS
0x0003 2 0 R_ERR response for device-to-host data FIS
0x0004 2 0 R_ERR response for host-to-device data FIS
0x0005 2 0 R_ERR response for non-data FIS
0x0006 2 0 R_ERR response for device-to-host non-data FIS
0x0007 2 0 R_ERR response for host-to-device non-data FIS
0x0008 2 0 Device-to-host non-data FIS retries
0x0009 2 82 Transition from drive PhyRdy to drive PhyNRdy
0x000a 2 83 Device-to-host register FISes sent due to a COMRESET
0x000b 2 0 CRC errors within host-to-device FIS
0x000d 2 0 Non-CRC errors within host-to-device FIS
0x000f 2 0 R_ERR response for host-to-device data FIS, CRC
0x0012 2 0 R_ERR response for host-to-device non-data FIS, CRC
0x8000 4 1945390 Vendor specific
It's worth noting that I usually fill a brand new drive with pseudorandom data, and then verify that it reads back correctly, before zeroing it and using it. I believe I did this, and of course I didn't know to TRIM it afterwards.
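The fill-with-pseudorandom-data-and-verify pass can be done with a seeded keystream, so the data never needs to be stored: the verify pass just regenerates the same stream and compares. A minimal sketch, demonstrated on a scratch file (on real hardware the target would be the whole device, which is destructive; the passphrase and sizes are arbitrary examples):

```shell
# Seeded pseudorandom fill-and-verify. With -nosalt and a fixed passphrase the
# AES-CTR keystream is deterministic, so the read-back pass can regenerate it.
# TARGET here is a scratch file, not a real disk.
set -eu
work=$(mktemp -d)
TARGET="$work/disk.img"
SIZE=$((8 * 1024 * 1024))    # 8 MiB for the demo; use the drive size in practice

prng() {
    openssl enc -aes-256-ctr -pass pass:fill-test -nosalt \
        < /dev/zero 2>/dev/null | head -c "$SIZE"
}

prng > "$TARGET"                              # fill pass
prng | cmp - "$TARGET" && echo "verify OK"    # read-back pass
```

On a spinning disk each pass takes roughly as long as a full sequential read or write of the drive.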
The write error count, as reported by zpool status, keeps dropping to lower numbers. I think it's probably overflowing at 65.5K errors, is that correct?
Trimming the drive won't make any difference. If anything, a TRIMmed or virgin drive exhibits more errors than a zero-filled one. (The advantage of having 3 such drives is that you can verify this behaviour and know it's not "a bad drive" - and that it's not "excess vibration", as WD tried to paint it, by putting the drives on individual foam blocks outside the chassis...)
Given all the IDNF errors that it's getting right now, I probably didn't test the drive and zero it. Failing at 28% likely corresponds to the amount that I previously wrote.
I zeroed the partition, and tried doing the replace / resilver again. It's failed in the same way, at about 26%. The errors are still IDNF.
I would take the drive out of the array and see if it passes a 4 pass data destructive badblocks test. This should take roughly 40 to 60 hours to complete on a 4TB drive.
I filled the partition with pseudorandom data, and then tried the replace/resilver again. (In case the drive is doing something like a TRIM when you simply zero it.) But that didn't help, it's failed at 27%, spewing write errors as before.
I will now do a four pass destructive badblocks test on the full drive, no partitions.
badblocks -wsv -b 4096 -o badblocks.txt /dev/sdj
98.30% done, 65:08:33 elapsed. (0/0/0 errors)
Pass completed, 0 bad blocks found. (0/0/0 errors)
Read/write speeds were similar, starting the pass at 180 MB/s, and gradually dropping to 80 MB/s by the end of the pass.
How about vibrations? If the pool was idling, that could explain the success of badblocks. I've experienced this with WD Reds and non-server chassis.
The pool is in a Supermicro server, so I doubt vibration has anything to do with it. Older WD Reds have no issues.
I bought a new Seagate IronWolf ST4000VN008-2DR166 for comparison. Running through identical steps, first is a long self-test, then I'll fill it with pseudorandom data and read it back, then I'll zero it and create a labelled GPT partition, before trying to resilver as before.
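Sketched as commands, that preparation sequence looks roughly like the following. The device name and partition label are placeholders, and every step after the self-test is destructive, so this is only an outline:

```shell
# Outline of the comparison-drive preparation sequence described above.
DEV=${DEV:-/dev/sdX}   # placeholder device name

burn_in() {
    smartctl -t long "$DEV"    # 1. start the extended (long) self-test
    # ...wait; progress appears under "Self-test execution status" in smartctl -a
    # 2. fill with seeded pseudorandom data and read it back (stream + cmp)
    dd if=/dev/zero of="$DEV" bs=1M status=progress     # 3. zero the drive
    sgdisk -o -n 1:0:0 -c 1:rpool5 "$DEV"   # 4. fresh GPT, labelled partition
}
```

The function is only defined, not run; "rpool5" is an example label matching the pool naming seen later in this thread, not necessarily the author's exact command.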
The ZFS resilver process is not completely sequential, and I think the random part is what brings the SMR WD Red to its knees: the drive firmware evidently has issues writing to the different CMR/SMR bands, causing tremendous latency and, finally, a dropped drive.
A purely sequential fill workload should not put this kind of stress on an SMR drive, resulting in no apparent problems and even quite good results (i.e. the 180-80 MB/s transfers above).
In this instance it's a drive firmware issue and I think ZFS is doing the right thing.
I've been seeing this too on 3 different drives - it's not a vibration issue. I verified that by isolating the drives on blocks of foam.
Based on Manfred Berger's 2015 OpenZFS Paris SMR presentation and some reading of SMR documentation, I believe the IDNF is a "relatively normal" error for SMR drives that occurs as the head seeks to EOD in SMR zones (just past end of written data) when switching between SMR zones (there are a number of open zones on a DM-SMR drive) and should be masked by the drive - except that WD screwed up.
WD have just issued a new blog update: https://blog.westerndigital.com/wd-red-nas-drives/ (2020/4/22) and given a list of SMR drives, with more details coming.
I've been dealing with WD over this issue for a few weeks, and I suspect their sudden change of stance has something to do with a few things I pointed out. A number of consumer-protection complaints have been filed around the world over the last week, since stories appeared about them putting DM-SMR drives into the channel(*) whilst blowing off user complaints. I also passed on a UK legal opinion yesterday that in a British court they would most likely be found to have acted with misfeasance, not just negligence (IANAL, but a friend is a magistrate), along with the observation that the continued gaslighting of customers in their posts is likely to compound the penalties in judgements against them. This is especially so given that Seagate issued a statement that SMR and NAS don't mix (https://arstechnica.com/information-technology/2020/04/seagate-says-network-attached-storage-and-smr-dont-mix/), WD's own documentation says much the same thing in various places (http://zonedstorage.io/introduction/smr/), and Manfred Berger stated it in his presentation too (https://www.youtube.com/watch?v=a2lnMxMUxyc, around the 13-minute mark).
Under the law in most countries, advertising something as a NAS/RAID drive when it won't work as one is a very bad thing (unfit for the purpose for which it was marketed). It's worse still if you change the product without notice, from a device that worked perfectly to one that doesn't work - or that, when it does work, takes a 90-98% performance hit compared to the previous device unless you treat it "just right" (RAIDZ3 resilvers going from 12-20 hours on 19*2TB drives out to 8-10 DAYS, and scrubs failing to complete or taking 2 weeks, all on idle vdevs). If you then issue the kinds of statements that are in the rest of that WD blog, the courts really won't like it - and since the drive is being punted as a home/SOHO device, consumer-protection laws kick in fairly hard.
(*) https://arstechnica.com/gadgets/2020/04/caveat-emptor-smr-disks-are-being-submarined-into-unexpected-channels/ https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/ https://blocksandfiles.com/2020/04/15/seagate-2-4-and-8tb-barracuda-and-desktop-hdd-smr/ https://blocksandfiles.com/2020/04/16/toshiba-desktop-disk-drives-undocumented-shingle-magnetic-recording/ https://blocksandfiles.com/2020/04/23/western-digital-blog-wd-red-nas-smr-drives-overuse/
Finished resilvering my brand new IronWolf ST4000VN008. No surprises whatsoever.
pool: rpool
state: ONLINE
scan: resilvered 3.38T in 0 days 13:51:05 with 0 errors on Sat Apr 25 23:16:10 2020
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
rpool0 ONLINE 0 0 0
rpool5 ONLINE 0 0 0
rpool2 ONLINE 0 0 0
rpool3 ONLINE 0 0 0
rpool4 ONLINE 0 0 0
errors: No known data errors
Scrub completed without any errors.
After the resilver and the scrub, I then took the new IronWolf offline, and did a sequential copy to the Red EFAX, using dd.
I was unable to do a zpool replace -f, due to issues #3242 and #7866. However, I created a symlink from the old device name, pointing to the new name, and then zpool online succeeded. (Next time, I would just change the partition label.)
During the resilver, the EFAX threw a couple of write errors, which from the kernel log, appeared to be the same IDNF error. It's worth noting that in all the previous configurations, the EFAX was on an IBM SAS2008 card, but now it is the only device on the motherboard's SATA controller, due to a shortage of cables.
Apr 29 01:07:35 host kernel: sd 1:0:0:0: [sdm] tag#27 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 29 01:07:35 host kernel: sd 1:0:0:0: [sdm] tag#27 Sense Key : Illegal Request [current]
Apr 29 01:07:35 host kernel: sd 1:0:0:0: [sdm] tag#27 Add. Sense: Logical block address out of range
Apr 29 01:07:35 host kernel: sd 1:0:0:0: [sdm] tag#27 CDB: Write(16) 8a 00 00 00 00 00 82 5e 66 28 00 00 00 10 00 00
Apr 29 01:07:35 host kernel: print_req_error: I/O error, dev sdm, sector 2187224616
Apr 29 01:11:35 host kernel: sd 1:0:0:0: [sdm] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Apr 29 01:11:35 host kernel: sd 1:0:0:0: [sdm] tag#19 Sense Key : Illegal Request [current]
Apr 29 01:11:35 host kernel: sd 1:0:0:0: [sdm] tag#19 Add. Sense: Logical block address out of range
Apr 29 01:11:35 host kernel: sd 1:0:0:0: [sdm] tag#19 CDB: Write(16) 8a 00 00 00 00 00 82 61 ab e8 00 00 00 08 00 00
Apr 29 01:11:35 host kernel: print_req_error: I/O error, dev sdm, sector 2187439080
pool: rpool
state: ONLINE
status: One or more devices has experienced an unrecoverable error. An
attempt was made to correct the error. Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
using 'zpool clear' or replace the device with 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-9P
scan: resilvered 7.26G in 0 days 00:04:24 with 0 errors on Wed Apr 29 01:18:52 2020
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
rpool0 ONLINE 0 0 0
rpool5 ONLINE 0 2 0
rpool2 ONLINE 0 0 0
rpool3 ONLINE 0 0 0
rpool4 ONLINE 0 0 0
errors: No known data errors
Finally, I did a scrub, and the scrub has completed without any errors.
This is hardly a nice solution to getting an EFAX resilvered, as it's time-consuming, and it costs you one drive worth of redundancy. But for anyone who is desperate enough to try, I can say that it does work. (Try to limit the amount of writes done to the pool during this process, in case the final resilver fails as before.)
FYI, the TOSHIBA MN04 HDD (which includes an RV sensor) seems to have a similar problem.
I tried to replace an MD04-series HDD (Toshiba) with an MN04-series HDD and got many read/write errors. I bought 2 MN04 drives and got similar results with both. No SMART errors are recorded. I haven't looked for new firmware yet.
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: exception Emask 0x50 SAct 0x80800 SErr 0x4090800 action 0xe frozen
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: irq_stat 0x00400040, connection status changed
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1: SError: { HostInt PHYRdyChg 10B8B DevExch }
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: failed command: READ FPDMA QUEUED
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: cmd 60/d0:58:48:93:ff/07:00:10:00:00/40 tag 11 ncq dma 1024000 in
res 40/00:98:18:9b:ff/00:00:10:00:00/40 Emask 0x50 (ATA bus error)
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: status: { DRDY }
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: failed command: READ FPDMA QUEUED
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: cmd 60/f0:98:18:9b:ff/07:00:10:00:00/40 tag 19 ncq dma 1040384 in
res 40/00:98:18:9b:ff/00:00:10:00:00/40 Emask 0x50 (ATA bus error)
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1.00: status: { DRDY }
May 23 06:57:17 dev.inter.es-i.jp kernel: ata1: hard resetting link
May 23 06:57:20 dev.inter.es-i.jp kernel: ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
May 23 06:57:20 dev.inter.es-i.jp kernel: ata1.00: configured for UDMA/100
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#11 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#11 Sense Key : Illegal Request [current]
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#11 Add. Sense: Unaligned write command
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#11 CDB: Read(16) 88 00 00 00 00 00 10 ff 93 48 00 00 07 d0 00 00
May 23 06:57:20 dev.inter.es-i.jp kernel: blk_update_request: I/O error, dev sda, sector 285184840 op 0x0:(READ) flags 0x700 phys_seg 17 prio class 0
May 23 06:57:20 dev.inter.es-i.jp kernel: zio pool=datastore vdev=/dev/disk/by-id/ata-TOSHIBA_MN04ACA400_4071K3DXF7ME-part1 error=5 type=1 offset=146013589504 size=1024000 flags=40080ca8
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#19 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=3s
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#19 Sense Key : Illegal Request [current]
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#19 Add. Sense: Unaligned write command
May 23 06:57:20 dev.inter.es-i.jp kernel: sd 0:0:0:0: [sda] tag#19 CDB: Read(16) 88 00 00 00 00 00 10 ff 9b 18 00 00 07 f0 00 00
May 23 06:57:20 dev.inter.es-i.jp kernel: blk_update_request: I/O error, dev sda, sector 285186840 op 0x0:(READ) flags 0x700 phys_seg 18 prio class 0
May 23 06:57:20 dev.inter.es-i.jp kernel: zio pool=datastore vdev=/dev/disk/by-id/ata-TOSHIBA_MN04ACA400_4071K3DXF7ME-part1 error=5 type=1 offset=146014613504 size=1040384 flags=40080ca8
May 23 06:57:20 dev.inter.es-i.jp kernel: ata1: EH complete
May 23 06:57:21 dev.inter.es-i.jp zed[698351]: eid=712 class=io pool_guid=0xE00223B81DC49FEF vdev_path=/dev/disk/by-id/ata-TOSHIBA_MN04ACA400_4071K3DXF7ME-part1
May 23 06:57:21 dev.inter.es-i.jp zed[698357]: eid=713 class=io pool_guid=0xE00223B81DC49FEF vdev_path=/dev/disk/by-id/ata-TOSHIBA_MN04ACA400_4071K3DXF7ME-part1
May 23 06:57:21 dev.inter.es-i.jp zed[699834]: eid=714 class=io pool_guid=0xE00223B81DC49FEF vdev_path=/dev/disk/by-id/ata-TOSHIBA_MN04ACA400_4071K3DXF7ME-part1
I've also found this issue with a WD Blue WD20SPZX, which is a 2.5" 2TB drive that I have in a laptop.
One day I attempted a transfer of ~500 GB of files to an NTFS partition under Windows, which got slower and slower until I killed the transfer. After rebooting to Ubuntu, it then also generated lengthy timeouts with ZFS. This was a nearly new drive, so I thought I must have a faulty SATA controller. I mirrored the zpool with an external USB3 drive, and then dropped it out of the mirror.
Now that I know about this issue with WD drives, I can confirm the internal drive is behaving just like the WD40EFAX: it doesn't seem to be faulty, but it won't resilver into the mirror. It eventually throws thousands of write errors until ZFS faults it.
I have 4 WDC_WD30EFRX drives installed in an HP Microserver Gen8, connected to the motherboard SATA controller configured in AHCI mode. At first there were 2 drives in a ZFS mirror configuration and everything was fine. After a few months I added 2 more drives and upgraded the mirror to RAID10. Since then I've been getting random write errors, most likely during heavy write load. Scrub finishes without any error. No SMART errors.
[213957.995776] ata3.00: exception Emask 0x0 SAct 0xc000000 SErr 0x0 action 0x0
[213957.996438] ata3.00: irq_stat 0x40000008
[213957.996822] ata3.00: failed command: WRITE FPDMA QUEUED
[213957.997206] ata3.00: cmd 61/00:d0:f8:49:18/08:00:e3:00:00/40 tag 26 ncq dma 1048576 out
                         res 41/10:00:f8:49:18/00:00:e3:00:00/40 Emask 0x481 (invalid argument)
[213957.998508] ata3.00: status: { DRDY ERR }
[213957.998880] ata3.00: error: { IDNF }
[213958.000293] ata3.00: configured for UDMA/133
[213958.000689] sd 3:0:0:0: [sdg] tag#26 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[213958.001381] sd 3:0:0:0: [sdg] tag#26 Sense Key : Illegal Request [current]
[213958.002070] sd 3:0:0:0: [sdg] tag#26 Add. Sense: Logical block address out of range
[213958.002759] sd 3:0:0:0: [sdg] tag#26 CDB: Write(16) 8a 00 00 00 00 00 e3 18 49 f8 00 00 08 00 00 00
[213958.003455] blk_update_request: I/O error, dev sdg, sector 3810019832 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[213958.004180] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKRDJR-part1 error=5 type=2 offset=1950729105408 size=1048576 flags=40080c80
[213958.005199] ata3: EH complete
[213958.016612] ata1.00: exception Emask 0x0 SAct 0x280 SErr 0x0 action 0x0
[213958.017262] ata1.00: irq_stat 0x40000008
[213958.017612] ata1.00: failed command: WRITE FPDMA QUEUED
[213958.017968] ata1.00: cmd 61/00:38:00:f4:da/08:00:4b:00:00/40 tag 7 ncq dma 1048576 out
                         res 41/10:00:00:f4:da/00:00:4b:00:00/40 Emask 0x481 (invalid argument)
[213958.019218] ata1.00: status: { DRDY ERR }
[213958.019581] ata1.00: error: { IDNF }
[213958.019944] ata4.00: exception Emask 0x0 SAct 0x180 SErr 0x0 action 0x0
[213958.020621] ata4.00: irq_stat 0x40000008
[213958.020997] ata4.00: failed command: WRITE FPDMA QUEUED
[213958.021371] ata4.00: cmd 61/00:38:f8:52:18/08:00:e3:00:00/40 tag 7 ncq dma 1048576 out
                         res 41/10:00:f8:52:18/00:00:e3:00:00/40 Emask 0x481 (invalid argument)
[213958.022700] ata4.00: status: { DRDY ERR }
[213958.023071] ata4.00: error: { IDNF }
[213958.025547] ata1.00: configured for UDMA/133
[213958.025959] ata4.00: configured for UDMA/133
[213958.025963] sd 1:0:0:0: [sde] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[213958.026338] sd 4:0:0:0: [sdh] tag#7 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[213958.027017] sd 1:0:0:0: [sde] tag#7 Sense Key : Illegal Request [current]
[213958.027718] sd 4:0:0:0: [sdh] tag#7 Sense Key : Illegal Request [current]
[213958.028392] sd 1:0:0:0: [sde] tag#7 Add. Sense: Logical block address out of range
[213958.029075] sd 4:0:0:0: [sdh] tag#7 Add. Sense: Logical block address out of range
[213958.029078] sd 4:0:0:0: [sdh] tag#7 CDB: Write(16) 8a 00 00 00 00 00 e3 18 52 f8 00 00 08 00 00 00
[213958.029767] sd 1:0:0:0: [sde] tag#7 CDB: Write(16) 8a 00 00 00 00 00 4b da f4 00 00 00 08 00 00 00
[213958.030456] blk_update_request: I/O error, dev sdh, sector 3810022136 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[213958.031145] blk_update_request: I/O error, dev sde, sector 1272640512 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[213958.031840] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6EVNSY9-part1 error=5 type=2 offset=1950730285056 size=1048576 flags=40080c80
[213958.031849] ata4: EH complete
[213958.034629] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0505144-part1 error=5 type=2 offset=651590893568 size=1048576 flags=40080c80
I'm surprised to be seeing those IDNF errors from a WD30EFRX. WD maintain that these are not SMR. See e.g. https://www.westerndigital.com/products/internal-drives/wd-red-hdd
Do you have any other drive model in that array?
I have three WD40EFRX Red NAS drives in mine, and they have performed without issue. All with firmware 82.00A82.
I was surprised to find that the WD40EFAX drive also had firmware 82.00A82, which leads me to wonder whether the DM-SMR functionality has been in the firmware all along, but they only just enabled it through the configuration data in those particular drives.
Can you find any difference between your old and new drives? Try hdparm -I
and see whether you see entries like:
* Data Set Management TRIM supported (limit 10 blocks)
* Deterministic read ZEROs after TRIM
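The presence of those TRIM entries on a spinning drive is a strong DM-SMR tell, since a conventional CMR disk has no reason to advertise TRIM. As a rough sketch (the helper name `is_possible_smr` is mine, not an existing tool), the check can be scripted:

```shell
# Sketch: read `hdparm -I` output on stdin and succeed if the drive
# advertises TRIM, which is unexpected on a spinning CMR disk and a
# known DM-SMR indicator. Helper name is illustrative.
is_possible_smr() {
    grep -q 'Data Set Management TRIM supported'
}

# Example use (device path illustrative):
#   hdparm -I /dev/sda | is_possible_smr && echo "/dev/sda may be DM-SMR"
```

This only flags a suspicious drive; confirming SMR behaviour still requires checking the model against the vendor's published lists.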
Hi, I have only WD30EFRX drives in this RAID10 array. The 2 older drives (a few years old) are:
WDC WD30EFRX-68EUZN0
Serial Number: WD-WMC4N0505144 Firmware Revision: 80.00A80
It looks like it's an older hardware revision, and the latest firmware for it is 80.00A80,
and the other 2 installed recently are
Model Number: WDC WD30EFRX-68EUZN0
Serial Number: WD-WCC4N4TKRDJR Firmware Revision: 82.00A82
The output of hdparm -I /dev/sd[e-h] | grep TRIM is empty, so these are not SMR models.
I have more drives in different arrays and I've never had a single issue with them, but those are connected to an LSI HBA controller. One thing I noticed after posting my previous comment: the 2 older drives had the built-in APM timer set to 300s (checked with idle3-tools, and set it to disabled), while the other 2 had APM disabled by default.
I don't know the origin of this issue, but it randomly affects anywhere from one to all drives in this array.
I have 4 WDC_WD30EFRX installed in HP Microserver Gen8 connected to motherboard sata controller configured in AHCI mode. At first there were 2 drives in ZFS Mirror configuration and everything was fine. After few months I added 2 more drives and upgraded mirror to RAID10. Since then I'm getting random write errors, most likely during heavy write load. Scrub finishes without any error. No SMART errors.
HP Microserver Gen8 gave me various troubles with high capacity drives until I disabled AHCI and switched to IDE/Legacy mode. Give it a try.
What kind of troubles? It worked fine with 2x3TB but throws errors with 4x3TB...
I checked the differences between the old and new drives, but there's nothing that matters... sde = one of the old drives, sdg = one of the new drives
< /dev/sde:
---
> /dev/sdg:
6,7c6,7
< Serial Number: WD-WMC4N0505144
< Firmware Revision: 80.00A80
---
> Serial Number: WD-WCC4N4TKRDJR
> Firmware Revision: 82.00A82
95,96c95,96
< 420min for SECURITY ERASE UNIT. 420min for ENHANCED SECURITY ERASE UNIT.
< Logical Unit WWN Device Identifier: 50014ee003a891af
---
> 416min for SECURITY ERASE UNIT. 416min for ENHANCED SECURITY ERASE UNIT.
> Logical Unit WWN Device Identifier: 50014ee20d6ce16e
99c99
< Unique ID : 003a891af
---
> Unique ID : 20d6ce16e
What kind of troubles? It worked fine with 2x3TB but throws errors with 4x3TB...
BIOS recognition issues and slow/unpredictable performance. After switching to IDE, all issues disappeared.
See https://forums.unraid.net/topic/72428-solved-clearing-synology-drive/
It's possible your EFRX drive has corrupted sectors that can't be read. (Power loss during writing?) Maybe try zeroing the whole drive, and see if that solves it.
@darrenfreeman My issue is write-related, not read-related. Besides, when this happens it affects many/random drives at once. I've never experienced a power loss; my server is connected to a UPS.
Do you have ashift=12? If it's smaller, then that will incur read-modify-write.
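The read-modify-write concern can be stated mechanically: ZFS issues writes in 2^ashift-byte units, so ashift must be large enough to cover the drive's physical sector size. A minimal sketch (the helper name is illustrative, not a ZFS tool):

```shell
# ashift_covers: succeed if 2^ashift is at least the physical sector
# size, i.e. no read-modify-write is forced by the sector geometry.
ashift_covers() {
    # $1 = ashift, $2 = physical sector size in bytes
    [ $((1 << $1)) -ge "$2" ]
}

# Example (paths illustrative):
#   zdb -C | grep ashift                          # pool's ashift
#   cat /sys/block/sda/queue/physical_block_size  # drive's sector size
#   ashift_covers 12 4096 && echo "ashift=12 covers 4K sectors"
```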
It looks like the same error that we see on the SMR drives, but yours aren't known to be SMR.
Can you try zeroing the whole drive anyway, and see if it throws those errors?
@darrenfreeman Yes, all zpools have ashift=12
sudo zdb -C | grep ashift
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
        ashift: 12
I can't zero any drive; they contain important data. Besides, I think it's pointless. As I mentioned before, this error appears randomly on 1-4 drives in this zpool, at random sectors, at the same moment. Sometimes it's drive 1, another time 1+3, another 1,2,3 or 1,2,4, so this can't be just a single-drive issue. Until I expanded this zpool from 2 to 4 drives, this error never happened. I think it has something to do with the 4 drives working together, maybe a SATA controller or RAID10 data-distribution issue...
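When the failures hit a shifting subset of drives, it can help to tally which vdevs actually appear in the kernel's zio error lines over time. A hypothetical helper (the name and parsing are mine, based on the message format quoted above) that summarizes dmesg output:

```shell
# failing_vdevs: read kernel log text on stdin, count zio error lines
# per vdev, and print them most-affected first. The vdev= token format
# matches the zio messages quoted earlier in this thread.
failing_vdevs() {
    grep -o 'vdev=[^ ]*' | sort | uniq -c | sort -rn
}

# Example use:
#   dmesg | failing_vdevs
```

Run periodically, this makes it easy to see whether the same drive keeps failing or whether the errors really are spread across the array.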
This has all the hallmarks of a power supply problem - especially as it only happened once you added more drives
Replace your PSU and see if the problems continue. Pay close attention to the wattage of the unit.
@Stoatwblr I don't think so. The total number of drives hasn't changed, just the configuration. It was
Single, Single, 2 drives in RAID1 - total 4 drives
now it's
4 drives in RAID 10
I have 4 more drives (3x SSD + 1x HDD) connected to a different SATA controller, working without any problem since the beginning.
Putting everything together: I have 8 drives: 4 connected to motherboard SATA controller with described issue and another 4 connected to LSI HBA controller working flawlessly.
Replace the PSU anyway; it is the common factor.
It could be a SATA issue.
Some SATA controllers are notorious for problems. One bad drive can throw errors on other drives, and it can throw other good drives out of your array. (I've had this happen.)
You could try moving one drive out of each mirror onto the LSI HBA, just to see if those drives stop having issues.
@darrenfreeman I'll do that, but I have to buy a new 8i LSI controller because the one I have in my server is 4i only, with all ports in use.
@QBANIN again, before buying a new HBA: try setting your SATA port to IDE mode (rather than AHCI).
@shodanshok Will do. Before switching to IDE I disabled the write cache (in the BIOS) for the built-in controller. I've been running a bonnie++ HDD stress test for a few hours and so far no error has appeared. Is it possible that this issue is somehow related to buggy write-cache handling by ZFS?
BTW: I don't see any downsides to disabling the write cache. Sequential write speed to this RAID10 zpool is >200MB/s. I can't test random write speed, but it's not as important, because this is a typical storage pool for long-term data uploaded via a single FTP/SMB connection or backup software.
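As an aside, the volatile write cache can also be toggled per drive with `hdparm -W`, rather than globally in the BIOS, which makes it easier to test one drive at a time. A sketch, including a small parser (the helper name is mine) for hdparm's status line:

```shell
# Per-drive write-cache control (device path illustrative):
#   hdparm -W /dev/sdX      # query current state
#   hdparm -W 0 /dev/sdX    # disable volatile write cache
#   hdparm -W 1 /dev/sdX    # re-enable it

# write_cache_state: parse `hdparm -W` output on stdin into "on"/"off".
# The " write-caching =  1 (on)" line format is what hdparm prints.
write_cache_state() {
    awk '/write-caching/ { st = ($3 == 1) ? "on" : "off"; print st }'
}
```

Note this setting does not persist across power cycles on many drives, so it would need to be reapplied at boot (e.g. via a udev rule).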
@QBANIN the disk's own writeback cache can be useful when issuing random writes without invoking fsync, which can happen quite a lot. The typical ZFS write pattern can lower the impact of disabling this cache, but I would avoid disabling it if possible.
ZFS has excellent support for disk caching, so no, it is very unlikely (if not impossible) that this is ZFS's fault. Theoretically, disabling the disk cache can avoid the problem only if the DRAM cache itself is broken, which is a very remote possibility when it happens on multiple disks at the same time. But, as suggested by @Stoatwblr, it could be the power supply having problems and corrupting in-cache data (I saw something similar with an assembled NAS).
As suggested, re-enable the writeback cache and try to:
* set SATA to IDE
* change the power supply (or use another power rail at the very least)
@shodanshok I can't replace power supply but will test IDE legacy mode.
I was suggesting that you swap a couple of drives between the LSI HBA and the SATA controller. Same number of ports, but see if the problem follows the drive, or the controller.
You were right about IDE mode. The issue is gone after switching from AHCI to IDE. :D So far 9 days without any write error.
@QBANIN glad to hear that ;) The AHCI mode of these small HP Microservers Gen8 seems really problematic...
@shodanshok That's why I'm going to switch to some 8i LSI controller anyway :)
Since I had occasional read errors with good HDDs, I replaced the power supply and my problem is gone. Even though I didn't have any issues other than ZFS read failures, it seems the cause was a faulty power supply.
It happened again a few days after my last comment. I switched to an LSI 9207-8i and... it happened again. It looks like this issue affects only the WD Red 3.5" (CMR) drives, anywhere from 1/4 up to 4/4 at once (4 this time). The remaining 2x HDD (WD Blue 2.5" and Seagate 2.5") and 3x SSD drives work fine.
[czw sie 6 01:56:31 2020] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[czw sie 6 01:56:31 2020] sd 0:0:7:0: [sdh] tag#4645 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[czw sie 6 01:56:31 2020] sd 0:0:7:0: [sdh] tag#4646 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[czw sie 6 01:56:31 2020] sd 0:0:7:0: [sdh] tag#4645 Sense Key : Illegal Request [current] [descriptor]
[czw sie 6 01:56:31 2020] sd 0:0:7:0: [sdh] tag#4645 Add. Sense: Logical block address out of range
[czw sie 6 01:56:31 2020] sd 0:0:7:0: [sdh] tag#4645 CDB: Write(16) 8a 00 00 00 00 01 2c c1 14 28 00 00 08 00 00 00
[czw sie 6 01:56:31 2020] blk_update_request: critical target error, dev sdh, sector 5045818408 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[czw sie 6 01:56:31 2020] sd 0:0:7:0: [sdh] tag#4646 CDB: Write(16) 8a 00 00 00 00 01 2c c1 1c 28 00 00 04 00 00 00
[czw sie 6 01:56:31 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0505144-part1 error=121 type=2 offset=2583457976320 size=1048576 flags=40080c80
[czw sie 6 01:56:31 2020] blk_update_request: I/O error, dev sdh, sector 5045820456 op 0x1:(WRITE) flags 0x700 phys_seg 8 prio class 0
[czw sie 6 01:56:31 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0505144-part1 error=5 type=2 offset=2583459024896 size=524288 flags=40080c80
[czw sie 6 01:56:31 2020] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[czw sie 6 01:56:31 2020] sd 0:0:4:0: [sde] tag#4653 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[czw sie 6 01:56:31 2020] sd 0:0:4:0: [sde] tag#4653 CDB: Write(16) 8a 00 00 00 00 01 0a dc 9c 28 00 00 08 00 00 00
[czw sie 6 01:56:31 2020] blk_update_request: I/O error, dev sde, sector 4477197352 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[czw sie 6 01:56:31 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6EVNSY9-part1 error=5 type=2 offset=2292323995648 size=1048576 flags=40080c80
[czw sie 6 01:56:31 2020] sd 0:0:4:0: [sde] tag#4652 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[czw sie 6 01:56:31 2020] sd 0:0:4:0: [sde] tag#4652 Sense Key : Illegal Request [current] [descriptor]
[czw sie 6 01:56:31 2020] sd 0:0:4:0: [sde] tag#4652 Add. Sense: Logical block address out of range
[czw sie 6 01:56:31 2020] sd 0:0:4:0: [sde] tag#4652 CDB: Write(16) 8a 00 00 00 00 01 0a dc 94 28 00 00 08 00 00 00
[czw sie 6 01:56:31 2020] blk_update_request: critical target error, dev sde, sector 4477195304 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[czw sie 6 01:56:31 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6EVNSY9-part1 error=121 type=2 offset=2292322947072 size=1048576 flags=40080c80
[czw sie 6 01:56:32 2020] mpt2sas_cm0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)
[czw sie 6 01:56:32 2020] mpt2sas_cm0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)
[czw sie 6 01:56:32 2020] sd 0:0:5:0: [sdf] tag#4649 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[czw sie 6 01:56:32 2020] sd 0:0:5:0: [sdf] tag#4649 CDB: Write(16) 8a 00 00 00 00 01 0a dc 8d 28 00 00 08 00 00 00
[czw sie 6 01:56:32 2020] blk_update_request: I/O error, dev sdf, sector 4477193512 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[czw sie 6 01:56:32 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKRDJR-part1 error=5 type=2 offset=2292322029568 size=1048576 flags=40080c80
[czw sie 6 01:56:32 2020] sd 0:0:5:0: [sdf] tag#4669 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[czw sie 6 01:56:32 2020] sd 0:0:5:0: [sdf] tag#4669 CDB: Write(16) 8a 00 00 00 00 01 0a dc 95 28 00 00 08 00 00 00
[czw sie 6 01:56:32 2020] blk_update_request: I/O error, dev sdf, sector 4477195560 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[czw sie 6 01:56:32 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKRDJR-part1 error=5 type=2 offset=2292323078144 size=1048576 flags=40080c80
[czw sie 6 01:56:32 2020] sd 0:0:5:0: Power-on or device reset occurred
[czw sie 6 01:56:32 2020] mpt2sas_cm0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)
[czw sie 6 01:56:32 2020] mpt2sas_cm0: log_info(0x31120436): originator(PL), code(0x12), sub_code(0x0436)
[czw sie 6 01:56:32 2020] sd 0:0:6:0: [sdg] tag#4840 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[czw sie 6 01:56:32 2020] sd 0:0:6:0: [sdg] tag#4644 FAILED Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[czw sie 6 01:56:32 2020] sd 0:0:6:0: [sdg] tag#4840 CDB: Write(16) 8a 00 00 00 00 01 2c c1 0d 28 00 00 06 00 00 00
[czw sie 6 01:56:32 2020] sd 0:0:6:0: [sdg] tag#4644 CDB: Write(16) 8a 00 00 00 00 01 2c c1 05 28 00 00 08 00 00 00
[czw sie 6 01:56:32 2020] blk_update_request: I/O error, dev sdg, sector 5045816616 op 0x1:(WRITE) flags 0x700 phys_seg 12 prio class 0
[czw sie 6 01:56:32 2020] blk_update_request: I/O error, dev sdg, sector 5045814568 op 0x1:(WRITE) flags 0x700 phys_seg 16 prio class 0
[czw sie 6 01:56:32 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0371995-part1 error=5 type=2 offset=2583457058816 size=786432 flags=40080c80
[czw sie 6 01:56:32 2020] zio pool=RAID1 vdev=/dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0371995-part1 error=5 type=2 offset=2583456010240 size=1048576 flags=40080c80
[czw sie 6 01:56:32 2020] sd 0:0:6:0: Power-on or device reset occurred
[czw sie 6 01:56:32 2020] sd 0:0:5:0: Power-on or device reset occurred
[czw sie 6 01:56:33 2020] sd 0:0:6:0: Power-on or device reset occurred
  pool: RAID1
 state: DEGRADED
status: One or more devices are faulted in response to persistent errors. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device repaired.
  scan: resilvered 54,8M in 0 days 00:00:04 with 0 errors on Thu Aug 6 01:56:45 2020
config:
NAME STATE READ WRITE CKSUM
RAID1 DEGRADED 0 0 0
mirror-0 DEGRADED 0 9 0
ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N4TKRDJR DEGRADED 0 18 3 too many errors
ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N6EVNSY9 FAULTED 0 18 0 too many errors
mirror-2 DEGRADED 0 0 0
ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0505144 FAULTED 0 14 0 too many errors
ata-WDC_WD30EFRX-68EUZN0_WD-WMC4N0371995 DEGRADED 0 16 13 too many errors
logs
ata-Samsung_SSD_840_PRO_Series_S12PNEAD414919H-part4 ONLINE 0 0 0
cache
sdc5 ONLINE 0 0 0
After zpool clear RAID1 (in fact it's RAID10), everything works like before. Scrub shows no issues.
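After a `zpool clear`, one way to catch a recurrence early is to periodically scan `zpool status` for rows whose WRITE column is nonzero. A sketch (the helper name is mine; the column positions match the status output quoted above):

```shell
# vdevs_with_write_errors: read `zpool status` output on stdin and
# print the names of vdevs/drives with a nonzero WRITE error count.
# Config-section columns are: NAME STATE READ WRITE CKSUM
vdevs_with_write_errors() {
    awk 'NF >= 5 && $4 ~ /^[0-9]+$/ && $4 != 0 { print $1 }'
}

# Example use (pool name from this thread):
#   zpool status RAID1 | vdevs_with_write_errors
```

Paired with a cron job and zed's email notifications, this gives a simple record of which drives accumulate write errors between clears.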
Closing. From the discussion it sounds like issues reported here were caused by hardware.
System information
Describe the problem you're observing
I purchased a brand new Western Digital WD40EFAX. Then after I purchased it, this article came out, saying they are likely not suitable for use with zfs, because they fail to resilver: https://blocksandfiles.com/2020/04/14/wd-red-nas-drives-shingled-magnetic-recording/
So to see if this is the case, I did a zpool replace of a known good drive for this new drive. And I can confirm that this drive is unable to resilver. It has effectively hung at 27%: IOPS fell to zero, and the write error count is going up.
I will be returning this drive as not fit for purpose. I'll leave it trying to resilver for a few days, and so if anyone wants me to do further investigation, I can do this before returning it.
Note that issue #10182 also refers to performance issues caused by this model being SMR masquerading as CMR. The situation now seems far worse than that.
Describe how to reproduce the problem
zpool replace onto a Western Digital WD40EFAX.
Include any warning/errors/backtraces from the system logs