sown / tasks

Tasks for sown projects

Replace failed drive in vms-b53-1 #73

Closed TimStallard closed 2 years ago

TimStallard commented 4 years ago

vms-b53-1 has a failed drive, and the RAID controller has thrown it out of the array.

root@vms-b53-1:~# smartctl -a -d megaraid,1 /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-101-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Constellation ES.3
Device Model:     ST1000NM0033-9ZM173
Serial Number:    Z1W16L25
LU WWN Device Id: 5 000c50 065b5eca8
Add. Product Id:  DELL(tm)
Firmware Version: GA06
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Enclosure Device ID: 32
Slot Number: 1
Enclosure position: 1
Device Id: 1

It's a 1TB 7.2k SATA drive.

We should probably obtain a replacement and swap it when we move the server to B53.

heliosfa commented 4 years ago

Remind me, is VMS a Dell server? If so, re-seating the drive may be all that is needed - their controllers are notoriously finicky.

That (or a drive swap) may be something that we can arrange access for.

drn05r commented 4 years ago

Yes, it is a Dell server. @TimStallard do you know when the disk failed, or have you only just added the monitoring for this? We have had a couple of power outages lately, which may have caused this disk failure. I am not sure whether VMS-B53-1 was affected, or whether the UPS carried it through any brief outages.

heliosfa commented 4 years ago

Yeah, a power blip could certainly cause a drive to fall off a PERC array and a re-seat and re-build is the first thing to try for that.

TimStallard commented 4 years ago

The drive started failing a while back, and monitoring only picked it up once the RAID controller actually considered the drive as failed. First entry in the smartd log was:

Mar 26 18:32:55 vms-b53-1 smartd[1059]: Device: /dev/bus/0 [megaraid_disk_01] [SAT], 1 Currently unreadable (pending) sectors
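
For anyone wanting to pull these entries out of the log automatically, here is a minimal Python sketch that extracts the device, disk label and pending-sector count from a line like the one above. The regular expression is written against this one quoted message only, and the `parse_pending` helper name is made up for illustration; other smartd messages are worded differently and would need their own patterns.

```python
import re

# Matches the specific smartd message quoted in this thread.
# Assumption: the wording "N Currently unreadable (pending) sectors" is stable;
# other smartd messages are not handled here.
PENDING_RE = re.compile(
    r"smartd\[\d+\]: Device: (?P<device>\S+) \[(?P<disk>[^\]]+)\].*?"
    r"(?P<count>\d+) Currently unreadable \(pending\) sectors"
)


def parse_pending(line):
    """Return (device, disk, count) for a matching line, else None."""
    match = PENDING_RE.search(line)
    if match is None:
        return None
    return match.group("device"), match.group("disk"), int(match.group("count"))


# The log line from this comment, split across two string literals for width.
LINE = (
    "Mar 26 18:32:55 vms-b53-1 smartd[1059]: Device: /dev/bus/0 "
    "[megaraid_disk_01] [SAT], 1 Currently unreadable (pending) sectors"
)

print(parse_pending(LINE))
```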

We're currently at:

root@vms-b53-1:~# smartctl -a -d megaraid,1 /dev/sda
...
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       5
...
197 Current_Pending_Sector  0x0012   099   098   000    Old_age   Always       -       229

The RAID controller considered the drive as FAILED until I did a reboot last night - since then it seems to have removed the drive from the array entirely, the drive is currently Unconfigured(good).
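
The attribute lines above can also be checked programmatically. This is only a rough sketch: it assumes the usual ten-column layout of the smartctl attribute table (ID, name, flag, value, worst, threshold, type, updated, when-failed, raw value), and the rule "any non-zero raw count is worth a look" is my own illustrative choice, not something smartmontools or the RAID controller applies. The names `parse_attributes` and `nonzero_counts` are hypothetical.

```python
# Attributes discussed in this issue; the set is an illustrative choice.
WATCHED = {"Reallocated_Sector_Ct", "Current_Pending_Sector"}


def parse_attributes(text):
    """Parse watched rows of a smartctl attribute table into a dict.

    Assumes ten whitespace-separated columns per row, with the raw value
    in the last column being a plain integer for the watched attributes.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip elided lines ("...") and anything that is not an attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        if name in WATCHED:
            attrs[name] = {
                "value": int(fields[3]),   # normalised value
                "thresh": int(fields[5]),  # failure threshold
                "raw": int(fields[9]),     # raw sector count
            }
    return attrs


def nonzero_counts(attrs):
    """Return the watched attribute names whose raw count is above zero."""
    return sorted(name for name, a in attrs.items() if a["raw"] > 0)


# The two attribute rows quoted in this comment.
SAMPLE = """\
...
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       5
...
197 Current_Pending_Sector  0x0012   099   098   000    Old_age   Always       -       229
"""

attrs = parse_attributes(SAMPLE)
print(nonzero_counts(attrs))
```

Note that the normalised value for Reallocated_Sector_Ct (100) is still well above its threshold (010), so a check based only on thresholds would not flag this drive; the raw counts are what show the problem here.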

drn05r commented 4 years ago

This is a bit moot at the moment due to the lack of building access. Unless another disk fails, it is not a major issue for now. Even then, the server does not have anything absolutely critical on it, and if it did, that would be backed up by our (@TimStallard and @trickeydan) new ZFS backup solution.

TimStallard commented 4 years ago

Yeah, I think this is only worthwhile sorting out once we have physical access again and move the server to B53, and start putting services on it - it's effectively empty at the moment.

drn05r commented 4 years ago

I have asked the faculty research systems manager whether there is likely to still be a warranty in force, to see if we can get a free new disk. However, I think this is unlikely as the server is now over 6 years old.

TimStallard commented 4 years ago

Putting the service tag into Dell's site suggests that the warranty expired 03 JAN 2017 :(

drn05r commented 2 years ago

Disk replaced and RAID1 array rebuilt.

TimStallard commented 2 years ago

The replacement drive seems to have failed :(

root@vms-b53-1:~# smartctl -a /dev/sda -d sat+megaraid,01 | grep -i realloc
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       47

and the RAID controller kicked it out as failed

drn05r commented 2 years ago

I have emailed Lance about whether he is happy for me to order:

https://www.novatech.co.uk/products/seagate-barracuda-1tb-3-5-inch-desktop-hard-drive-hdd/st1000dm010.html

I think we agreed that getting a cheaper disk that meets basic requirements from a regular supplier is a safer option. I will try to get a replacement for the failed disk as well. However, I don't know if I would trust that disk in a server that may not be easy to access once deployed in B53.

drn05r commented 2 years ago

A new drive (Seagate Barracuda 1TB) was used to replace the drive this time. It was successfully rebuilt into the RAID array and looks to be functioning without issue after being powered on for 428 hours so far.

drn05r commented 2 years ago

I have given the failed replacement disk back to Lance to see if it can be sent back for a refund/replacement. Not sure what we will do with the replacement if we get one.