openstreetmap / operations

OSMF Operations Working Group issue tracking
https://operations.osmfoundation.org/

Degraded array on necrosan #845

Open tomhughes opened 1 year ago

tomhughes commented 1 year ago

A disk on necrosan appears to have failed (it no longer seems to be visible to the controller) and as a result it has a degraded array:

-- Controller information --
-- ID | H/W Model       | RAM    | Temp | BBU    | Firmware     
c0    | PERC H730P Mini | 2048MB | 50C  | Good   | FW: 25.5.9.0001 

-- Array information --
-- ID | Type   |    Size |  Strpsz | Flags | DskCache |   Status |  OS Path | CacheCade |InProgress   
c0u0  | RAID-5 |   7854G |   64 KB | RA,WB |  Enabled | Degraded | /dev/sda | None      |None         

-- Disk information --
-- ID   | Type | Drive Model                               | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID  
c0u0p0  | SSD  | PHDV7293005A960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 24C  | [32:0]   | 0       
c0u0p1  | SSD  | PHDV729401VH960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 24C  | [32:1]   | 1       
c0u0p2  | SSD  | PHDV729401VL960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 23C  | [32:2]   | 2       
c0u0p3  | SSD  | PHDV7294025G960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 23C  | [32:3]   | 3       
c0u0p4  | SSD  | PHDV72940259960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 23C  | [32:4]   | 4       
c0u0p5  | SSD  | PHDV72940209960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 23C  | [32:5]   | 5       
c0u0p6  | SSD  | PHDV7294025F960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 22C  | [32:6]   | 6       
c0u0p7  | SSD  | Y7ES1072TBLTTHNSF8960CCSE DACB            | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 23C  | [32:7]   | 7       
c0u0p9  | SSD  | PHDV7294023T960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 23C  | [32:9]   | 9       

There is at least one disk/array in a NOT OPTIMAL state.
RAID ERROR - Arrays: OK:0 Bad:1 - Disks: OK:9 Bad:0

The machine is an old donated render server that is not currently in use.
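The listing above has nine disk rows and no row for slot 8. A minimal sketch of how that gap could be picked out automatically is shown here. It assumes the array's member slots are numbered 0 to 9 with no intentional gaps, which the report does not state. `ROWS`, `EXPECTED_SLOTS`, and `missing_slots` are hypothetical names used only for illustration.

```python
import re

# Rows from the listing above, trimmed to the ID and Slot ID columns.
ROWS = """\
c0u0p0 | [32:0]
c0u0p1 | [32:1]
c0u0p2 | [32:2]
c0u0p3 | [32:3]
c0u0p4 | [32:4]
c0u0p5 | [32:5]
c0u0p6 | [32:6]
c0u0p7 | [32:7]
c0u0p9 | [32:9]
"""

# Assumption: ten member slots, numbered 0 to 9.
EXPECTED_SLOTS = range(10)


def missing_slots(rows, expected=EXPECTED_SLOTS):
    """Return the expected slot numbers that have no row in the listing."""
    # Slot IDs look like [enclosure:slot]; capture the slot part.
    seen = {int(m.group(1)) for m in re.finditer(r"\[\d+:(\d+)\]", rows)}
    return sorted(set(expected) - seen)


print(missing_slots(ROWS))  # [8]
```

The summary line only counts the disks that are still visible (OK:9 Bad:0), so a check like this is one way to name the member that has dropped out.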

Firefishy commented 1 year ago

The disk came back soon after a reboot. For the moment it looks ok. If it fails again we can re-open this ticket.

Firefishy commented 3 months ago

A disk has failed in the server and does not appear to be coming back after a restart.

Firefishy commented 3 months ago

-- Controller information --
-- ID | H/W Model       | RAM    | Temp | BBU    | Firmware
c0    | PERC H730P Mini | 2048MB | 55C  | Good   | FW: 25.5.9.0001

-- Array information --
-- ID | Type   |    Size |  Strpsz | Flags | DskCache |   Status |  OS Path | CacheCade |InProgress
c0u0  | RAID-5 |   7854G |   64 KB | RA,WB |  Enabled | Degraded | /dev/sda | None      |None

-- Disk information --
-- ID   | Type | Drive Model                               | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID
c0u0p0  | SSD  | PHDV7293005A960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 29C  | [32:0]   | 0
c0u0p1  | SSD  | PHDV729401VH960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 29C  | [32:1]   | 1
c0u0p2  | SSD  | PHDV729401VL960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 29C  | [32:2]   | 2
c0u0p3  | SSD  | PHDV7294025G960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 30C  | [32:3]   | 3
c0u0p4  | SSD  | PHDV72940259960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 30C  | [32:4]   | 4
c0u0p5  | SSD  | PHDV72940209960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 30C  | [32:5]   | 5
c0u0p6  | SSD  | PHDV7294025F960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 29C  | [32:6]   | 6
c0u0p7  | SSD  | Y7ES1072TBLTTHNSF8960CCSE DACB            | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 30C  | [32:7]   | 7
c0u0p8  | SSD  | PHDV729401WP960FGNSSDSC2BB960G7R N201DL43 | 893. Gb  | Online, Spun Up | 6.0Gb/s  | 29C  | [32:8]   | 8

Firefishy commented 3 months ago

I have reported the issue to our hosting provider.
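Comparing the two disk listings posted in this issue suggests that the absent member is not the same one each time. The sketch below diffs the drive identifiers by slot. The identifiers are copied from the listings, with the trailing firmware field dropped. `FIRST`, `SECOND`, and `compare` are hypothetical names, and the result is only as reliable as the pasted output.

```python
# Drive identifiers from the listing posted 1 year ago, keyed by slot.
FIRST = {
    0: "PHDV7293005A960FGNSSDSC2BB960G7R",
    1: "PHDV729401VH960FGNSSDSC2BB960G7R",
    2: "PHDV729401VL960FGNSSDSC2BB960G7R",
    3: "PHDV7294025G960FGNSSDSC2BB960G7R",
    4: "PHDV72940259960FGNSSDSC2BB960G7R",
    5: "PHDV72940209960FGNSSDSC2BB960G7R",
    6: "PHDV7294025F960FGNSSDSC2BB960G7R",
    7: "Y7ES1072TBLTTHNSF8960CCSE",
    9: "PHDV7294023T960FGNSSDSC2BB960G7R",
}

# Drive identifiers from the listing posted 3 months ago, keyed by slot.
SECOND = {
    0: "PHDV7293005A960FGNSSDSC2BB960G7R",
    1: "PHDV729401VH960FGNSSDSC2BB960G7R",
    2: "PHDV729401VL960FGNSSDSC2BB960G7R",
    3: "PHDV7294025G960FGNSSDSC2BB960G7R",
    4: "PHDV72940259960FGNSSDSC2BB960G7R",
    5: "PHDV72940209960FGNSSDSC2BB960G7R",
    6: "PHDV7294025F960FGNSSDSC2BB960G7R",
    7: "Y7ES1072TBLTTHNSF8960CCSE",
    8: "PHDV729401WP960FGNSSDSC2BB960G7R",
}


def compare(before, after):
    """Return (gone, new): drives only in `before`, and drives only in `after`."""
    gone = {s: d for s, d in before.items() if d not in after.values()}
    new = {s: d for s, d in after.items() if d not in before.values()}
    return gone, new


gone, new = compare(FIRST, SECOND)
print("no longer visible:", gone)
print("newly visible:", new)
```

With the data above, the drive in slot 9 is the one no longer visible, while the drive in slot 8, which was absent from the first listing, is visible again.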

Firefishy commented 1 month ago

necrosan briefly reported a read-only filesystem on Fri, 26 Jul 2024, 02:22.

The system was then rebooted but did not come back.

It is likely the RAID array has failed due to another disk dropping out.

I have contacted the hosting provider. If there is no response within a few days, it is probably best to decommission the machine.