librenms / librenms-agent

LibreNMS Agent & Scripts
GNU General Public License v2.0

mdadm script fails with Intel Matrix Storage RAID #521

Open shpokas opened 1 month ago

shpokas commented 1 month ago

Oops, looks like I opened another issue in the wrong place.

Now creating another one here; please feel free to close the duplicate.

The problem

I am trying to set up mdadm application monitoring. The mdadm script fails with: /etc/snmp/mdadm /etc/snmp/mdadm: line 53: (2 - ): syntax error: operand expected (error token is ")")

A debug run of the same script is attached: mdadm-debug-run.log
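For context, the "(2 - ): syntax error: operand expected" message is what bash emits when an arithmetic expression expands with an empty right-hand operand, e.g. when a sysfs attribute does not exist for an inactive array. This is a hypothetical minimal reproduction, not the actual script code; the variable names are illustrative:

```shell
#!/usr/bin/env bash
# Hypothetical reproduction of the failure mode: if a sysfs value is empty,
# $(( $raid_disks - $missing )) expands to $(( 2 -  )), which is exactly the
# "operand expected (error token is \")\")" error reported at line 53.
raid_disks=2
missing=""    # e.g. a sysfs attribute absent on an inactive array

# Evaluate in a subshell so the arithmetic syntax error does not kill the
# script; fall back to 0 when the expansion fails.
if ! delta=$( { echo $(( $raid_disks - $missing )); } 2>/dev/null ); then
    delta=0
    echo "arithmetic failed: operand expected"
fi
echo "delta=${delta}"
```

Note that unquoted expansion inside `$(( ))` is what makes the empty value disappear; guarding the read (or skipping inactive arrays entirely, as in the patch below in this thread) avoids the error.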

[librenms@monitoring ~]$ ./validate.php 
===========================================
Component | Version
--------- | -------
LibreNMS  | 24.5.0-14-gc777d5429 (2024-05-29T22:42:05+03:00)
DB Schema | 2024_04_29_183605_custom_maps_drop_background_suffix_and_background_version (294)
PHP       | 8.3.7
Python    | 3.9.18
Database  | MariaDB 11.3.2-MariaDB
RRDTool   | 1.7.2
SNMP      | 5.9.1
===========================================

[OK]    Composer Version: 2.7.6
[OK]    Dependencies up-to-date.
[OK]    Database connection successful
[OK]    Database Schema is current
[OK]    SQL Server meets minimum requirements
[OK]    lower_case_table_names is enabled
[OK]    MySQL engine is optimal
[OK]    Database and column collations are correct
[OK]    Database schema correct
[OK]    MySQL and PHP time match
[OK]    Active pollers found
[OK]    Dispatcher Service not detected
[OK]    Locks are functional
[OK]    Python poller wrapper is polling
[OK]    Redis is unavailable
[OK]    rrdtool version ok
[OK]    Connected to rrdcached
[librenms@monitoring ~]$

The monitored host has four disks in two software RAID arrays. Two 512GB SATA SSDs are configured in RAID1 by Intel Matrix Storage RAID in EFI. Two 1TB NVMe disks are configured in RAID1 by the operating system.

The mdadm configuration, lshw output, and lsblk output are included below. Thanks for looking into this.


cat /etc/mdadm.conf

# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md/imsm UUID=f14c58a0:f20c0ab1:ddadb5be:dc8ddb75
ARRAY /dev/md/0 metadata=1.2 UUID=60a657e6:bb76e556:b27a83db:734a3edc name=<REMOVED>
ARRAY /dev/md/Volume0 container=f14c58a0:f20c0ab1:ddadb5be:dc8ddb75 member=0 UUID=bf4cc729:5d9bf0ae:ce01b1bf:46fcb1f4
-------------------------------------------------------

cat /proc/mdstat

Personalities : [raid1] 
md0 : active raid1 nvme0n1[2] nvme1n1[1]
      976630464 blocks super 1.2 [2/2] [UU]
      bitmap: 3/8 pages [12KB], 65536KB chunk

md126 : active raid1 sda[1] sdb[0]
      475099136 blocks super external:/md127/0 [2/2] [UU]

md127 : inactive sda[1](S) sdb[0](S)
      10402 blocks super external:imsm

unused devices: <none>

mdadm --examine /dev/md/imsm

/dev/md/imsm:
          Magic : Intel Raid ISM Cfg Sig.
        Version : 1.1.00
    Orig Family : 3690cb86
         Family : 3690cb86
     Generation : 000ed44c
  Creation Time : Thu Jan 18 14:13:41 2024
     Attributes : All supported
           UUID : f14c58a0:f20c0ab1:ddadb5be:dc8ddb75
       Checksum : 99b796dd correct
    MPB Sectors : 1
          Disks : 2
   RAID Devices : 1

  Disk01 Serial : S42YNE0M500509N
          State : active
             Id : 00000001
    Usable Size : 1000204814 (476.93 GiB 512.10 GB)

[Volume0]:
       Subarray : 0
           UUID : bf4cc729:5d9bf0ae:ce01b1bf:46fcb1f4
     RAID Level : 1
        Members : 2
          Slots : [UU]
    Failed disk : none
      This Slot : 1
    Sector Size : 512
     Array Size : 950198272 (453.09 GiB 486.50 GB)
   Per Dev Size : 950200320 (453.09 GiB 486.50 GB)
  Sector Offset : 0
    Num Stripes : 3711712
     Chunk Size : 64 KiB
       Reserved : 0
  Migrate State : idle
      Map State : normal
    Dirty State : clean
     RWH Policy : off
      Volume ID : 1

  Disk00 Serial : S42YNE0M500504E
          State : active
             Id : 00000000
    Usable Size : 1000204814 (476.93 GiB 512.10 GB)

mdadm --examine /dev/md/0

/dev/md/0:
   MBR Magic : aa55
Partition[0] :   1953260927 sectors at            1 (type ee)

mdadm --examine /dev/md/Volume0

/dev/md/Volume0:
   MBR Magic : aa55
Partition[0] :    950198271 sectors at            1 (type ee)

lshw -class disk -class storage

  *-raid                    
       description: RAID bus controller
       product: SATA Controller [RAID Mode]
       vendor: Intel Corporation
       physical id: 17
       bus info: pci@0000:00:17.0
       logical name: scsi0
       logical name: scsi1
       logical name: scsi5
       version: 00
       width: 32 bits
       clock: 66MHz
       capabilities: raid msi pm bus_master cap_list emulated
       configuration: driver=ahci latency=0
       resources: irq:37 memory:aaf24000-aaf25fff memory:aaf27000-aaf270ff ioport:3050(size=8) ioport:3040(size=4) ioport:3020(size=32) memory:aaf26000-aaf267ff
     *-disk:0
          description: ATA Disk
          product: Samsung SSD 860
          physical id: 0
          bus info: scsi@0:0.0.0
          logical name: /dev/sda
          version: 1B6Q
          serial: S42YNE0M500504E
          size: 476GiB (512GB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: ansiversion=5 guid=43866b72-683b-4ac0-887f-94fdb01937e4 logicalsectorsize=512 sectorsize=512
     *-disk:1
          description: ATA Disk
          product: Samsung SSD 860
          physical id: 1
          bus info: scsi@1:0.0.0
          logical name: /dev/sdb
          version: 1B6Q
          serial: S42YNE0M500509N
          size: 476GiB (512GB)
          capabilities: gpt-1.00 partitioned partitioned:gpt
          configuration: ansiversion=5 guid=43866b72-683b-4ac0-887f-94fdb01937e4 logicalsectorsize=512 sectorsize=512
     *-cdrom
          description: DVD-RAM writer
          product: DVD-RAM GHC0N
          vendor: HL-DT-ST
          physical id: 0.0.0
          bus info: scsi@5:0.0.0
          logical name: /dev/cdrom
          logical name: /dev/sr0
          version: MA02
          capabilities: removable audio cd-r cd-rw dvd dvd-r dvd-ram
          configuration: ansiversion=5 status=nodisc
  *-nvme
       description: NVMe device
       product: Samsung SSD 980 PRO with Heatsink 1TB
       vendor: Samsung Electronics Co Ltd
       physical id: 0
       bus info: pci@0000:17:00.0
       logical name: /dev/nvme0
       version: 5B2QGXA7
       serial: S6WSNJ0W122428R
       width: 64 bits
       clock: 33MHz
       capabilities: nvme pm msi pciexpress msix nvm_express bus_master cap_list
       configuration: driver=nvme latency=0 nqn=nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6WSNJ0W122428R state=live
       resources: irq:38 memory:c5e00000-c5e03fff
     *-namespace:0
          description: NVMe disk
          physical id: 0
          logical name: /dev/ng0n1
     *-namespace:1
          description: NVMe disk
          physical id: 1
          bus info: nvme@0:1
          logical name: /dev/nvme0n1
          size: 931GiB (1TB)
          configuration: logicalsectorsize=512 sectorsize=512 wwid=eui.002538b131408bde
  *-nvme
       description: NVMe device
       product: Samsung SSD 980 PRO with Heatsink 1TB
       vendor: Samsung Electronics Co Ltd
       physical id: 0
       bus info: pci@0000:18:00.0
       logical name: /dev/nvme1
       version: 5B2QGXA7
       serial: S6WSNS0W404164V
       width: 64 bits
       clock: 33MHz
       capabilities: nvme pm msi pciexpress msix nvm_express bus_master cap_list
       configuration: driver=nvme latency=0 nqn=nqn.1994-11.com.samsung:nvme:980PRO:M.2:S6WSNS0W404164V state=live
       resources: irq:40 memory:c5d00000-c5d03fff
     *-namespace:0
          description: NVMe disk
          physical id: 0
          logical name: /dev/ng1n1
     *-namespace:1
          description: NVMe disk
          physical id: 1
          bus info: nvme@1:1
          logical name: /dev/nvme1n1
          size: 931GiB (1TB)
          capabilities: partitioned partitioned:dos
          configuration: logicalsectorsize=512 sectorsize=512 signature=4751c961 wwid=eui.002538b431402d4c

lsblk

NAME                    MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda                       8:0    0 476.9G  0 disk  
├─md126                   9:126  0 453.1G  0 raid1 
│ ├─md126p1             259:3    0   600M  0 part  /boot/efi
│ ├─md126p2             259:4    0     1G  0 part  /boot
│ └─md126p3             259:5    0 375.5G  0 part  
│   ├─almalinux-root    253:0    0    70G  0 lvm   /var/named/chroot/usr/share/GeoIP
│   │                                              /var/named/chroot/usr/lib64/named
│   │                                              /var/named/chroot/usr/lib64/bind
│   │                                              /var/named/chroot/etc/named
│   │                                              /var/named/chroot/etc/services
│   │                                              /var/named/chroot/etc/protocols
│   │                                              /var/named/chroot/etc/crypto-policies/back-ends/bind.config
│   │                                              /var/named/chroot/etc/rndc.key
│   │                                              /var/named/chroot/etc/named.rfc1912.zones
│   │                                              /var/named/chroot/etc/named.conf
│   │                                              /var/named/chroot/etc/named.root.key
│   │                                              /var/named/chroot/etc/localtime
│   │                                              /
│   ├─almalinux-swap    253:1    0  15.5G  0 lvm   [SWAP]
│   ├─almalinux-var_tmp 253:2    0    10G  0 lvm   /var/tmp
│   ├─almalinux-var     253:3    0    40G  0 lvm   /var/named/chroot/var/named
│   │                                              /var
│   ├─almalinux-home    253:4    0    30G  0 lvm   /home
│   └─almalinux-tmp     253:5    0    10G  0 lvm   /tmp
└─md127                   9:127  0     0B  0 md    
sdb                       8:16   0 476.9G  0 disk  
├─md126                   9:126  0 453.1G  0 raid1 
│ ├─md126p1             259:3    0   600M  0 part  /boot/efi
│ ├─md126p2             259:4    0     1G  0 part  /boot
│ └─md126p3             259:5    0 375.5G  0 part  
│   ├─almalinux-root    253:0    0    70G  0 lvm   /var/named/chroot/usr/share/GeoIP
│   │                                              /var/named/chroot/usr/lib64/named
│   │                                              /var/named/chroot/usr/lib64/bind
│   │                                              /var/named/chroot/etc/named
│   │                                              /var/named/chroot/etc/services
│   │                                              /var/named/chroot/etc/protocols
│   │                                              /var/named/chroot/etc/crypto-policies/back-ends/bind.config
│   │                                              /var/named/chroot/etc/rndc.key
│   │                                              /var/named/chroot/etc/named.rfc1912.zones
│   │                                              /var/named/chroot/etc/named.conf
│   │                                              /var/named/chroot/etc/named.root.key
│   │                                              /var/named/chroot/etc/localtime
│   │                                              /
│   ├─almalinux-swap    253:1    0  15.5G  0 lvm   [SWAP]
│   ├─almalinux-var_tmp 253:2    0    10G  0 lvm   /var/tmp
│   ├─almalinux-var     253:3    0    40G  0 lvm   /var/named/chroot/var/named
│   │                                              /var
│   ├─almalinux-home    253:4    0    30G  0 lvm   /home
│   └─almalinux-tmp     253:5    0    10G  0 lvm   /tmp
└─md127                   9:127  0     0B  0 md    
sr0                      11:0    1  1024M  0 rom   
nvme0n1                 259:0    0 931.5G  0 disk  
└─md0                     9:0    0 931.4G  0 raid1 
  └─md0p1               259:2    0 931.4G  0 part  /vms
nvme1n1                 259:1    0 931.5G  0 disk  
└─md0                     9:0    0 931.4G  0 raid1 
  └─md0p1               259:2    0 931.4G  0 part  /vms
shpokas commented 1 month ago

This patch suits my needs:

--- mdadm.orig  2024-05-31 08:52:52.153777132 +0300
+++ mdadm   2024-05-31 11:25:44.772243743 +0300
@@ -40,6 +40,10 @@
             [[ "${mdadmArray}" =~ '/dev/md'[[:digit:]]+'p' ]] && continue

             mdadmName="$(basename "$(realpath "${mdadmArray}")")"
+
+            # Ignore inactive arrays
+            [[ $(grep "^${mdadmName}" /proc/mdstat) =~ 'inactive' ]] && continue
+
             mdadmSysDev="/sys/block/${mdadmName}"

             degraded=$(maybe_get "${mdadmSysDev}/md/degraded")
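As a standalone sanity check of the guard added in the patch, the same test can be run against the /proc/mdstat excerpt pasted earlier in this issue (the mdstat content is inlined here rather than read from /proc, so this is a sketch, not the script itself):

```shell
#!/usr/bin/env bash
# Exercise the patch's inactive-array guard against the mdstat lines
# from this issue. md127 (the IMSM container) should be skipped.
mdstat='md0 : active raid1 nvme0n1[2] nvme1n1[1]
md126 : active raid1 sda[1] sdb[0]
md127 : inactive sda[1](S) sdb[0](S)'

for mdadmName in md0 md126 md127; do
    # Same check as the patch, reading the inlined text above.
    if [[ $(grep "^${mdadmName}" <<< "$mdstat") =~ 'inactive' ]]; then
        echo "${mdadmName}: skipped (inactive)"
        continue
    fi
    echo "${mdadmName}: processed"
done
```

With this host's arrays, md0 and md126 are processed and md127 is skipped, which matches the intent: the inactive IMSM container has no per-array sysfs attributes, so skipping it avoids the empty-operand arithmetic error.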
VVelox commented 1 month ago

@shpokas Thanks. Tested that and it does not appear to break anything. What does /proc/mdstat look like there?

There are likely some other things that need to be cleaned up as well.

VVelox commented 1 month ago

Derp! Sorry, missed that — you already posted it. Thanks!

VVelox commented 1 month ago

@shpokas

Currently pondering the best way to handle inactive arrays. Could you post the output of the following?

ls /sys/block/md0/slaves/
ls -l /sys/block/md0/
cat /sys/block/md0/md/level
cat /sys/block/md0/md/raid_disks
VVelox commented 1 month ago

Sorry, those commands should be for md127.

Currently thinking of just adding a counter for inactive arrays.
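The counter idea could look roughly like this — a sketch only, with illustrative variable names and the mdstat sample from this issue inlined in place of reading /proc/mdstat:

```shell
#!/usr/bin/env bash
# Sketch of the "counter for inactive" idea: instead of silently skipping
# inactive arrays, tally them so the count can be reported alongside the
# other mdadm stats. Not the actual script; names are assumptions.
mdstat='Personalities : [raid1]
md0 : active raid1 nvme0n1[2] nvme1n1[1]
md126 : active raid1 sda[1] sdb[0]
md127 : inactive sda[1](S) sdb[0](S)'

inactiveCount=0
activeArrays=()
while read -r name _ state _; do
    [[ "$name" == md* ]] || continue    # keep only the array status lines
    if [[ "$state" == "inactive" ]]; then
        inactiveCount=$(( inactiveCount + 1 ))
    else
        activeArrays+=("$name")
    fi
done <<< "$mdstat"

echo "inactive=${inactiveCount} active=${activeArrays[*]}"
```

A real implementation would read /proc/mdstat directly and fold `inactiveCount` into the script's JSON/SNMP output; the point is just that inactive containers like md127 become a reported number rather than a source of arithmetic errors.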