thomas-krenn / check_lsi_raid

Monitoring plugin to check MegaRAID controllers
GNU General Public License v3.0

check_lsi_raid affects host I/O performance #42

Closed by BlackZork 3 weeks ago

BlackZork commented 2 months ago

01:00.0 RAID bus controller: Broadcom / LSI MegaRAID SAS-3 3108 [Invader] (rev 02)

storcli 007.3006.0000.0000-1 on ArchLinux (installed from AUR).

# /usr/lib/monitoring-plugins/check_lsi_raid -V
check_lsi_raid: Nagios/Icinga plugin to check LSI Raid Controller status
Version: 2.5
StorCli SAS Customization Utility Ver 007.3006.0000.0000 Apr 17, 2024

Icinga service definition:

apply Service for (name => config in host.vars.lsiraid) {
  import "generic-service"

  check_command = "lsi-raid"
  vars += config
}

object Host "myhost" {
  /* Import the default host template defined in `templates.conf`. */
  import "linux-server"

  address = "1.1.1.1"

  vars.lsiraid["LSI 3108"] = {
    lsi_ignored_other_errors = 9999999
    lsi_ignored_media_errors = 9999999
  }

  vars.lsiraid["RAID slot 1"] = {
    lsi_enclosure_id=1
    lsi_pd_id=0
    lsi_ignored_other_errors=8
  }

  [... and next 15 slots as above]
}

When the default 1-minute check interval was used, host I/O performance suffered dramatically. It looks like the controller stops some I/O operations when a storcli command is executed. I discovered this by looking for processes in the IO_WAIT state. The number of waiting processes increased when storcli was executed and I experienced slowdowns of various VMs and services hosted on my server.

As a workaround, I've set check_period=15m and added a TimePeriod with a 15-minute daily window, which forces Icinga2 to check the LSI controller only once a day.
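A workaround along these lines might look like the following minimal sketch. It assumes the intent is a 15-minute check_interval combined with a TimePeriod that is open for only 15 minutes per day, so that roughly one check lands inside the window. The TimePeriod name "lsi-raid-daily" and the 03:00-03:15 window are hypothetical and chosen only for illustration; they are not taken from the actual configuration.

```
/* Hypothetical TimePeriod: open for 15 minutes once per day. */
object TimePeriod "lsi-raid-daily" {
  import "legacy-timeperiod"

  display_name = "LSI RAID daily check window"
  ranges = {
    "monday"    = "03:00-03:15"
    "tuesday"   = "03:00-03:15"
    "wednesday" = "03:00-03:15"
    "thursday"  = "03:00-03:15"
    "friday"    = "03:00-03:15"
    "saturday"  = "03:00-03:15"
    "sunday"    = "03:00-03:15"
  }
}

apply Service for (name => config in host.vars.lsiraid) {
  import "generic-service"

  check_command = "lsi-raid"

  /* Assumption: checks are scheduled every 15 minutes, but only
     executed while the TimePeriod above is active. */
  check_interval = 15m
  check_period = "lsi-raid-daily"

  vars += config
}
```

Because the window and the interval have the same length, the scheduler should normally fit one check per service per day. Picking a window outside peak I/O hours keeps the storcli calls away from production load.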

I am aware that there is probably nothing you can do to fix this problem. I spent a lot of time trying to figure out what was causing the I/O problems on my host, so it may be worth adding a warning to this plugin's documentation for others.

gschoenberger commented 3 weeks ago

The 1-minute check interval might in fact not be a suitable option for this plugin! I think the heaviest operation is:

time /usr/local/bin/storcli /c0 show all
real    0m1,798s
user    0m0,034s
sys     0m0,038s

If I remember correctly, we had this issue when the plugin was running "adpallinfo".

Added a "Warning" to the README with commit e27cc73.