Failed to collect raid controller device S.M.A.R.T data

sdragon83 commented 1 year ago

I tried to collect data from a server with a raid controller through smartctl exporter.

However, an error occurred as below.

How can I collect S.M.A.R.T. data on RAID controller devices?

tomazb commented 1 year ago

Yes, this is the real reason why you need such a service in the first place - to monitor devices that are not easily visible inside the operating system.

marpears commented 1 year ago

If the device type could be retrieved and passed into the readSMARTctl function, it could be supplied via the --device flag, which would be a safer way to scan all device types. For example:

smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device megaraid,0 /dev/bus/1

josefzahner commented 1 year ago

@marpears I can read the device info with smartctl including the device option, but NOT with smartctl_exporter...

$ smartctl --json --info --health --attributes --tolerance=verypermissive --nocheck=standby --format=brief --device cciss,1 /dev/sdb
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      3
    ],
    "svn_revision": "5338",
    "platform_info": "x86_64-linux-3.10.0-957.27.2.el7.x86_64",
    "build_info": "(local build)",
    "argv": [
      "smartctl",
      "--json",
...

but this doesn't work:

$ smartctl_exporter --smartctl.device='cciss,1 /dev/sdb'
ts=2022-12-02T09:40:45.718Z caller=main.go:90 level=info msg="Starting smartctl_exporter" version="(version=0.9.1, branch=HEAD, revision=a58c632ea8fa0f4f10a9ac9e941e610a7bb2efc1)"
ts=2022-12-02T09:40:45.718Z caller=main.go:91 level=info msg="Build context" build_context="(go=go1.19.3, user=root@fa2a9a938fb5, date=20221106-21:46:18)"
ts=2022-12-02T09:40:45.735Z caller=main.go:112 level=warn msg="Device unavailable" name="cciss,1 /dev/sdb"
ts=2022-12-02T09:40:45.735Z caller=main.go:119 level=info msg="No devices specified, trying to load them automatically"
ts=2022-12-02T09:40:45.735Z caller=main.go:124 level=error msg="No devices found"

lahwaacz commented 1 year ago

@josefzahner The --smartctl.device flag in smartctl_exporter does not translate to the --device flag of smartctl. The exporter expects just the /dev/ node path. Also note that --device cciss,1 /dev/sdb is three distinct arguments on the command line; you can't pass all of that to --smartctl.device.
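
To illustrate the point about distinct arguments, here is a minimal Python sketch of how a wrapper would have to assemble the smartctl command line. It is not the exporter's actual Go code: the helper name `build_smartctl_argv` and its parameters are hypothetical, and only the smartctl flags already shown in this thread are used.

```python
from typing import List, Optional


def build_smartctl_argv(path: str, device_type: Optional[str] = None) -> List[str]:
    """Build an argv list for smartctl (hypothetical helper, for illustration only)."""
    argv = ["smartctl", "--json", "--info", "--health", "--attributes"]
    if device_type:
        # "--device" and its value are two separate argv entries.
        argv += ["--device", device_type]
    # The device node is always its own, final argument.
    argv.append(path)
    return argv


print(build_smartctl_argv("/dev/sdb", "cciss,1"))
```

Passing the single string 'cciss,1 /dev/sdb' as the path would instead produce one argv entry containing a space, which is presumably why the exporter reports the device as unavailable.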

kfox1111 commented 1 year ago

how does one configure cciss,1? I need to do it on some of my nodes and have not found a way yet.

anthonyeleven commented 1 year ago

This is a gating factor for me too. I've added comments to the above issue and linked PR.

jakubgs commented 1 year ago

This is also an issue for me. I guess a proper solution would involve adding a separate flag to provide extra flags for smartctl.

anthonyeleven commented 1 year ago

The tool should discover such HBAs and do so automagically at per-device granularity, since there can and will be a mixed population of direct-attach, passthrough, and hidden-by-VD drives on various systems and especially within a given system.

smartmon.sh for example does this:



for device in ${device_list}; do
  disk="$(echo ${device} | cut -f1 -d'|')"
  type="$(echo ${device} | cut -f2 -d'|')"
  active=1
  echo "smartctl_run{disk=\"${disk}\",type=\"${type}\"}" "$(TZ=UTC date '+%s')"
  # Check if the device is in a low-power mode
  $SMARTCTL -n standby -d "${type}" "${disk}" > /dev/null || active=0
  echo "device_active{disk=\"${disk}\",type=\"${type}\"}" "${active}"
  # Skip further metrics to prevent the disk from spinning up
  test ${active} -eq 0 && continue
  # Get the SMART information and health
  $SMARTCTL  -i -H -d "${type}" "${disk}" | parse_smartctl_info "${disk}" "${type}"
  # Get the SMART attributes
  case ${type} in
  sat) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
  sat+megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_attributes "${disk}" "${type}" ;;
  scsi) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
  nvme) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_nvme_attributes "${disk}" "${type}" ;;
  megaraid*) $SMARTCTL -A -d "${type}" "${disk}" | parse_smartctl_scsi_attributes "${disk}" "${type}" ;;
  *)
    echo "disk type is not sat, scsi or megaraid but ${type}"
    exit
    ;;
  esac
done | format_output

Mind you, I *despise* RoC HBAs and would just as soon never have one, or to set passthrough/JBOD on legacy systems, but walking into an existing deployment of thousands I don't have the luxury of greenfield.

anthonyeleven commented 1 year ago

@jakubgs It's more than just extra flags, it's discovery too.


/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
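
As a rough sketch of what per-device discovery could look like, the following Python snippet parses lines of the form shown above into (path, type) pairs. It assumes those lines are the plain-text output of a smartctl scan; the function name `parse_scan_output` and the regular expression are illustrative, not part of any existing tool.

```python
import re
from typing import List, Tuple

# Matches "<path> -d <type>" at the start of a line; anything after is a comment.
SCAN_LINE = re.compile(r"^(?P<path>\S+)\s+-d\s+(?P<type>\S+)")


def parse_scan_output(text: str) -> List[Tuple[str, str]]:
    """Return (device path, device type) pairs from scan-style lines (hypothetical helper)."""
    devices = []
    for line in text.splitlines():
        match = SCAN_LINE.match(line.strip())
        if match:
            devices.append((match.group("path"), match.group("type")))
    return devices


SAMPLE = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
"""

for path, dev_type in parse_scan_output(SAMPLE):
    print(path, dev_type)
```

Note that the path alone is not unique here (/dev/bus/0 appears twice), so any discovery code would need to key devices on the path and type together.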

I no longer have HP HBAs, but it would be polite for however this is done to be architected in such a way that they could be supported later.

I hope to sunset RoC VDs through attrition, but that will take years :-/

kfox1111 commented 1 year ago

Any way to do this yet?

anthonyeleven commented 1 year ago

I’d do it myself if I had the coding skills. It really is a fatal flaw. Mind you HBA RAID is itself a fatal flaw but Dell’s BOSS-N1 is too useful, though one has to invoke ‘mvcli’ to get status.

jakubgs commented 11 months ago

I did a bit of research into this and found out that these devices can be found with smartctl by using -d scsi:

 > smartctl --json --scan | jq -c '.devices[] | { name, protocol }'         
jq: error (at <stdin>:21): Cannot iterate over null (null)

 > smartctl --json --scan --device scsi | jq -c '.devices[] | { name, protocol }'
{"name":"/dev/sda","protocol":"SCSI"}
{"name":"/dev/sdb","protocol":"SCSI"}
{"name":"/dev/sdc","protocol":"SCSI"}

But there might be an even better way to identify those devices, and that is lsblk:

 > lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/sda","hctl":"0:1:0:0","subsystems":"block:scsi:pci"}
{"path":"/dev/sdb","hctl":"0:1:0:1","subsystems":"block:scsi:pci"}
{"path":"/dev/sdc","hctl":"0:1:0:2","subsystems":"block:scsi:pci"}

As we can see, the hctl field informs us what number to use for --device cciss,N, and subsystems informs us that scsi is being used, which together can be a pretty reliable heuristic for detecting an HBA.

And on a different host without an HBA:

 > lsblk --json -O | jq -c '.blockdevices[] | { path, hctl, subsystems }'
{"path":"/dev/nvme0n1","hctl":null,"subsystems":"block:nvme:pci"}
{"path":"/dev/nvme1n1","hctl":null,"subsystems":"block:nvme:pci"}

I don't know what maintainers would think about using a tool other than smartctl for discovery, but lsblk is a pretty standard tool available on most systems, and we could still fall back to smartctl if it is unavailable.
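
A minimal Python sketch of the lsblk heuristic might look like the following. Two things here are assumptions rather than anything confirmed in this thread: that the last field of hctl (the LUN) is the N in cciss,N, and the helper names `guess_cciss_index` and `classify`. The sample input wraps the lsblk lines quoted above in the "blockdevices" array that lsblk --json emits.

```python
import json
from typing import Dict, List, Optional


def guess_cciss_index(hctl: Optional[str]) -> Optional[int]:
    """Return the last H:C:T:L field as a candidate cciss,N index (assumption)."""
    if not hctl:
        return None
    parts = hctl.split(":")
    if len(parts) != 4 or not parts[3].isdigit():
        return None
    return int(parts[3])


def classify(lsblk_json: str) -> List[Dict[str, object]]:
    """Flag block devices that sit behind a SCSI subsystem (hypothetical helper)."""
    result = []
    for dev in json.loads(lsblk_json).get("blockdevices", []):
        subsystems = (dev.get("subsystems") or "").split(":")
        behind_scsi = "scsi" in subsystems
        result.append({
            "path": dev.get("path"),
            "scsi": behind_scsi,
            "cciss_index": guess_cciss_index(dev.get("hctl")) if behind_scsi else None,
        })
    return result


SAMPLE = """{"blockdevices": [
  {"path": "/dev/sdb", "hctl": "0:1:0:1", "subsystems": "block:scsi:pci"},
  {"path": "/dev/nvme0n1", "hctl": null, "subsystems": "block:nvme:pci"}
]}"""

for entry in classify(SAMPLE):
    print(entry)
```

One caveat: on many Linux systems directly attached SATA disks also report a scsi subsystem, so this check alone would likely flag them too and would need a further test, such as the controller's driver, before choosing a --device type.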

I'm going to read through the code a bit to see how difficult this would be.

jakubgs commented 11 months ago

Main issue as far as I can tell is that even if you discover the devices, often you won't get much info from them:

{
  "json_format_version": [1, 0],
  "smartctl": {
    "version": [7, 2],
    "svn_revision": "5155",
    "platform_info": "x86_64-linux-5.15.0-79-generic",
    "build_info": "(local build)",
    "argv": ["smartctl", "-A", "--device", "cciss,1", "/dev/sdb", "--json"],
    "exit_status": 0
  },
  "device": {
    "name": "/dev/sdb",
    "info_name": "/dev/sdb [cciss_disk_01] [SCSI]",
    "type": "cciss",
    "protocol": "SCSI"
  },
  "temperature": {
    "current": 21,
    "drive_trip": 70
  },
  "power_on_time": {
    "hours": 47138,
    "minutes": 5
  },
  "scsi_grown_defect_list": 0
}

Temperature and power-on time... not great.

anthonyeleven commented 11 months ago

Better than nothing, but yeah. I haven't had an HP HBA to work with for years, but re the scsi factor above, is the subject drive SAS? I would not be surprised if this would not surface SATA (but it might).

anthonyeleven commented 11 months ago

I'm increasingly leaning toward having a protege write a SMART harvester from scratch in Python, which would make it easier to normalize the vagaries of data that smartctl gives us. Then redirect the output into a file and let node_exporter's textfile collector snarf it up.
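
For what it's worth, a very small sketch of that idea could look like this. It takes smartctl JSON shaped like the cciss output quoted earlier in this thread and turns the two fields that are present into Prometheus text-format lines. The metric names and the `to_textfile_lines` helper are made up for illustration and do not match any existing exporter.

```python
import json
from typing import List


def metric(name: str, labels: str, value: object) -> str:
    """Format one sample in the Prometheus text exposition format."""
    return f"{name}{{{labels}}} {value}"


def to_textfile_lines(smartctl_json: str) -> List[str]:
    """Convert one smartctl --json document into metric lines (hypothetical helper)."""
    data = json.loads(smartctl_json)
    device = data.get("device", {})
    labels = 'disk="{}",type="{}"'.format(device.get("name", ""), device.get("type", ""))
    lines = []
    # Only emit metrics for fields the controller actually exposed.
    temperature = data.get("temperature", {}).get("current")
    if temperature is not None:
        lines.append(metric("smart_temperature_celsius", labels, temperature))
    hours = data.get("power_on_time", {}).get("hours")
    if hours is not None:
        lines.append(metric("smart_power_on_hours", labels, hours))
    return lines


SAMPLE = """{
  "device": {"name": "/dev/sdb", "type": "cciss", "protocol": "SCSI"},
  "temperature": {"current": 21, "drive_trip": 70},
  "power_on_time": {"hours": 47138, "minutes": 5}
}"""

print("\n".join(to_textfile_lines(SAMPLE)))
```

The printed lines would then be redirected into a .prom file in the directory that node_exporter's textfile collector watches, ideally by writing to a temporary file and renaming it so the collector never reads a half-written file.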

jakubgs commented 11 months ago

Personally I'd rather fix what we have working than try from scratch. I'm busy enough dealing with what I already have working; I don't have the time to reinvent wheels. Even if I get just temp and power-on hours, that's better than the deployed SMART exporter just failing at startup and Prometheus returning alerts for the downed service.

jakubgs commented 11 months ago

But your point about SATA/SAS is well made. I will have to check how that is done on my servers.

prometheus-community / smartctl_exporter

Failed to collect raid controller device S.M.A.R.T data #89