smartmon scripts: quote in value

stefangweichinger commented 1 year ago

I fiddle with extracting data out of smartctl.

A disk gives me this:

smartmon_device_info{device="/dev/bus/0",disk="megaraid,5",model_family="Toshiba 3.5\" DT01ACA... Desktop HDD",device_model="TOSHIBA DT01ACA100",serial_number="75NHT0MNS",firmware_version="MS2OA7C0"} 1

The problem is in

model_family="Toshiba 3.5\" DT01ACA... Desktop HDD"

There's an escaped quote after "3.5", this leads to issues with the text exporter. I browsed for various forks and versions of smartmon.sh and smartmon.py.

I currently seem to have the extraction fixed, but I seem to have some problematic data lines /metrics in prometheus. Sorry for being a bit offtopic: how can I delete these smartmon-related time series without deleting too much? I need to clean this up before I can dig further and check if the scripts work now.

I don't know if having quotes in the SMART values is OK.

stefangweichinger commented 1 year ago

Is smartmon.py or smartmon.sh "better"?

stefangweichinger commented 1 year ago

deleting

Tried this:

# curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__="smartmon_device_info", device="/dev/bus/0",  instance="that.server.tld:9100", job="node", model_family="Toshiba 3.5" DT01ACA... Desktop HDD"}'

Not yet fully successful ...

dswarbrick commented 1 year ago

There is discussion in #119 to replace both existing smartmon scripts with a new Python script and parse the JSON output of smartctl, which would likely avoid issues such as this.

stefangweichinger commented 1 year ago

ah, I see. But no direct solution available. I also find https://github.com/prometheus-community/smartctl_exporter ... doesn't work very well either.

dswarbrick commented 1 year ago

@stefangweichinger I am working on an interim refactor of the existing smartmon.py to use the official Prometheus client_python library for rendering the metrics. This will solve any quoting issues.

However, since the homegrown metric printing functions in smartmon.py take some liberties with with the validity of the metrics format, it will take a little while.

stefangweichinger commented 1 year ago

@stefangweichinger I am working on an interim refactor of the existing smartmon.py to use the official Prometheus client_python library for rendering the metrics. This will solve any quoting issues.

However, since the homegrown metric printing functions in smartmon.py take some liberties with with the validity of the metrics format, it will take a little while.

Wow, sounds good. Good luck with this. I will try to come up with some ugly workaround for those Toshiba disks (hopefully).

dswarbrick commented 1 year ago

@stefangweichinger Please test smartmon.py from the smartmon.py-registry branch of this repo.

The original code made heavy use of generators, which was only possible due to the fact that it took some shortcuts with label consistency. When using the official Prometheus client libraries, a metric's label names need to be declared in advance. If a particular data point has no value for a certain label, it should be set to empty string. All labels are rendered by the official client libraries, even when they are empty string.

However, most homegrown metric rendering code that I've seen ignores this completely, and data points often have variable label sets. Node_exporter tolerates this when reading the .prom files from disk, and ensures consistent label sets when exposing them to a Prometheus scrape.

stefangweichinger commented 1 year ago

@dswarbrick thanks, will try asap

stefangweichinger commented 1 year ago

@dswarbrick The script works fine so far, I get data/metrics. So far no graphs in the dashboard, I think I use some dashboard that uses wrong queries. For example it seems to mix up "device" and "disk". Do you have a dashboard that you know this new script should work with? Thank you!

Ah, btw: that one disk is parsed like this:

smartmon_device_info{device="/dev/bus/0",device_model="TOSHIBA DT01ACA100",disk="megaraid,5",firmware_version="MS2OA7C0",lun_id="",model_family="Toshiba 3.5\" DT01ACA... Desktop HDD",product="",revision="",serial_number="75NHT0MNS",vendor=""} 1.0

So that quote is escaped inside the field. That might be OK for prometheus and Grafana, I assume.

dswarbrick commented 1 year ago

@stefangweichinger Sorry, I can't advise you on which dashboard is best to use. I usually make my own dashboards, and just use the community ones for inspiration occasionally. I suspect that you might not have the best success with community dashboards since your disks are behind a MegaRAID controller. The dashboard that you're using might be geared more towards consumer-grade setups, i.e. expects disks to be typical /dev/sda, /dev/sdb etc.

The escaped double-quote should be fine (and in fact the original script should have been doing that anyway) - see the string replace in:

    labels = ','.join(
        '{k}="{v}"'.format(k=k, v=v.replace('"', '\\"')) for k, v in metric.labels.items())

stefangweichinger commented 1 year ago

@dswarbrick ok .. I have to reduce the moving parts here. I decide to use your latest script to collect data, the prom-files look ok to me, scraping/exporting seems to work, only the displaying and querying seems to be a problem right now. I will try to completely get rid of all smartmon-related time series in prometheus now, to remove these problematic entries from that Toshiba-disk. And then I will try to decide which dashboard to use. I had to edit the variables in one of them, that one used a outdated(?) field in the metrics.

stefangweichinger commented 1 year ago

@dswarbrick

For example this query:

avg(smartmon_temperature_celsius_raw_value{instance=~"$instance",disk=~"$disk",type=~"$type",serial_number=~"$serial_number",model_family=~"$model_family",device_model=~"$device_model"}) by (device_model,serial_number)

I only get graphs for one out of three servers currently. I will make sure to use the same script everywhere and cleanup things.

stefangweichinger commented 1 year ago

and yes: disks behind that megaraid-hardware gives additional headache

dswarbrick commented 1 year ago

Perhaps I can offer a little more clarification, as I have used this collector in the past at a previous job, with a mixture of simple HBA (or onboard SATA) setups, as well as MegaRAID controllers.

With onboard SATA controllers or HBAs that expose each disk to the OS as individual drives, the device label will be the block device name that you're familiar with, e.g. /dev/sda, /dev/sdb etc, and the disk label should be 0.

However, consider a MegaRAID controller with multiple disks attached, and e.g. a four-disk RAID5 array. The RAID5 virtual drive may be exposed to the OS as e.g. /dev/sdb, but smartctl needs a way to address the individual physical disks that make up that virtual drive. In such a case, the device label refers to the MegaRAID controller, and the disk label uses a vendor-specific format to identify physical disks on that controller.

If you had multiple MegaRAID controllers, the device labels would have values /dev/bus/0, /dev/bus/1, etc. The disk labels would contain potentially overlapping IDs - however the combination of device and disk label should still uniquely identify a single physical disk.

If you're using a dashboard that primarily focuses on the device label, that will likely only work for onboard SATA & HBA setups. For MegaRAID controllers, the dashboard will need to take into account both the device and the disk labels.

stefangweichinger commented 1 year ago

@dswarbrick thanks a lot, yes, understood.

I managed to write a shell script that removes all metrics starting with "smartmon" from prometheus. Wrote an ansible playbook to deploy your smartmon.py with working systemd-units (.service, .timer).

So I should start from scratch regarding these metrics.

Data is scraped, but in Grafana I still get stuff like:

bad_data: invalid parameter "query": 1:870: parse error: error parsing regexp: missing closing ): `^(?:(Micron 5100 Pro / 52x0 / 5300 SSDs|N/A|Samsung based SSDs|Seagate Barracuda 7200\.14 \(AF\)|Toshiba 3\.5)$`

Disabled scraping on that backup server with the Toshiba disk and remove metrics again to maybe get closer. Still the ugly errors in Grafana, maybe that caches something I have to wait for (yes, reopened the dashboard).

And only getting data from one server (instead of 2 enabled, 1 disabled). fixed by python-library ...

Grepping the prom file from your smartmon.py it looks as if I should use this metric for my needs:

smartmon_attr_raw_value{device="/dev/bus/0",disk="megaraid,2",name="temperature_celsius"} 21.0

correct?

In my dashboards I see queries with other metrics, I have to get rid of those (although it looks tempting to be able to build on their queries and not having to start from scratch).

Sorry for getting slightly offtopic somehow.

dswarbrick commented 1 year ago

@stefangweichinger Are you writing that regex by hand yourself? If so it does indeed have a missing closing parenthesis. This should work:

^(?:(Micron 5100 Pro / 52x0 / 5300 SSDs|N/A|Samsung based SSDs|Seagate Barracuda 7200\.14 \(AF\)|Toshiba 3\.5))$

You might need to escape the forward-slashes also, since they are delimiters in JavaScript regexes.

If on the other hand that regex has been generated by Grafana (e.g. as a result of a multiple-select drop-down list), that smells like a bug.

You don't specify what exactly your needs are, but if you are wanting to graph the drive temperature, then yes, that metric is a good starting point.

The nature of time-series databases such as Prometheus means that all mistakes are eventually forgotten. If it doesn't bother you too much, just wait for the buggy metrics to age out. It's much less hassle than trying to delete series.

stefangweichinger commented 1 year ago

@dswarbrick no that regex comes from the query in the dashboard, I assume.

My needs: basically I'd be happy with some information as given by "megaclisas-status":

# megaclisas-status 
-- Controller information --
-- ID | H/W Model                          | RAM    | Temp | BBU    | Firmware     
c0    | RAID Ctrl SAS 6G 5/6 512MB (D2616) | 512MB  | N/A  | Good   | FW: 12.12.0-0174 

-- Array information --
-- ID | Type    |    Size |  Strpsz |   Flags | DskCache |   Status |  OS Path | CacheCade |InProgress   
c0u0  | RAID-1  |    279G |   64 KB | ADRA,WB |  Enabled |  Optimal | /dev/sda | None      |None         
c0u1  | RAID-1  |    558G |   64 KB | ADRA,WB |  Enabled |  Optimal | /dev/sdb | None      |None         
c0u2  | RAID-10 |   1635G |   64 KB | ADRA,WB |  Enabled |  Optimal | /dev/sdc | None      |None         

-- Disk information --
-- ID     | Type | Drive Model                            | Size     | Status          | Speed    | Temp | Slot ID  | LSI ID  
c0u0p0    | HDD  | TOSHIBA MBF2300RC 5212EB07PD80AAKH     | 278.8 Gb | Online, Spun Up | 6.0Gb/s  | 32C  | [252:0]  | 5       
c0u0p1    | HDD  | TOSHIBA MBF2300RC 5212EB07PD90APEP     | 278.8 Gb | Online, Spun Up | 6.0Gb/s  | 31C  | [252:4]  | 4       
c0u1p0    | HDD  | WD WD6001BKHG-50D22SFX9WD-WXD1E63RVKV2 | 558.4 Gb | Online, Spun Up | 6.0Gb/s  | 30C  | [252:1]  | 7       
c0u1p1    | HDD  | WD WD6001BKHG-50D22SFX9WD-WX11E4390674 | 558.4 Gb | Online, Spun Up | 6.0Gb/s  | 30C  | [252:5]  | 8       
c0u2s0p0  | HDD  | WD WD9001BKHG-50D22SFX9WD-WXD1E63SWTY4 | 837.8 Gb | Online, Spun Up | 6.0Gb/s  | 32C  | [252:2]  | 6       
c0u2s0p1  | HDD  | TOSHIBA AL14SEB090N 5203Y7G0A00BFSEC   | 837.8 Gb | Online, Spun Up | Unknown  | 31C  | [252:7]  | 11      
c0u2s1p0  | HDD  | TOSHIBA AL14SEB090N 5203Y7H0A01JFSEC   | 837.8 Gb | Online, Spun Up | Unknown  | 31C  | [252:3]  | 10      
c0u2s1p1  | HDD  | TOSHIBA AL14SEB090N 52037870A04WFSEC   | 837.8 Gb | Online, Spun Up | Unknown  | 31C  | [252:6]  | 12

I want

temperatures of the physical drives
temperature of controller (if available)
status of arrays

The rest would be nice, but not needed. Basically I want some alerts on the temps, done.

I did another look at storcli.py but that doesn't give me pd temps, as noticed a few days ago.

Maybe I run circles here, definitely losing track of which route to follow.

dswarbrick commented 1 year ago

Generally it would not be advisable to filter by something as specific as drive model number. Probably the only time I've had to do that was when I created some alerting rules for warning about known-buggy firmware versions on particular Intel SSD models. I would recommend removing whatever in that dashboard is adding that label filter, although you may also want to take the issue up with the Grafana support community, because it shouldn't be a problem.

The smartmon_attr_raw_value{name="temperature_celsius"} metric should give you the drive temperatures that you need. You are correct that the storcli.py collector does not expose physical disk temperatures (although that information is available from storcli - it's just not implemented in the collector script).

For the controller temperature, you will of course definitely need to use the storcli.py collector, and refer to the megaraid_temperature metric. You can get array state from the megaraid_vd_info metric.

stefangweichinger commented 1 year ago

@dswarbrick Thanks a lot for confirming and explaining. Currently I see some light. Editing a dashboard, getting sane values for most of the panels, seeing temps etc Yes, I removed these model-number queries. Maybe adding them later if I understand things better or so.

So the basics seem to work now, as far as I see: thanks!

Will your script in that branch be changed soon? Any plans? Just asking because I would then have to track that somehow.

I use https://grafana.com/grafana/dashboards/10530-s-m-a-r-t-disk-monitoring-for-prometheus-dashboard/ as a base and edit it right now. Looks quite old, some queries use metrics which I don't find in the output of your storcli.py. Not a big issue, I don't need all of them.

dswarbrick commented 1 year ago

@stefangweichinger There is a PR to merge it into master: #153

I'm trying to do some long-overdue cleanup and modernization of the collectors in this repo.

bjornfor commented 3 months ago

ah, I see. But no direct solution available. I also find https://github.com/prometheus-community/smartctl_exporter ... doesn't work very well either.

What's the problem with it? I recently migrated from smartmon.py to smartctl_exporter because the smartmon.py script doesn't expose SMART attribute ID 11 (Calibration Retry Count). And I later learned that smartmon.py doesn't work with NVMe drives either: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/110.

@stefangweichinger There is a PR to merge it into master: #153

Since that's merged now, does that mean this issue can be closed?

prometheus-community / node-exporter-textfile-collector-scripts

smartmon scripts: quote in value #149

deleting