Open stefangweichinger opened 1 year ago
Is smartmon.py or smartmon.sh "better"?
Tried this:
# curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]={__name__="smartmon_device_info", device="/dev/bus/0", instance="that.server.tld:9100", job="node", model_family="Toshiba 3.5" DT01ACA... Desktop HDD"}'
Not yet fully successful ...
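One likely reason such a delete fails is the unescaped double quote inside the model_family matcher: in a PromQL label matcher, a literal `"` must be written as `\"`. A minimal sketch of building a correctly escaped request URL (assuming Prometheus was started with --web.enable-admin-api):

```python
import urllib.parse

# PromQL string literals use backslash escapes, so the literal double
# quote inside the model_family value must be written as \" here.
selector = (
    'smartmon_device_info{model_family='
    '"Toshiba 3.5\\" DT01ACA... Desktop HDD"}'
)

# URL-encode the selector so the shell and HTTP layers don't mangle it.
base = "http://localhost:9090/api/v1/admin/tsdb/delete_series"
url = base + "?match[]=" + urllib.parse.quote(selector)
print(url)
```

The resulting URL can then be POSTed (e.g. with `curl -X POST`), followed by a POST to /api/v1/admin/tsdb/clean_tombstones to actually free the space.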
There is discussion in #119 about replacing both existing smartmon scripts with a new Python script that parses the JSON output of smartctl, which would likely avoid issues such as this.
ah, I see. But no direct solution available. I also find https://github.com/prometheus-community/smartctl_exporter ... doesn't work very well either.
@stefangweichinger I am working on an interim refactor of the existing smartmon.py to use the official Prometheus client_python library for rendering the metrics. This will solve any quoting issues.
However, since the homegrown metric printing functions in smartmon.py take some liberties with the validity of the metrics format, it will take a little while.
Wow, sounds good. Good luck with this. I will try to come up with some ugly workaround for those Toshiba disks (hopefully).
@stefangweichinger Please test smartmon.py from the smartmon.py-registry branch of this repo.
The original code made heavy use of generators, which was only possible because it took some shortcuts with label consistency. When using the official Prometheus client libraries, a metric's label names must be declared in advance. If a particular data point has no value for a certain label, it should be set to the empty string. The official client libraries render all declared labels, even when they are empty strings.
However, most homegrown metric rendering code that I've seen ignores this completely, and data points often have variable label sets. node_exporter tolerates this when reading the .prom files from disk, and ensures consistent label sets when exposing them to a Prometheus scrape.
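A minimal sketch of that label-consistency rule (not the actual client_python API): label names are declared once per metric, and every declared label is rendered for every sample, defaulting to the empty string.

```python
# Declared up front, as the official client libraries require.
LABEL_NAMES = ("device", "disk", "model_family")

def render_sample(name, labels, value):
    # Missing labels default to "", so every sample carries the same
    # label set, even when some values are empty.
    rendered = ",".join(
        '{}="{}"'.format(n, labels.get(n, "")) for n in LABEL_NAMES
    )
    return "{}{{{}}} {}".format(name, rendered, value)

print(render_sample("smartmon_device_info",
                    {"device": "/dev/sda", "disk": "0"}, 1))
```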
@dswarbrick thanks, will try asap
@dswarbrick The script works fine so far; I get data/metrics. So far no graphs in the dashboard - I think the dashboard I use has the wrong queries. For example, it seems to mix up "device" and "disk". Do you have a dashboard that you know this new script should work with? Thank you!
Ah, btw: that one disk is parsed like this:
smartmon_device_info{device="/dev/bus/0",device_model="TOSHIBA DT01ACA100",disk="megaraid,5",firmware_version="MS2OA7C0",lun_id="",model_family="Toshiba 3.5\" DT01ACA... Desktop HDD",product="",revision="",serial_number="75NHT0MNS",vendor=""} 1.0
So that quote is escaped inside the field. That might be OK for Prometheus and Grafana, I assume.
@stefangweichinger Sorry, I can't advise you on which dashboard is best to use. I usually make my own dashboards, and just use the community ones for inspiration occasionally. I suspect that you might not have the best success with community dashboards, since your disks are behind a MegaRAID controller. The dashboard that you're using might be geared more towards consumer-grade setups, i.e. it expects disks to be typical /dev/sda, /dev/sdb etc.
The escaped double-quote should be fine (and in fact the original script should have been doing that anyway) - see the string replace in:
labels = ','.join(
    '{k}="{v}"'.format(k=k, v=v.replace('"', '\\"')) for k, v in metric.labels.items())
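Note that this replace only handles double quotes. The Prometheus text exposition format also requires escaping backslashes and newlines in label values; a fuller sketch:

```python
def escape_label_value(v):
    # Prometheus text exposition format: in label values, backslashes
    # must be escaped first, then double quotes and newlines.
    return v.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")

# The model_family value from the Toshiba disk above:
print(escape_label_value('Toshiba 3.5" DT01ACA... Desktop HDD'))
# -> Toshiba 3.5\" DT01ACA... Desktop HDD
```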
@dswarbrick OK ... I have to reduce the moving parts here. I've decided to use your latest script to collect data; the .prom files look OK to me, and scraping/exporting seems to work, so only the displaying and querying seem to be a problem right now. I will now try to completely get rid of all smartmon-related time series in Prometheus, to remove the problematic entries from that Toshiba disk. Then I will try to decide which dashboard to use. I had to edit the variables in one of them; it used an outdated(?) field in the metrics.
@dswarbrick
For example this query:
avg(smartmon_temperature_celsius_raw_value{instance=~"$instance",disk=~"$disk",type=~"$type",serial_number=~"$serial_number",model_family=~"$model_family",device_model=~"$device_model"}) by (device_model,serial_number)
I only get graphs for one of three servers currently. I will make sure to use the same script everywhere and clean things up.
And yes: disks behind that MegaRAID hardware give additional headaches.
Perhaps I can offer a little more clarification, as I have used this collector in the past at a previous job, with a mixture of simple HBA (or onboard SATA) setups, as well as MegaRAID controllers.
With onboard SATA controllers or HBAs that expose each disk to the OS as individual drives, the device label will be the block device name that you're familiar with, e.g. /dev/sda, /dev/sdb etc., and the disk label should be 0.
However, consider a MegaRAID controller with multiple disks attached, and e.g. a four-disk RAID5 array. The RAID5 virtual drive may be exposed to the OS as e.g. /dev/sdb, but smartctl needs a way to address the individual physical disks that make up that virtual drive. In such a case, the device label refers to the MegaRAID controller, and the disk label uses a vendor-specific format to identify physical disks on that controller.
If you had multiple MegaRAID controllers, the device labels would have values /dev/bus/0, /dev/bus/1, etc. The disk labels would contain potentially overlapping IDs - however, the combination of device and disk labels should still uniquely identify a single physical disk.
If you're using a dashboard that primarily focuses on the device label, that will likely only work for onboard SATA & HBA setups. For MegaRAID controllers, the dashboard will need to take into account both the device and disk labels.
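For example, a temperature query that stays unique on MegaRAID setups could aggregate by both labels (a sketch, assuming the metric names produced by this repo's smartmon.py):

```promql
avg by (device, disk) (
  smartmon_attr_raw_value{name="temperature_celsius"}
)
```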
@dswarbrick thanks a lot, yes, understood.
I managed to write a shell script that removes all metrics starting with "smartmon" from prometheus.
Wrote an Ansible playbook to deploy your smartmon.py with working systemd units (.service, .timer).
So I should start from scratch regarding these metrics.
Data is scraped, but in Grafana I still get stuff like:
bad_data: invalid parameter "query": 1:870: parse error: error parsing regexp: missing closing ): `^(?:(Micron 5100 Pro / 52x0 / 5300 SSDs|N/A|Samsung based SSDs|Seagate Barracuda 7200\.14 \(AF\)|Toshiba 3\.5)$`
Disabled scraping on that backup server with the Toshiba disk and removed the metrics again, to maybe get closer.
Still the ugly errors in Grafana; maybe something is cached that I have to wait out (yes, I reopened the dashboard).
And I was only getting data from one server (instead of the 2 enabled, 1 disabled) - fixed by the Python library ...
Grepping the .prom file from your smartmon.py, it looks as if I should use this metric for my needs:
smartmon_attr_raw_value{device="/dev/bus/0",disk="megaraid,2",name="temperature_celsius"} 21.0
correct?
In my dashboards I see queries with other metrics, which I have to get rid of (although it is tempting to build on their queries rather than start from scratch).
Sorry for getting slightly offtopic somehow.
@stefangweichinger Are you writing that regex by hand yourself? If so, it does indeed have a missing closing parenthesis. This should work:
^(?:(Micron 5100 Pro / 52x0 / 5300 SSDs|N/A|Samsung based SSDs|Seagate Barracuda 7200\.14 \(AF\)|Toshiba 3\.5))$
You might need to escape the forward-slashes also, since they are delimiters in JavaScript regexes.
If on the other hand that regex has been generated by Grafana (e.g. as a result of a multiple-select drop-down list), that smells like a bug.
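The corrected pattern can be sanity-checked outside Grafana; Python's re syntax is close enough to RE2 for this expression:

```python
import re

# The corrected pattern from above, with the closing parenthesis restored.
pattern = re.compile(
    r"^(?:(Micron 5100 Pro / 52x0 / 5300 SSDs|N/A|Samsung based SSDs"
    r"|Seagate Barracuda 7200\.14 \(AF\)|Toshiba 3\.5))$"
)

print(bool(pattern.match("Toshiba 3.5")))   # True
print(bool(pattern.match("Toshiba 2.5")))   # False
```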
You don't specify what exactly your needs are, but if you are wanting to graph the drive temperature, then yes, that metric is a good starting point.
The nature of time-series databases such as Prometheus means that all mistakes are eventually forgotten. If it doesn't bother you too much, just wait for the buggy metrics to age out. It's much less hassle than trying to delete series.
@dswarbrick No, that regex comes from the query in the dashboard, I assume.
My needs: basically I'd be happy with some information as given by "megaclisas-status":
# megaclisas-status
-- Controller information --
-- ID | H/W Model | RAM | Temp | BBU | Firmware
c0 | RAID Ctrl SAS 6G 5/6 512MB (D2616) | 512MB | N/A | Good | FW: 12.12.0-0174
-- Array information --
-- ID | Type | Size | Strpsz | Flags | DskCache | Status | OS Path | CacheCade |InProgress
c0u0 | RAID-1 | 279G | 64 KB | ADRA,WB | Enabled | Optimal | /dev/sda | None |None
c0u1 | RAID-1 | 558G | 64 KB | ADRA,WB | Enabled | Optimal | /dev/sdb | None |None
c0u2 | RAID-10 | 1635G | 64 KB | ADRA,WB | Enabled | Optimal | /dev/sdc | None |None
-- Disk information --
-- ID | Type | Drive Model | Size | Status | Speed | Temp | Slot ID | LSI ID
c0u0p0 | HDD | TOSHIBA MBF2300RC 5212EB07PD80AAKH | 278.8 Gb | Online, Spun Up | 6.0Gb/s | 32C | [252:0] | 5
c0u0p1 | HDD | TOSHIBA MBF2300RC 5212EB07PD90APEP | 278.8 Gb | Online, Spun Up | 6.0Gb/s | 31C | [252:4] | 4
c0u1p0 | HDD | WD WD6001BKHG-50D22SFX9WD-WXD1E63RVKV2 | 558.4 Gb | Online, Spun Up | 6.0Gb/s | 30C | [252:1] | 7
c0u1p1 | HDD | WD WD6001BKHG-50D22SFX9WD-WX11E4390674 | 558.4 Gb | Online, Spun Up | 6.0Gb/s | 30C | [252:5] | 8
c0u2s0p0 | HDD | WD WD9001BKHG-50D22SFX9WD-WXD1E63SWTY4 | 837.8 Gb | Online, Spun Up | 6.0Gb/s | 32C | [252:2] | 6
c0u2s0p1 | HDD | TOSHIBA AL14SEB090N 5203Y7G0A00BFSEC | 837.8 Gb | Online, Spun Up | Unknown | 31C | [252:7] | 11
c0u2s1p0 | HDD | TOSHIBA AL14SEB090N 5203Y7H0A01JFSEC | 837.8 Gb | Online, Spun Up | Unknown | 31C | [252:3] | 10
c0u2s1p1 | HDD | TOSHIBA AL14SEB090N 52037870A04WFSEC | 837.8 Gb | Online, Spun Up | Unknown | 31C | [252:6] | 12
I want the temperatures. The rest would be nice, but not needed. Basically I want some alerts on the temps, done.
I had another look at storcli.py, but that doesn't give me PD temps, as I noticed a few days ago.
Maybe I run circles here, definitely losing track of which route to follow.
Generally it would not be advisable to filter by something as specific as drive model number. Probably the only time I've had to do that was when I created some alerting rules for warning about known-buggy firmware versions on particular Intel SSD models. I would recommend removing whatever in that dashboard is adding that label filter, although you may also want to take the issue up with the Grafana support community, because it shouldn't be a problem.
The smartmon_attr_raw_value{name="temperature_celsius"} metric should give you the drive temperatures that you need. You are correct that the storcli.py collector does not expose physical disk temperatures (although that information is available from storcli - it's just not implemented in the collector script).
For the controller temperature, you will definitely need to use the storcli.py collector, and refer to the megaraid_temperature metric. You can get array state from the megaraid_vd_info metric.
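Given that metric, an alerting rule for the temperature goal mentioned earlier could look roughly like this (a sketch; the 50 °C threshold, group name, and severity label are placeholders to adjust):

```yaml
groups:
  - name: smart-temperature
    rules:
      - alert: DiskTemperatureHigh
        # 50 is an arbitrary example threshold; tune for your hardware.
        expr: 'smartmon_attr_raw_value{name="temperature_celsius"} > 50'
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: >-
            Disk {{ $labels.disk }} on {{ $labels.device }}
            ({{ $labels.instance }}) is above 50C
```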
@dswarbrick Thanks a lot for confirming and explaining. Currently I see some light: editing a dashboard, getting sane values for most of the panels, seeing temps, etc. Yes, I removed those model-number queries. Maybe I'll add them back later, once I understand things better.
So the basics seem to work now, as far as I see: thanks!
Will your script in that branch be changed soon? Any plans? Just asking because I would then have to track that somehow.
I use https://grafana.com/grafana/dashboards/10530-s-m-a-r-t-disk-monitoring-for-prometheus-dashboard/ as a base and edit it right now. It looks quite old; some queries use metrics which I don't find in the output of your storcli.py. Not a big issue, I don't need all of them.
@stefangweichinger There is a PR to merge it into master: #153
I'm trying to do some long-overdue cleanup and modernization of the collectors in this repo.
ah, I see. But no direct solution available. I also find https://github.com/prometheus-community/smartctl_exporter ... doesn't work very well either.
What's the problem with it? I recently migrated from smartmon.py to smartctl_exporter because the smartmon.py script doesn't expose SMART attribute ID 11 (Calibration Retry Count). And I later learned that smartmon.py doesn't work with NVMe drives either: https://github.com/prometheus-community/node-exporter-textfile-collector-scripts/issues/110.
@stefangweichinger There is a PR to merge it into master: #153
Since that's merged now, does that mean this issue can be closed?
I fiddle with extracting data out of smartctl. A disk gives me this:
The problem is in
There's an escaped quote after "3.5", which leads to issues with the text exporter. I browsed various forks and versions of smartmon.sh and smartmon.py.
I currently seem to have the extraction fixed, but I still have some problematic data lines/metrics in Prometheus. Sorry for being a bit offtopic: how can I delete these smartmon-related time series without deleting too much? I need to clean this up before I can dig further and check whether the scripts work now.
I don't know if having quotes in the SMART values is OK.