Closed zxzharmlesszxz closed 6 months ago
I might be missing something -- how does this cause smartctl -dmegaraid,N to be run?
Through the -d flag I think, but it's not clear to me how the device type is determined.
I'm thinking that the PR description isn't quite what it does.
No need to set any devices. You can set device-include/device-exclude with the right regex. My variant discovers the device type and then uses it in the scrape.
There is no standalone /dev/xxx entry, though. Or are you saying that one would device-include something like /dev/bus/0 -d megaraid,0 or -d megaraid,0 /dev/sda? Strictly speaking, those aren't devices as such but smartctl commands. It would seem that this strategy places the onus of device discovery on the admin, who has to write code to handle local conditions? I'm not arguing, I want to understand the approach here.
Personally I despise HBA RAID but I have thousands of them that aren't going away anytime soon.
Dell's BOSS-N1, notably, lacks any way to pass through plain drives, it only exposes VDs to the system.
Right now I have this situation:
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
where
/dev/sdg -d scsi # /dev/sdg, SCSI device
is a RAID 0 built from
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
and /dev/sdg fails every scan, but metrics can be read successfully from /dev/bus/0 -d megaraid,0 and /dev/bus/0 -d megaraid,1 separately.
In my implementation of the exporter, the device type (-d megaraid,0; -d cciss,0; etc.) is automatically added to the smartctl command.
But I don't see code that adds "-d megaraid"
func readSMARTctl(logger log.Logger, device Device) (gjson.Result, bool) {
start := time.Now()
out, err := exec.Command(*smartctlPath, "--json", "--info", "--health", "--attributes", "--tolerance=verypermissive", "--nocheck=standby", "--format=brief", "--log=error", device.Name, "-d", device.Type).Output()
It's "-d", device.Type
LGTM but I suspect that someone with more privs will need to approve/merge.
You'll also need to sign-off the commits, see the Details link next to DCO above.
# /opt/MegaRAID/storcli/storcli64 /c0 /eall /sall show
CLI Version = 007.2408.0000.0000 Nov 15, 2022
Operating system = Linux 5.18.15-1.el8.elrepo.x86_64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.
Drive Information :
=================
----------------------------------------------------------------------------------
EID:Slt DID State DG Size Intf Med SED PI SeSz Model Sp Type
----------------------------------------------------------------------------------
32:0 0 JBOD - 186.310 GB SATA SSD N N 512B INTEL SSDSC2BX200G4R U -
32:1 1 JBOD - 931.512 GB SATA HDD N N 512B TOSHIBA MG03ACA100 U -
32:2 2 JBOD - 931.512 GB SATA HDD N N 512B TOSHIBA MG03ACA100 U -
32:3 3 JBOD - 931.512 GB SATA HDD N N 512B TOSHIBA MG03ACA100 U -
----------------------------------------------------------------------------------
# /opt/index/sbin/smartctl -a -dmegaraid,0 /dev/sda
smartctl 7.3 2022-02-28 r5338 [x86_64-linux-5.18.15-1.el8.elrepo.x86_64] (local build)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Intel 730 and DC S35x0/3610/3700 Series SSDs
Device Model: INTEL SSDSC2BX200G4R
Serial Number: BTHC706208H1200TGN
LU WWN Device Id: 5 5cd2e4 14da4858f
Add. Product Id: DELL(tm)
Firmware Version: G201DL2E
User Capacity: 200,049,647,616 bytes [200 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
TRIM Command: Available, deterministic, zeroed
Device is: In smartctl database 7.3/5319
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Tue Mar 5 01:20:36 2024 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: ATA return descriptor not supported by controller firmware
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x02) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 2) seconds.
Offline data collection
capabilities: (0x79) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 60) minutes.
Conveyance self-test routine
recommended polling time: ( 60) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000e 130 130 039 Old_age Always - 11408566
5 Reallocated_Sector_Ct 0x0033 100 100 001 Pre-fail Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 48407
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 32
13 Read_Soft_Error_Rate 0x001e 130 130 000 Old_age Always - 11408566
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 30
179 Used_Rsvd_Blk_Cnt_Tot 0x0033 100 100 010 Pre-fail Always - 0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0032 100 100 000 Old_age Always - 5453
181 Program_Fail_Cnt_Total 0x003a 100 100 000 Old_age Always - 0
182 Erase_Fail_Count_Total 0x003a 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Internal 0x0022 100 100 000 Old_age Always - 28
195 Hardware_ECC_Recovered 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
201 Unknown_SSD_Attribute 0x0033 100 100 010 Pre-fail Always - 1237594477198
202 Unknown_SSD_Attribute 0x0027 100 100 000 Pre-fail Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2098414
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 102400
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 0
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 3786170728
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 2098414
234 Thermal_Throttle 0x0032 100 100 000 Old_age Always - 0/0
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 2098414
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 16492
245 Unknown_Attribute 0x0032 097 097 000 Old_age Always - 97
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 3 -
# 2 Extended offline Completed without error 00% 1 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
@zxzharmlesszxz Before this can be merged, you will need to sign all commits. Visit the checks tab and follow the instructions.
To help with testing, can you provide the output of smartctl --scan --json? Put the output in spoilers to improve readability.
How long will it take for this to be merged?
We are all volunteers. Please be patient with the review and merge process.
@zxzharmlesszxz something that @NiceGuyIT didn't say, but would make the process MUCH easier is rebasing this onto the upstream master, to one or more logically separate commits.
I'm going to include more comments inline to your most recent changes.
/opt/index/sbin/smartctl --scan ./smartctl_exporter --help ./smartctl_exporter --smartctl.path=/opt/index/sbin/smartctl & curl http://localhost:9633/metrics
Hi @anthonyeleven, I added spoilers to your comment to reduce the scrolling.
@zxzharmlesszxz something that @NiceGuyIT didn't say, but would make the process MUCH easier is rebasing this onto the upstream master, to one or more logically separate commits.
168 is an example where I fixed the SCSI/SAS metrics & labels.
I'm going to include more comments inline to your most recent changes.
Thank you @robbat2 for mentoring @zxzharmlesszxz with this PR.
@zxzharmlesszxz It may seem daunting to submit changes to a project when you're not an experienced developer; stick with it and the result will be worth it. As you saw in PR #107, this is not an easy fix. Other people's comments are to make sure we don't break X when fixing Y. Keep with it. 👍 You'll get there.
spoiler Ack, I'd not known that was a thing.
PR #107 provides a way to specify the type for only one device, but how would that work for multiple devices? In this PR (#205) I made changes to auto-discover the device types and then use them when scraping. Today I made a few small fixes to the code. Please tell me what I need to do next.
Do you provide all devices and other information to node_exporter too? Prometheus != Zabbix
I don't follow. I'm asking for the output of smartctl --scan --json because that's what smartctl_exporter uses to find devices. I need this information to evaluate this PR because I don't have devices that fit the criteria of this PR.
Please
CMD: smartctl --scan --json
OUTPUT:
{
  "json_format_version": [1, 0],
  "smartctl": {
    "version": [7, 2],
    "svn_revision": "5155",
    "platform_info": "x86_64-linux-5.10.0-20-amd64",
    "build_info": "(local build)",
    "argv": ["smartctl", "--scan", "--json"],
    "exit_status": 0
  },
  "devices": [
    { "name": "/dev/sda", "info_name": "/dev/sda", "type": "scsi", "protocol": "SCSI" },
    { "name": "/dev/sdb", "info_name": "/dev/sdb", "type": "scsi", "protocol": "SCSI" },
    { "name": "/dev/sdc", "info_name": "/dev/sdc", "type": "scsi", "protocol": "SCSI" },
    { "name": "/dev/sdd", "info_name": "/dev/sdd", "type": "scsi", "protocol": "SCSI" },
    { "name": "/dev/sde", "info_name": "/dev/sde", "type": "scsi", "protocol": "SCSI" },
    { "name": "/dev/sdf", "info_name": "/dev/sdf", "type": "scsi", "protocol": "SCSI" },
    { "name": "/dev/sdg", "info_name": "/dev/sdg", "type": "scsi", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_00]", "type": "megaraid,0", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_01]", "type": "megaraid,1", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_02]", "type": "megaraid,2", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_03]", "type": "megaraid,3", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_04]", "type": "megaraid,4", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_05]", "type": "megaraid,5", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_06]", "type": "megaraid,6", "protocol": "SCSI" },
    { "name": "/dev/bus/0", "info_name": "/dev/bus/0 [megaraid_disk_07]", "type": "megaraid,7", "protocol": "SCSI" }
  ]
}
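As a rough illustration of how scan output like this can be turned into a device list, the sketch below decodes the "devices" array with the standard library. The exporter itself uses gjson (per the readSMARTctl signature quoted earlier); encoding/json is swapped in here only to keep the example self-contained, and the scanDevice, scanResult and parseScan names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scanDevice holds the four fields each entry of "devices" carries.
type scanDevice struct {
	Name     string `json:"name"`
	InfoName string `json:"info_name"`
	Type     string `json:"type"`
	Protocol string `json:"protocol"`
}

// scanResult captures only the part of the document this sketch needs.
type scanResult struct {
	Devices []scanDevice `json:"devices"`
}

// parseScan decodes `smartctl --scan --json` output into a device slice.
func parseScan(raw []byte) ([]scanDevice, error) {
	var res scanResult
	if err := json.Unmarshal(raw, &res); err != nil {
		return nil, err
	}
	return res.Devices, nil
}

// sample is a two-entry excerpt of the output posted above.
const sample = `{"devices":[
 {"name":"/dev/sdg","info_name":"/dev/sdg","type":"scsi","protocol":"SCSI"},
 {"name":"/dev/bus/0","info_name":"/dev/bus/0 [megaraid_disk_00]","type":"megaraid,0","protocol":"SCSI"}
]}`

func main() {
	devices, err := parseScan([]byte(sample))
	if err != nil {
		panic(err)
	}
	for _, d := range devices {
		fmt.Printf("%s -d %s\n", d.Name, d.Type)
	}
}
```

Note that "name" alone is not unique here: all eight MegaRAID disks share /dev/bus/0 and differ only in "type" and "info_name".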
@zxzharmlesszxz much cleaner, thank you. Just rebase it to this point.
The existing json test data I added was for specific devices, and didn't capture the --scan variant. We should probably figure out how to add that, esp for devices that don't turn up in --scan.
Maybe but not in this PR
Hi! What is my next step?
Please clean up your commits; rebase and squash into a logical series.
Done. I have another question: who must set the new version and write the changelog?
That's handled by the release process automatically.
This just needs larger testing now, to see if it breaks existing configurations. I'm esp worried about those that ran with manually configured devices, because --scan did not support their hardware.
If run with --smartctl.device=/dev/nvme0n1, the PR version does not show ANY devices. => this would be a breaking change, likely the logic should change, either don't scan when devices explicitly specified, or do a better job matching.
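One possible shape for the "do a better job matching" option, as a sketch only: explicitly listed devices always survive, and the scan is consulted just to fill in a type. The resolveDevices function and the fallback to "auto" (smartctl's default device type) are assumptions for illustration, not what the PR does.

```go
package main

import "fmt"

// Device pairs a device node with the type passed to smartctl -d.
type Device struct {
	Name string
	Type string
}

// resolveDevices returns the scan result unchanged when nothing was
// listed explicitly. Otherwise every explicit device is kept: with the
// scanned type(s) when the name matches, or with type "auto" when the
// scan did not report it. Hypothetical helper.
func resolveDevices(explicit []string, scanned []Device) []Device {
	if len(explicit) == 0 {
		return scanned
	}
	var out []Device
	for _, name := range explicit {
		matched := false
		for _, s := range scanned {
			if s.Name == name {
				out = append(out, s)
				matched = true
			}
		}
		if !matched {
			out = append(out, Device{Name: name, Type: "auto"})
		}
	}
	return out
}

func main() {
	scanned := []Device{{Name: "/dev/sda", Type: "scsi"}}
	for _, d := range resolveDevices([]string{"/dev/nvme0n1"}, scanned) {
		fmt.Printf("%s -d %s\n", d.Name, d.Type)
	}
}
```

A name such as /dev/bus/0 can match several scan entries, so all of them are kept rather than only the first.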
From a contact I got the --scan output for a system with multiple MegaRAID controllers, and it's bad news: the megaraid_disk_NN label is NOT unique. It must be combined with /dev/bus/N.
$ sudo smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/1 -d megaraid,2 # /dev/bus/1 [megaraid_disk_02], SCSI device
/dev/bus/1 -d megaraid,3 # /dev/bus/1 [megaraid_disk_03], SCSI device
/dev/bus/1 -d megaraid,4 # /dev/bus/1 [megaraid_disk_04], SCSI device
/dev/bus/1 -d megaraid,5 # /dev/bus/1 [megaraid_disk_05], SCSI device
Furthermore, the correct args to pass to smartctl are:
smartctl ... -d megaraid,2 /dev/bus/0 # disk at slot 2 on controller 0
smartctl ... -d megaraid,2 /dev/bus/1 # disk at slot 2 on controller 1
The device label is going to have to include the bus part somehow.
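The collision can be checked mechanically. The sketch below takes the MegaRAID entries from the scan output above and counts duplicate keys, first keyed by type alone and then by bus plus type. The collisions helper and the key functions are hypothetical names for this illustration.

```go
package main

import "fmt"

// Device pairs a device node with the type passed to smartctl -d.
type Device struct {
	Name string
	Type string
}

// multiController holds the MegaRAID entries from the scan shown above.
var multiController = []Device{
	{Name: "/dev/bus/0", Type: "megaraid,2"},
	{Name: "/dev/bus/1", Type: "megaraid,2"},
	{Name: "/dev/bus/1", Type: "megaraid,3"},
	{Name: "/dev/bus/1", Type: "megaraid,4"},
	{Name: "/dev/bus/1", Type: "megaraid,5"},
}

// collisions returns every key that more than one device maps to,
// in order of first appearance.
func collisions(devs []Device, key func(Device) string) []string {
	counts := map[string]int{}
	var order []string
	for _, d := range devs {
		k := key(d)
		if counts[k] == 0 {
			order = append(order, k)
		}
		counts[k]++
	}
	var dups []string
	for _, k := range order {
		if counts[k] > 1 {
			dups = append(dups, k)
		}
	}
	return dups
}

func byType(d Device) string        { return d.Type }
func byNameAndType(d Device) string { return d.Name + " " + d.Type }

func main() {
	fmt.Println("type only:     ", collisions(multiController, byType))
	fmt.Println("bus plus type: ", collisions(multiController, byNameAndType))
}
```

Keyed by type alone, megaraid,2 is claimed by two disks; keyed by bus plus type, every device is distinct.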
This is new information
I could swear I posted this earlier, but here is some detailed input from systems with two LSI HBAs and a mix of passthrough and VD and external embedded RAID devices. Here somehow the HBA's DID on the second HBA does increment from those on the first.
If this is not consistent behavior, it might be a function of the HBA models involved and/or their firmware revisions and/or SAS vs SATA, physical drives vs LUNS from an external embedded RAID array, etc. I've encountered more than a few ... quirks in LSI's lineup and firmware over the years.
I should think that systems with more than one HBA are relatively few and that functionality for such systems could be deferred to a future PR.
And of course there may be systems with a mix of HBA manufacturers, say an LSI and an Areca.
Here's one system
[root@dw24 ~]# /opt/MegaRAID/storcli/storcli64 /call show | grep Product
Product Name = PERC H330 Mini
TSs=Temperature sensor count | Alms=Alarm count | SIM=SIM Count | ProdID=Product ID
Product Name = PERC H830 Adapter
TSs=Temperature sensor count | Alms=Alarm count | SIM=SIM Count | ProdID=Product ID
[root@dw24 ~]#
[root@dw24 ~]# smartctl -a /dev/bus/0 -d megaraid,0 | grep Serial
Serial number: S7K1W1AX
[root@dw24 ~]# smartctl -a /dev/bus/11 -d megaraid,2 | grep Serial
Serial number: 7PGMPXLG
[root@dw24 ~]#
Here's a different system:
[root@dw25 ~]# /opt/MegaRAID/storcli/storcli64 /call show | grep Product
Product Name = PERC H330 Mini
TSs=Temperature sensor count | Alms=Alarm count | SIM=SIM Count | ProdID=Product ID
Product Name = Dell 12Gbps HBA
[root@dw25 ~]#
Note the lack of an EID for the second HBA, I've never seen that before. This has Dell MD34xx external embedded RAID arrays.
In an ideal world I might like the metric labels to reflect the controller number instead of an elliptical /dev/busXX, to make it easier to correlate with physical drives for replacement, but I recognize that would probably be a lot more work.
Add bus name & bus number to disk name, example: bus_0_megaraid_disk_01
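A sketch of how such a label could be derived from the "name" and "info_name" fields of the scan output. The deviceLabel function and its regex are assumptions made for illustration; the PR's actual regex may differ.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bracketLabel captures the text inside the square brackets of an
// info_name such as "/dev/bus/0 [megaraid_disk_01]".
var bracketLabel = regexp.MustCompile(`\[(.+?)\]`)

// deviceLabel returns the device node without "/dev/" for plain devices
// and prefixes the bracketed disk label with the bus path otherwise.
// Hypothetical helper.
func deviceLabel(name, infoName string) string {
	short := strings.TrimPrefix(name, "/dev/")
	m := bracketLabel.FindStringSubmatch(infoName)
	if m == nil {
		return short
	}
	return strings.ReplaceAll(short, "/", "_") + "_" + m[1]
}

func main() {
	fmt.Println(deviceLabel("/dev/bus/0", "/dev/bus/0 [megaraid_disk_01]"))
	fmt.Println(deviceLabel("/dev/sda", "/dev/sda"))
}
```

Because the bus path is part of the label, slot 2 on /dev/bus/0 and slot 2 on /dev/bus/1 no longer produce the same device value.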
Any updates?
@robbat2 @anthonyeleven @k0ste @bitfehler Can one of you test this PR? I don't have any systems that require specifying a device.
I trialed the state as of ~~ 27 days ago https://github.com/prometheus-community/smartctl_exporter/pull/205#issuecomment-1977769288
I've been consumed with other things. Has the code since changed? I'd be happy to rebuild it and post the results, if we can agree on the classes of systems needed so I can push around the binary and collect everything necessary in one go (so to speak). I do have a mix of systems with various combinations of HBA RAID SAS/SATA drives, passthrough SAS/SATA drives, multiple HBAs, and NVMe drives.
@anthonyeleven Yes, the code has changed since then; 2 weeks ago.
As for what to test, any system that requires a -d/--device be passed to smartctl. Pay attention to the device names in Prometheus as that uses a regex to extract the name. Duplicate, missing or malformed names would be suspect.
A previous PR added a ton of smartctl output for test data. Now we need smartctl --scan --json data to test PRs such as this. I'll open a separate issue for this request.
I'll try to do some test runs this weekend. Have had a longrunning swarm at work.
On May 3, 2024, at 20:24, David Randall @.***> wrote:
@NiceGuyIT commented on this pull request.
In .gitignore https://github.com/prometheus-community/smartctl_exporter/pull/205#discussion_r1589824475:
@@ -3,6 +3,7 @@
 /.release
 /.tarballs
 debug/
+.idea/
I don't know what I was thinking. .idea/ is in .gitignore.
@anthonyeleven Completely understand the lack of time. This PR is ready to merge unless you find something.
Thanks for the feedback @anthonyeleven!
@zxzharmlesszxz Thank you for your contribution! Sorry for the long delay; the checks and balances are complete.
Thank you for your contribution.
I've had a weird issue with the new --device= flag passed to smartctl.
--json --scan output part for /dev/sdi:
{
"name": "/dev/sdi",
"info_name": "/dev/sdi",
"type": "scsi",
"protocol": "SCSI"
},
Then, with the new --device=scsi parameter, smartctl returns error code 4 without proper data. The disk itself is a SATA disk. When given --device=auto or without --device, everything is working again.
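For context on the "error code 4": smartctl's exit status is a bitmask, and value 4 is bit 2, which the man page describes as a SMART or other ATA command to the disk having failed. The sketch below decodes a status into its set bits. The decodeExit helper is hypothetical, the descriptions are paraphrased from the smartctl man page, and it assumes the reported code is the process exit status.

```go
package main

import "fmt"

// exitBitMeaning lists the eight exit-status bits of smartctl,
// paraphrased from its man page (index = bit number).
var exitBitMeaning = [8]string{
	"command line did not parse",
	"device open failed or device did not identify itself",
	"a SMART or other ATA command failed, or a checksum error was found",
	"SMART status check returned DISK FAILING",
	"prefail attributes found at or below threshold",
	"attributes were at or below threshold at some time in the past",
	"device error log contains records of errors",
	"device self-test log contains records of errors",
}

// decodeExit returns the description of every bit set in status.
func decodeExit(status int) []string {
	var out []string
	for bit, msg := range exitBitMeaning {
		if status&(1<<bit) != 0 {
			out = append(out, msg)
		}
	}
	return out
}

func main() {
	for _, msg := range decodeExit(4) {
		fmt.Println(msg)
	}
}
```

That reading would be consistent with ATA commands not reaching a SATA disk when it is addressed as plain scsi, although the thread does not confirm the cause.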
Yes, the current code is adapted for Dell hardware. If your hardware is not Dell, then your SATA HDD no longer works. So it's OK, because that's what this patch was trying to achieve, isn't it?
I'm sure I do not understand something, I didn't read every single message in this pull request. You're saying this change is not intended for non-Dell hardware. It's already merged into master, which (for me) means that it should generally work for everything. But it does not work for everything, as you (and me) are stating.
I also didn't find any option to get the old behavior. That is, forcing calling smartctl without --device or force --device=auto or something like this.
Background: I've been working/waiting on issue #197 and due to this change getting merged into the master branch, my pull request needed to be rebased. So I rebased my changes onto the new master, tried out my resulting code, and saw the weird failures on only two of my SATA disks. Previously (before this got merged), this was not the case.
I agree, out of the box it shouldn't break.
Also, I got errors even on Dell HW, see above.
You're saying this change is not intended for non-Dell hardware. It's already merged into master, which (for me) means that it should generally work for everything. But it does not work for everything, as you (and me) are stating.
Actually, this was merged, but not released
I also didn't find any option to get the old behavior. That is, forcing calling smartctl without --device or force --device=auto or something like this.
It doesn't exist. Currently, this is a breaking change.
Let me try to explain, in case you haven't read all the messages. Here are 4 devices:
[root@prom-test smartctl_exporter]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
..
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
But for smartctl_exporter there are now 8 devices in total:
smartctl_device{ata_additional_product_id="DELL(tm)",ata_version="ATA8-ACS (minor revision not indicated)",device="bus_0_megaraid_disk_00",firmware_version="FL2H",form_factor="3.5 inches",interface="sat+megaraid,0",model_family="Toshiba 3.5\" MG03ACAxxx(Y) Enterprise HDD",model_name="TOSHIBA MG03ACA100",protocol="ATA",sata_version="SATA 3.0",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H7KIVAF"} 1
smartctl_device{ata_additional_product_id="DELL(tm)",ata_version="ATA8-ACS (minor revision not indicated)",device="bus_0_megaraid_disk_01",firmware_version="FL2H",form_factor="3.5 inches",interface="sat+megaraid,1",model_family="Toshiba 3.5\" MG03ACAxxx(Y) Enterprise HDD",model_name="TOSHIBA MG03ACA100",protocol="ATA",sata_version="SATA 3.0",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H3K3RTF"} 1
smartctl_device{ata_additional_product_id="DELL(tm)",ata_version="ATA8-ACS (minor revision not indicated)",device="bus_0_megaraid_disk_02",firmware_version="FL2H",form_factor="3.5 inches",interface="sat+megaraid,2",model_family="Toshiba 3.5\" MG03ACAxxx(Y) Enterprise HDD",model_name="TOSHIBA MG03ACA100",protocol="ATA",sata_version="SATA 3.0",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H2K9TRF"} 1
smartctl_device{ata_additional_product_id="DELL(tm)",ata_version="ATA8-ACS (minor revision not indicated)",device="bus_0_megaraid_disk_03",firmware_version="FL2H",form_factor="3.5 inches",interface="sat+megaraid,3",model_family="Toshiba 3.5\" MG03ACAxxx(Y) Enterprise HDD",model_name="TOSHIBA MG03ACA100",protocol="ATA",sata_version="SATA 3.0",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H7KIV0F"} 1
smartctl_device{ata_additional_product_id="unknown",ata_version="",device="sda",firmware_version="",form_factor="3.5 inches",interface="scsi",model_family="unknown",model_name="unknown",protocol="SCSI",sata_version="",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H7KIVAF"} 1
smartctl_device{ata_additional_product_id="unknown",ata_version="",device="sdb",firmware_version="",form_factor="3.5 inches",interface="scsi",model_family="unknown",model_name="unknown",protocol="SCSI",sata_version="",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H7KIV0F"} 1
smartctl_device{ata_additional_product_id="unknown",ata_version="",device="sdc",firmware_version="",form_factor="3.5 inches",interface="scsi",model_family="unknown",model_name="unknown",protocol="SCSI",sata_version="",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H3K3RTF"} 1
smartctl_device{ata_additional_product_id="unknown",ata_version="",device="sdd",firmware_version="",form_factor="3.5 inches",interface="scsi",model_family="unknown",model_name="unknown",protocol="SCSI",sata_version="",scsi_product="",scsi_revision="",scsi_vendor="",scsi_version="",serial_number="56H2K9TRF"} 1
If a PromQL query joins on the serial_number label, Prometheus returns a many-to-many matching error.
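To make the duplication concrete, here is a minimal Python sketch (not part of the exporter) that groups smartctl_device series by serial_number and reports every serial that shows up under more than one device label. The sample lines are abbreviated from the output above, and the helper name `duplicate_serials` is made up for illustration.

```python
import re
from collections import defaultdict

# Abbreviated smartctl_device series from the output above (most labels trimmed).
SAMPLE = '''
smartctl_device{device="bus_0_megaraid_disk_00",interface="sat+megaraid,0",serial_number="56H7KIVAF"} 1
smartctl_device{device="bus_0_megaraid_disk_01",interface="sat+megaraid,1",serial_number="56H3K3RTF"} 1
smartctl_device{device="sda",interface="scsi",serial_number="56H7KIVAF"} 1
smartctl_device{device="sdc",interface="scsi",serial_number="56H3K3RTF"} 1
'''

# Matches name="value" pairs, allowing escaped quotes inside the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def duplicate_serials(text):
    """Return {serial_number: [device, ...]} for serials seen on 2+ series."""
    by_serial = defaultdict(list)
    for line in text.splitlines():
        if not line.startswith("smartctl_device{"):
            continue
        labels = dict(LABEL_RE.findall(line))
        by_serial[labels["serial_number"]].append(labels["device"])
    return {s: d for s, d in by_serial.items() if len(d) > 1}


if __name__ == "__main__":
    for serial, devices in duplicate_serials(SAMPLE).items():
        print(serial, devices)
```

Any serial reported here is exposed twice, once through the megaraid path and once through the plain /dev/sdX path, which is the situation that breaks a join on serial_number.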
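So we need to think about the exporter's design first, and only then release the new code.
My suggestions are:
- Add a --megaraid option. With this option the new megaraid logic is enabled. Non-megaraid devices should be filtered out by the administrator via a regex filter, for example --smartctl.device-exclude="^/dev/sd[a-z]+"
- Add a --megaraid-only option. This tells the exporter to use only megaraid devices, which fixes the doubled devices.
- Add no new options, but apply the new device_type logic only to megaraid devices, so that the device_type option is omitted for smartmontools by default (the current exporter behavior). In this case the administrator should filter megaraid devices by hand via a regex filter. We actually use this filter in all our exporter deployments, to keep non-disk devices out of the device counter:
[root@infra1:/]# cat /etc/conf.d/smartctl_exporter
OPTIONS="--smartctl.interval=600s --smartctl.device-exclude=^/dev/bus/[0-9]+$ --web.listen-address=192.168.102.254:9633
- Another variant...
Hope this helped clear things up 🙏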
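The regex filtering mentioned in the suggestions above can be sketched in a few lines of Python. This is only an illustration: the function name `apply_exclude` is hypothetical, and it assumes the exclude pattern is searched unanchored against the device name, so the pattern itself has to carry ^ and $ where needed.

```python
import re

# Device names as listed by `smartctl --scan` earlier in this thread.
SCANNED = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/bus/0"]


def apply_exclude(devices, pattern):
    """Drop every device whose name matches the exclude pattern.

    Assumption: the pattern is searched unanchored, so the regex itself
    must contain ^ and $ when an exact match is wanted.
    """
    rx = re.compile(pattern)
    return [d for d in devices if not rx.search(d)]


if __name__ == "__main__":
    # Keep only the megaraid bus device.
    print(apply_exclude(SCANNED, r"^/dev/sd[a-z]+"))
    # Keep only the plain /dev/sdX devices.
    print(apply_exclude(SCANNED, r"^/dev/bus/[0-9]+$"))
```

Either filter removes one of the two views of the same physical drives, which is what avoids the doubled series.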
Let's be sure that whatever is done works on systems that have a mix of:
On May 10, 2024, at 13:07, Konstantin Shalygin @.***> wrote:
You're saying this change is not intended for non-Dell hardware. It's already merged into master, which (for me) means that it should generally work for everything. But it does not work for everything, as you (and I) are stating.
Actually, this was merged, but not released.
I also didn't find any option to get the old behavior. That is, forcing smartctl to be called without --device, or forcing --device=auto, or something like that.
It doesn't exist. Currently, this is a breaking change.
Let me try to explain, in case you haven't read all the messages. Here are 4 devices:
@.*** smartctl_exporter]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
..
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
But for smartctl_exporter there are now 8 devices in total.
Drives that are passed through on an LSI HBA
How it works with Broadcom HBA 9500-16i
3b:00.0 Serial Attached SCSI controller: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx
Subsystem: Broadcom / LSI 9500-16i Tri-Mode HBA
And the same server with the current master
[root@host]# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device
/dev/sdr -d scsi # /dev/sdr, SCSI device
/dev/sds -d scsi # /dev/sds, SCSI device
/dev/sdt -d scsi # /dev/sdt, SCSI device
/dev/sdu -d scsi # /dev/sdu, SCSI device
/dev/sdv -d scsi # /dev/sdv, SCSI device
/dev/sdw -d scsi # /dev/sdw, SCSI device
/dev/sdx -d scsi # /dev/sdx, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device
For those who might not know, Broadcom == Avago == LSI == Dell PERC
On May 10, 2024, at 13:42, Konstantin Shalygin @.***> wrote:
Drives that are passed through on an LSI HBA
How it works with Broadcom HBA 9500-16i
3b:00.0 Serial Attached SCSI controller: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx
Subsystem: Broadcom / LSI 9500-16i Tri-Mode HBA
smartctl_exporter 0.12.0
ts=2024-05-10T17:36:51.045Z caller=main.go:140 level=info msg="Starting smartctl_exporter" version="(version=0.12.0, branch=master, revision=1.el8)"
ts=2024-05-10T17:36:51.045Z caller=main.go:141 level=info msg="Build context" build_context="(go=go1.21.7 (Red Hat 1.21.7-1.module_el8+960+4060efbe), platform=linux/amd64, @.***, date=20240304, tags=unknown)"
ts=2024-05-10T17:36:51.045Z caller=main.go:147 level=info msg="No devices specified, trying to load them automatically"
ts=2024-05-10T17:36:51.046Z caller=readjson.go:79 level=debug msg="Scanning for devices"
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sda
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdb
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdc
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdd
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sde
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdf
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdg
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdh
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdi
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdj
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdk
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdl
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdm
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdn
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdo
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdp
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdq
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdr
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sds
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdt
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdu
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdv
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdw
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/sdx
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/nvme0
ts=2024-05-10T17:36:51.065Z caller=main.go:120 level=info msg="Found device" name=/dev/nvme1
ts=2024-05-10T17:36:51.065Z caller=main.go:149 level=info msg="Number of devices found" count=26
ts=2024-05-10T17:36:51.065Z caller=main.go:158 level=info msg="Start background scan process"
ts=2024-05-10T17:36:51.065Z caller=main.go:159 level=info msg="Rescanning for devices every" rescanInterval=10m0s
ts=2024-05-10T17:36:51.351Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sda duration=286.192113ms
ts=2024-05-10T17:36:51.351Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sda family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:51.635Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdb duration=282.917796ms
ts=2024-05-10T17:36:51.636Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdb family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:51.703Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdc duration=65.960245ms
ts=2024-05-10T17:36:51.703Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdc family=unknown model=WUS721010ALE6L4
ts=2024-05-10T17:36:52.148Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdd duration=444.575557ms
ts=2024-05-10T17:36:52.148Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdd family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:52.184Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sde duration=34.875406ms
ts=2024-05-10T17:36:52.185Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sde family=unknown model=WUS721010ALE6L4
ts=2024-05-10T17:36:52.220Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdf duration=34.298321ms
ts=2024-05-10T17:36:52.220Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdf family=unknown model="WDC WUS721010ALE6L4"
ts=2024-05-10T17:36:52.251Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdg duration=30.437211ms
ts=2024-05-10T17:36:52.251Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdg family="Toshiba MG06ACA... Enterprise Capacity HDD" model="TOSHIBA MG06ACA10TE"
ts=2024-05-10T17:36:52.550Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdh duration=297.92271ms
ts=2024-05-10T17:36:52.550Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdh family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:52.831Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdi duration=279.672304ms
ts=2024-05-10T17:36:52.831Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdi family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:53.126Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdj duration=292.331348ms
ts=2024-05-10T17:36:53.126Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdj family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:53.399Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdk duration=272.004944ms
ts=2024-05-10T17:36:53.399Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdk family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:53.692Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdl duration=291.55157ms
ts=2024-05-10T17:36:53.692Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdl family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:53.724Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 64" device=/dev/sdm
ts=2024-05-10T17:36:53.724Z caller=readjson.go:140 level=warn msg="The device error log contains records of errors" device=/dev/sdm
ts=2024-05-10T17:36:53.724Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdm duration=31.055839ms
ts=2024-05-10T17:36:53.724Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdm family="Toshiba MG06ACA... Enterprise Capacity HDD" model="TOSHIBA MG06ACA10TE"
ts=2024-05-10T17:36:54.012Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdn duration=286.293623ms
ts=2024-05-10T17:36:54.012Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdn family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:54.325Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdo duration=312.568885ms
ts=2024-05-10T17:36:54.325Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdo family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:54.624Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdp duration=297.93094ms
ts=2024-05-10T17:36:54.624Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdp family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:54.906Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdq duration=280.451251ms
ts=2024-05-10T17:36:54.906Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdq family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:55.186Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 64" device=/dev/sdr
ts=2024-05-10T17:36:55.186Z caller=readjson.go:140 level=warn msg="The device error log contains records of errors" device=/dev/sdr
ts=2024-05-10T17:36:55.186Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdr duration=279.37659ms
ts=2024-05-10T17:36:55.186Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdr family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:55.468Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sds duration=280.076152ms
ts=2024-05-10T17:36:55.468Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sds family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:55.760Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdt duration=291.624193ms
ts=2024-05-10T17:36:55.760Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdt family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:56.050Z caller=readjson.go:69 level=warn msg="S.M.A.R.T. output reading" err="exit status 64" device=/dev/sdu
ts=2024-05-10T17:36:56.050Z caller=readjson.go:140 level=warn msg="The device error log contains records of errors" device=/dev/sdu
ts=2024-05-10T17:36:56.050Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdu duration=287.490146ms
ts=2024-05-10T17:36:56.050Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdu family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:56.319Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdv duration=267.760124ms
ts=2024-05-10T17:36:56.319Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdv family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:56.596Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdw duration=276.673288ms
ts=2024-05-10T17:36:56.597Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdw family="Toshiba X300" model="TOSHIBA HDWE160"
ts=2024-05-10T17:36:56.632Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/sdx duration=34.654125ms
ts=2024-05-10T17:36:56.632Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=sdx family=unknown model="WDC WUS721010ALE6L4"
ts=2024-05-10T17:36:56.653Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/nvme0 duration=19.965395ms
ts=2024-05-10T17:36:56.653Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=nvme0 family=unknown model=Micron_7300_MTFDHBE3T2TDG
ts=2024-05-10T17:36:56.674Z caller=readjson.go:74 level=debug msg="Collected S.M.A.R.T. json data" device=/dev/nvme1 duration=20.415503ms
ts=2024-05-10T17:36:56.674Z caller=smartctl.go:75 level=debug msg="Collecting metrics from" device=nvme1 family=unknown model=Micron_7300_MTFDHBE3T2TDG
ts=2024-05-10T17:36:56.674Z caller=tls_config.go:313 level=info msg="Listening on" address=192.168.100.26:9630
ts=2024-05-10T17:36:56.674Z caller=tls_config.go:316 level=info msg="TLS is disabled." http2=false address=192.168.100.26:9630
And the same server with the current master
smartctl_exporter master
@.***# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/sdc -d scsi # /dev/sdc, SCSI device
/dev/sdd -d scsi # /dev/sdd, SCSI device
/dev/sde -d scsi # /dev/sde, SCSI device
/dev/sdf -d scsi # /dev/sdf, SCSI device
/dev/sdg -d scsi # /dev/sdg, SCSI device
/dev/sdh -d scsi # /dev/sdh, SCSI device
/dev/sdi -d scsi # /dev/sdi, SCSI device
/dev/sdj -d scsi # /dev/sdj, SCSI device
/dev/sdk -d scsi # /dev/sdk, SCSI device
/dev/sdl -d scsi # /dev/sdl, SCSI device
/dev/sdm -d scsi # /dev/sdm, SCSI device
/dev/sdn -d scsi # /dev/sdn, SCSI device
/dev/sdo -d scsi # /dev/sdo, SCSI device
/dev/sdp -d scsi # /dev/sdp, SCSI device
/dev/sdq -d scsi # /dev/sdq, SCSI device
/dev/sdr -d scsi # /dev/sdr, SCSI device
/dev/sds -d scsi # /dev/sds, SCSI device
/dev/sdt -d scsi # /dev/sdt, SCSI device
/dev/sdu -d scsi # /dev/sdu, SCSI device
/dev/sdv -d scsi # /dev/sdv, SCSI device
/dev/sdw -d scsi # /dev/sdw, SCSI device
/dev/sdx -d scsi # /dev/sdx, SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
/dev/nvme1 -d nvme # /dev/nvme1, NVMe device
This change is not going to be released until it's fixed (preferred) or reverted.
@k0ste or @robbat2, would one of you mind being a maintainer to help test changes like this? I don't have a system to test advanced drive setups and #227 needs implementing before I can use the smartctl --json output others have provided. I would hate for this project to suffer due to my lack of hardware.
I can test live combos of VD, passthrough, SAS, SATA, NVMe drives. Only thing I don't have -- thankfully -- is NVMe VDs.
On May 11, 2024, at 17:55, David Randall @.***> wrote:
This change is not going to be released until it's fixed (preferred) or reverted.
@k0ste https://github.com/k0ste or @robbat2 https://github.com/robbat2, would one of you mind being a maintainer to help test changes like this? I don't have a system to test advanced drive setups and #227 https://github.com/prometheus-community/smartctl_exporter/issues/227 needs implementing before I can use the smartctl --json output others have provided. I would hate for this project to suffer due to my lack of hardware.
My suggestions are:
- Add a --megaraid option. With this option the new megaraid logic is enabled. Non-megaraid devices should be filtered out by the administrator via a regex filter, for example --smartctl.device-exclude="^/dev/sd[a-z]+"
- Add a --megaraid-only option. This tells the exporter to use only megaraid devices, which fixes the doubled devices.
- Add no new options, but apply the new device_type logic only to megaraid devices, so that the device_type option is omitted for smartmontools by default (the current exporter behavior). In this case the administrator should filter megaraid devices by hand via a regex filter. We actually use this filter in all our exporter deployments, to keep non-disk devices out of the device counter:
[root@infra1:/]# cat /etc/conf.d/smartctl_exporter
OPTIONS="--smartctl.interval=600s --smartctl.device-exclude=^/dev/bus/[0-9]+$ --web.listen-address=192.168.102.254:9633
- Another variant...
Could I suggest sticking with the 3rd solution? Because the whole design rationale of this PR seems to be that it does not require any new flags and automatically detects everything. The first and second options would break this. The third variant would only behave differently (compared to the old version) if smartctl does report a megaraid type device.
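As a rough illustration of the third variant, the sketch below builds a smartctl argument list and passes -d only when the scanned type is a megaraid one, leaving every other device on smartctl's auto-detection. It is written in Python for brevity and is not the exporter's code: the function name `smartctl_args` is hypothetical, and the `--json --all` flags are placeholders, since the exact flag set the exporter uses is not shown in this thread.

```python
def smartctl_args(name, dev_type=None):
    """Build a smartctl argument list for one device.

    Hypothetical rule for the third variant: pass -d only when the scanned
    type is a megaraid one; otherwise omit it so smartctl auto-detects the
    device type as before.
    """
    # Placeholder flags; the exporter's real flag set may differ.
    args = ["smartctl", "--json", "--all"]
    if dev_type and dev_type.startswith("megaraid,"):
        args += ["-d", dev_type]
    args.append(name)
    return args


if __name__ == "__main__":
    print(smartctl_args("/dev/sda", "scsi"))
    print(smartctl_args("/dev/bus/0", "megaraid,3"))
```

With this rule a host that reports no megaraid devices produces exactly the same commands as before, which matches the point that behavior should only change when smartctl reports a megaraid type.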
Hope this helped clear things up 🙏
Indeed, thank you.
My suggestions are:
- Adding the --megaraid option. With this option the new megaraid logic is enabled. The non-megaraid devices should be filtered by the administrator via a regex filter, for example --smartctl.device-exclude="^/dev/sd[a-z]+"
- Adding the --megaraid-only option. This tells the exporter to use only megaraid devices. This will fix the doubled devices.
What about systems with both, though? I have e.g. systems with multiple NVMe data drives and an LSI RAID HBA that mirrors two boot drives. Not a cost-effective config for sure, but I’m stuck with them for years yet.
- Not adding new options, but adding the new device_type logic only for megaraid devices, so that the device_type option is omitted for smartmontools by default (the current exporter behavior). In this case, the administrator should filter megaraid devices by hand via a regex filter. Actually, we use this filter for all exporter deployments, to avoid non-disk devices in the device counter.
@.***:/]# cat /etc/conf.d/smartctl_exporter
OPTIONS="--smartctl.interval=600s --smartctl.device-exclude=^/dev/bus/[0-9]+$ --web.listen-address=192.168.102.254:9633
Would that handle systems with more than one HBA?
- Another variant...
Could I suggest sticking with the 3rd solution? Because the whole design rationale of this PR seems to be that it does not require any new flags and automatically detects everything.
That would be ideal of course, but “do the desired thing in all kinds of scenarios” often is by far the hardest to implement.
So if we land on a solution that requires that I script up something to statically populate a list of devices to poll, then that isn’t the worst outcome.
The first and second options would break this. The third variant would only behave differently (compared to the old version) if smartctl does report a megaraid type device.
- Drives part of an LSI VD
- Drives on an LSI RAID HBA but passed through
- DHS, GHS, Ugood, Ubad, F (storcli.py)
- Drives hidden behind an LSI RAID HBA but not passed through or part of a VD
- Drives on a non-RAID HBA, e.g. plugged into a chipset SATA port
- NVMe drives (no HBA, thankfully)
- Dell BOSS-S1/S2/N1 drives (mvcli)
- Handful of systems with more than one HBA.
I have various combinations of the above to deal with. Solving for only one per system doesn’t do what I need. ymmv.
Hope this helped clear things up 🙏
Indeed, thank you.
Removed the smartctl.device parameter; smartctl.device-include/smartctl.device-exclude fully cover this.
Added device type detection; the detected type is used when scraping data.
May fix issues: https://github.com/prometheus-community/smartctl_exporter/issues/89 https://github.com/prometheus-community/smartctl_exporter/issues/26
This would also make PR https://github.com/prometheus-community/smartctl_exporter/pull/107 unnecessary.
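For readers wondering how a device type could be determined from discovery, here is a small Python sketch that parses the text form of `smartctl --scan` shown in this thread into name, type, and a label-safe identifier. It is an assumption-laden illustration, not the PR's implementation: `parse_scan` and the label derivation are hypothetical, chosen only so that the result resembles the `bus_0_megaraid_disk_00` style labels seen above.

```python
import re

# Lines in the format printed by `smartctl --scan` in this thread.
SCAN_OUTPUT = """\
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/nvme0 -d nvme # /dev/nvme0, NVMe device
"""

# <name> -d <type> # <info name>, <description>
SCAN_RE = re.compile(r"^(\S+)\s+-d\s+(\S+)\s+#\s+(.+?),\s+[^,]+$")


def parse_scan(text):
    """Parse scan lines into dicts with name, type, and a derived label."""
    devices = []
    for line in text.splitlines():
        m = SCAN_RE.match(line.strip())
        if not m:
            continue
        name, dev_type, info = m.groups()
        # Hypothetical label rule: drop the /dev/ prefix and collapse every
        # run of non-alphanumeric characters into a single underscore.
        short = info[len("/dev/"):] if info.startswith("/dev/") else info
        label = re.sub(r"[^A-Za-z0-9]+", "_", short).strip("_")
        devices.append({"name": name, "type": dev_type, "label": label})
    return devices


if __name__ == "__main__":
    for dev in parse_scan(SCAN_OUTPUT):
        print(dev)
```

Because each megaraid entry shares the same /dev/bus/0 name, the type (and a label derived from the info name) is what keeps the individual disks apart.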