nobody43 / zabbix-mini-IPMI

Disk and CPU temperature monitoring for Linux, FreeBSD and Windows. LLD, trapper.
The Unlicense

Doesn't identify nvme drives or report temps correctly #28

Closed PARitter closed 5 years ago

PARitter commented 5 years ago

Part 1: smartctl works correctly with most NVMe drives, but "smartctl --scan" won't return them, so mini-IPMI does not detect or report them. "smartctl --scan -d nvme" lists them correctly, but does not list any other drives. So enumerating NVMe, SCSI and ATA drives requires two separate calls to "smartctl --scan".
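A minimal sketch of the two-call enumeration described in Part 1, not the project's actual code. The function names (`parse_scan`, `merge_scans`, `scan_all_drives`) are hypothetical, and it assumes each scan line has the usual `<device> -d <type> # comment` shape:

```python
import subprocess


def parse_scan(output):
    """Parse `smartctl --scan` output into (device, type) tuples."""
    devices = []
    for line in output.splitlines():
        # Drop the trailing "# ..." comment that follows the device spec.
        fields = line.split('#', 1)[0].split()
        # Assumed shape of the remaining fields: <device> -d <type>
        if len(fields) >= 3 and fields[1] == '-d':
            devices.append((fields[0], fields[2]))
    return devices


def merge_scans(*outputs):
    """Combine several scan outputs, keeping the first entry per device path."""
    seen = set()
    merged = []
    for output in outputs:
        for device, dev_type in parse_scan(output):
            if device not in seen:
                seen.add(device)
                merged.append((device, dev_type))
    return merged


def scan_all_drives(smartctl='smartctl'):
    """Run the plain scan and the NVMe scan, then merge the results.

    Requires smartctl to be installed; usually needs elevated privileges.
    """
    outputs = []
    for extra in ([], ['-d', 'nvme']):
        result = subprocess.run([smartctl, '--scan'] + extra,
                                stdout=subprocess.PIPE,
                                universal_newlines=True)
        outputs.append(result.stdout)
    return merge_scans(*outputs)
```

De-duplicating by device path keeps the result stable on smartmontools versions where the plain scan already includes NVMe devices.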

Part 2: If you list NVMe drives in diskListManual, it reports "no temp" for the drive because smartctl's display format for NVMe disks is slightly different: "Temperature: 42 Celsius"
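As a rough illustration of Part 2, the NVMe temperature line could be matched with a regular expression like the one below. This is a sketch, not the script's real parser; `NVME_TEMP` and `nvme_temperature` are hypothetical names, and it assumes the field starts at the beginning of a line, as in normal smartctl output:

```python
import re

# Matches the "Temperature: 42 Celsius" line only. Anchoring at line start
# and requiring the colon right after "Temperature" skips fields such as
# "Temperature Sensor 1" or "Warning Comp. Temperature Time".
NVME_TEMP = re.compile(r'^Temperature:\s+(\d+)\s+Celsius', re.MULTILINE)


def nvme_temperature(smartctl_output):
    """Return the NVMe temperature in Celsius, or None if it is absent."""
    match = NVME_TEMP.search(smartctl_output)
    return int(match.group(1)) if match else None
```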

I can do a pull request to fix part 2 (it's simple), but I haven't got a generic fix for part 1.

nobody43 commented 5 years ago

Thank you for the report. 1: I think it's better to provide a setting that enables the -d nvme call, as different versions of smartmontools behave differently here: https://bugs.launchpad.net/ubuntu/+source/smartmontools/+bug/1685332 This, like many other issues in this repo, requires a major rewrite.

2: I'm looking forward to pull requests, but not in this case: it will be rewritten (rather) soon. I'm swimming in technical debt right now.

Related: https://github.com/nobodysu/zabbix-smartmontools/issues/15

nobody43 commented 5 years ago

Part 2 is addressed in https://github.com/nobodysu/zabbix-mini-IPMI/commit/8d839b8f7c2b1ccdc4ba7832ca89e4e0d95ab8e6 (only manually listed NVMe drives will work at this time)

nobody43 commented 5 years ago

@PARitter @rmalenko Any chance you could test it? https://github.com/nobodysu/zabbix-mini-IPMI/tree/refactoring_and_nvme Two scripts and template.

PARitter commented 5 years ago

Tested on:
Ubuntu 18.10 / smartmontools 6.6 / zabbix-agent 4.05: works great!
Windows 10 Pro 1809 / smartmontools 7.0 / zabbix-agent 4.0.0 (x64): works great!

I'm going to try it on Debian Stretch / ARM64 (dietpi) a bit later. I've never run it there before, but it should be an interesting test.

Thank you.

nobody43 commented 5 years ago

That's great, thanks. Looking forward to the ARM test results. Also, can you provide the -A -i, -x and -a outputs for an NVMe drive (redacting serials ofc)? That would be pretty helpful.

rmalenko commented 5 years ago

@nobodysu Sorry, I haven't had a chance to test it. However, I wrote my own Zabbix check for NVMe disks only: https://github.com/rmalenko/zabbix

PARitter commented 5 years ago

Per your request...though HTML is messing up the formatting...

smartctl -A -i /dev/nvme1

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-15-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: KXG50PNV2T04 NVMe TOSHIBA 2048GB
Serial Number: ---------------
Firmware Version: AFDA4103
PCI Vendor/Subsystem ID: 0x1179
IEEE OUI Identifier: 0x00080d
Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity: 0
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Sat Mar 2 17:49:27 2019 UTC

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 7,830 [4.00 GB]
Data Units Written: 4,004,405 [2.05 TB]
Host Read Commands: 707,342
Host Write Commands: 1,766,738
Controller Busy Time: 39
Power Cycles: 50
Power On Hours: 989
Unsafe Shutdowns: 28
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 18
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 32 Celsius

smartctl -x /dev/nvme1

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-15-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: KXG50PNV2T04 NVMe TOSHIBA 2048GB
Serial Number: ---------------
Firmware Version: AFDA4103
PCI Vendor/Subsystem ID: 0x1179
IEEE OUI Identifier: 0x00080d
Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity: 0
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Sat Mar 2 17:55:32 2019 UTC
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Other
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Other
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 78 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op      Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0  +    6.00W      -    -  0  0  0  0       0      0
 1  +    2.40W      -    -  1  1  1  1       0      0
 2  +    1.90W      -    -  2  2  2  2       0      0
 3  -  0.0500W      -    -  3  3  3  3    1500   1500
 4  -  0.0030W      -    -  4  4  4  4   50000  90000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0   +  512      0        2
 1   - 4096      0        1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 7,830 [4.00 GB]
Data Units Written: 4,004,407 [2.05 TB]
Host Read Commands: 707,342
Host Write Commands: 1,766,911
Controller Busy Time: 39
Power Cycles: 50
Power On Hours: 990
Unsafe Shutdowns: 28
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 18
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 32 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged

smartctl -a /dev/nvme1

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.18.0-15-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: KXG50PNV2T04 NVMe TOSHIBA 2048GB
Serial Number: ----------------
Firmware Version: AFDA4103
PCI Vendor/Subsystem ID: 0x1179
IEEE OUI Identifier: 0x00080d
Total NVM Capacity: 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity: 0
Controller ID: 0
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Sat Mar 2 17:56:40 2019 UTC
Firmware Updates (0x14): 2 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Other
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Other
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 78 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Namespace 1 Features (0x02): NA_Fields

Supported Power States
St Op      Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
 0  +    6.00W      -    -  0  0  0  0       0      0
 1  +    2.40W      -    -  1  1  1  1       0      0
 2  +    1.90W      -    -  2  2  2  2       0      0
 3  -  0.0500W      -    -  3  3  3  3    1500   1500
 4  -  0.0030W      -    -  4  4  4  4   50000  90000

Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
 0   +  512      0        2
 1   - 4096      0        1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0xffffffff)
Critical Warning: 0x00
Temperature: 32 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 7,830 [4.00 GB]
Data Units Written: 4,004,407 [2.05 TB]
Host Read Commands: 707,342
Host Write Commands: 1,766,911
Controller Busy Time: 39
Power Cycles: 50
Power On Hours: 990
Unsafe Shutdowns: 28
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 18
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 32 Celsius

Error Information (NVMe Log 0x01, max 128 entries)
No Errors Logged
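For anyone consuming dumps like these, the colon-separated fields could be collected into a dictionary roughly as follows. This is only a sketch: `parse_fields` is a hypothetical helper, not part of mini-IPMI, and it assumes one field per line as smartctl prints it on a terminal.

```python
def parse_fields(output):
    """Collect "Name: value" lines from smartctl output into a dict.

    Lines without a colon (banners, table rows, "No Errors Logged") and
    lines with nothing after the colon are skipped. The first occurrence
    of a field name wins.
    """
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values that contain colons
        # (such as the local time) stay intact.
        key, sep, value = line.partition(':')
        if sep and value.strip():
            fields.setdefault(key.strip(), value.strip())
    return fields
```

With the -x or -a output above, `parse_fields(text)["Warning Comp. Temp. Threshold"]` would give the string "78 Celsius", which could feed a trigger threshold alongside the current temperature.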

PARitter commented 5 years ago

Putting it up on the ARM SoC will have to wait until I have a bit more time. Two problems:

nobody43 commented 5 years ago

Fixed in https://github.com/nobodysu/zabbix-mini-IPMI/commit/d8aea9faef6da66a2f899cbcce97b118037694ff