nobody43 / zabbix-mini-IPMI

Disk and CPU temperature monitoring for Linux, FreeBSD and Windows. LLD, trapper.
The Unlicense

How to solve "No CPUs were found for temperature test (mini-IPMI)"? #80

Closed yeroslaviz closed 3 months ago

yeroslaviz commented 7 months ago

Describe the problem

After following the instructions and adding all the files, I'm getting the problem that no CPUs can be found using the parameters provided.

The discovery became supported, but I get the message:

No CPUs were found for temperature test (mini-IPMI)

When testing the installation locally, as described in the repository, this is what I get:

# /etc/zabbix/scripts/mini_ipmi_lmsensors.py get 10.33.0.158
{
    "data": []
}

# /etc/zabbix/scripts/mini_ipmi_lmsensors.py getverb 10.33.0.158
...
  Data sent to zabbix sender:

"10.33.0.158" mini.cpu.info[ConfigStatus] "NOGPUS, NOCPUS"
zabbix_sender [607789]: DEBUG: answer [{"response":"success","info":"processed: 0; failed: 1; total: 1; seconds spent: 0.000023"}]
Response from "10.33.0.65:10051": "processed: 0; failed: 1; total: 1; seconds spent: 0.000023"
sent: 1; skipped: 0; total: 1
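
The "processed: 0; failed: 1" counters in the response above are the quickest sign that the server did not accept the value. As a minimal sketch (the helper name `parse_sender_info` is hypothetical and not part of the project's scripts), the info string can be split into numeric fields for checking in a wrapper script:

```python
def parse_sender_info(info: str) -> dict:
    """Split a zabbix_sender 'info' string into a dict of numeric fields."""
    fields = {}
    for part in info.split(";"):
        key, _, value = part.partition(":")
        key = key.strip()
        value = value.strip()
        if not key:
            continue
        # "seconds spent" is fractional, the counters are integers.
        fields[key] = float(value) if "." in value else int(value)
    return fields


# Info string copied from the getverb output in this report.
info = "processed: 0; failed: 1; total: 1; seconds spent: 0.000023"
result = parse_sender_info(info)
if result["failed"] > 0:
    print(f"{result['failed']} of {result['total']} values were not processed")
```

This only reads the counters; it does not explain why a value was rejected.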

Additional context

# cat /sys/class/hwmon/hwmon0/temp1_input
24625
root@supercub017:~# sensors -u
bnxt_en-pci-6301
Adapter: PCI adapter
temp1:
  temp1_input: 70.000

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:
  temp1_input: 24.750
Tccd1:
  temp3_input: 25.000
Tccd3:
  temp5_input: 23.000
Tccd5:
  temp7_input: 24.250
Tccd7:
  temp9_input: 22.750

bnxt_en-pci-6300
Adapter: PCI adapter
temp1:
  temp1_input: 70.000

k10temp-pci-00cb
Adapter: PCI adapter
Tctl:
  temp1_input: 27.500
Tccd1:
  temp3_input: 25.750
Tccd3:
  temp5_input: 26.250
Tccd5:
  temp7_input: 26.500
Tccd7:
  temp9_input: 27.250
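
For reference, the raw hwmon value shown above (24625) is in millidegrees Celsius, which is why `sensors -u` reports values around 24 to 25 for the same chip. A small sketch of the conversion, assuming the standard `/sys/class/hwmon` layout (the function names are illustrative and not taken from the project):

```python
from pathlib import Path


def millideg_to_celsius(raw: str) -> float:
    """Convert a raw hwmon temp*_input value (millidegrees C) to degrees C."""
    return int(raw.strip()) / 1000


def read_hwmon_temps(root: str = "/sys/class/hwmon") -> dict:
    """Read every temp*_input file under root; unreadable files are skipped."""
    temps = {}
    for path in sorted(Path(root).glob("hwmon*/temp*_input")):
        try:
            temps[str(path)] = millideg_to_celsius(path.read_text())
        except (OSError, ValueError):
            continue
    return temps


# Value copied from the cat output above.
print(millideg_to_celsius("24625"))
```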

I also tried the solution you posted in #71, but I can't use it on my system because it prevents the agent from starting.

Any ideas?

Thanks

Assa

nobody43 commented 7 months ago

Hi. Are there two CPUs? What are their names?

yeroslaviz commented 7 months ago

Hi,

Not sure what you mean by name. How do I find this out?

We have an AMD machine with EPYC 7343 16-Core Processors. Yes, we have two.

yeroslaviz commented 7 months ago

Hi again,

Just wanted to mention that I have tried the template on a different server. There it could identify the CPUs and show me the temperature.

Any ideas yet why it doesn't work with the one above?

nobody43 commented 7 months ago

This is because the script does not support EPYC's sensors layout currently. I'm working on a fix.

yeroslaviz commented 7 months ago

Also, another question.

Can you maybe tell me why I can only measure the temperature of CPUs, but not of disks or motherboards? I know I don't have GPUs on the server, but what do I need to change in order for the disks to be readable?

I have a RAID setup, but I'm not sure where to change the script so that it can be read.

Thanks

nobody43 commented 7 months ago

What does # smartctl --scan output? How do you get SMART output from your drives with smartctl?

yeroslaviz commented 7 months ago

The output

# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
/dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], SCSI device
/dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], SCSI device
/dev/bus/0 -d megaraid,22 # /dev/bus/0 [megaraid_disk_22], SCSI device
/dev/bus/0 -d megaraid,23 # /dev/bus/0 [megaraid_disk_23], SCSI device

To get specific information from the drives, I need to include the megaraid argument, like this:

sudo smartctl -x /dev/sda -d megaraid,1
# or for all:
for i in {0..23}; do sudo smartctl -x /dev/sda -d megaraid,$i; done  # optionally append: >> ./OUTPUT

What do I need to modify in the script in order for the raid to be included?

yeroslaviz commented 7 months ago

BTW, would the template also work with agent2 of zabbix?

nobody43 commented 7 months ago

Does # smartctl -x /dev/bus/0 -d megaraid,1 output anything?

BTW, would the template also work with agent2 of zabbix?

It works if I'm not mistaken.

yeroslaviz commented 7 months ago

the command outputs a lot of information:

# smartctl -x /dev/bus/0 -d megaraid,1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-84-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               SEAGATE
Product:              DL900MP0136
Revision:             KT5C
Compliance:           SPC-4
User Capacity:        900,185,481,216 bytes [900 GB]
Logical block size:   512 bytes
Physical block size:  4096 bytes
Formatted with type 2 protection
8 bytes of protection information per logical block
LU is fully provisioned
Rotation Rate:        15000 rpm
Form Factor:          2.5 inches
Logical Unit id:      0x5000c500b8eaeb6f
Serial number:        WAG0CYD3
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Mon Feb 26 09:13:56 2024 UTC
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Disabled or Not Supported
Read Cache is:        Enabled
Writeback Cache is:   Disabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Grown defects during certification <not available>
Total blocks reassigned during format <not available>
Total new blocks reassigned <not available>
Power on minutes since format <not available>
Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C

Manufactured in week 27 of year 2018
Specified cycle count over device lifetime:  10000
Accumulated start-stop cycles:  36
Specified load-unload count over device lifetime:  300000
Accumulated load-unload cycles:  1982
Elements in grown defect list: 0

Vendor (Seagate Cache) information
  Blocks sent to initiator = 4198861016
  Blocks received from initiator = 903783016
  Blocks read from cache and sent to initiator = 865643535
  Number of read and write commands whose size <= segment size = 43965008
  Number of read and write commands whose size > segment size = 54158

Vendor (Seagate/Hitachi) factory information
  number of hours powered up = 46872.50
  number of minutes until next internal SMART test = 4

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   256764231        0         0  256764231          0      36798.812           0
write:         0        0         0         0          0       5023.806           0
verify: 2953302371        0         0  2953302371          0     244557.470           0

Non-medium error count:      838

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background short  Completed                  96   14877                 - [-   -    -]
# 2  Reserved(7)       Completed                  64      24                 - [-   -    -]
# 3  Background short  Completed                  96      22                 - [-   -    -]

Long (extended) Self-test duration: 5100 seconds [85.0 minutes]

Background scan results log
  Status: waiting until BMS interval timer expires
    Accumulated power on time, hours:minutes 46872:30 [2812350 minutes]
    Number of background scans performed: 333,  scan progress: 0.00%
    Number of background medium scans performed: 333

Protocol Specific port log page for SAS SSP
relative target port id = 1
  generation code = 0
  number of phys = 1
  phy identifier = 0
    attached device type: expander device
    attached reason: SMP phy control function
    reason: loss of dword synchronization
    negotiated logical link rate: phy enabled; 12 Gbps
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=1
    SAS address = 0x5000c500b8eaeb6d
    attached SAS address = 0x500056b33f5590ff
    attached phy identifier = 1
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 2
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 2
     Phy reset problem count: 0
relative target port id = 2
  generation code = 0
  number of phys = 1
  phy identifier = 1
    attached device type: no device attached
    attached reason: unknown
    reason: unknown
    negotiated logical link rate: phy enabled; unknown
    attached initiator port: ssp=0 stp=0 smp=0
    attached target port: ssp=0 stp=0 smp=0
    SAS address = 0x5000c500b8eaeb6e
    attached SAS address = 0x0
    attached phy identifier = 0
    Invalid DWORD count = 0
    Running disparity error count = 0
    Loss of DWORD synchronization = 0
    Phy reset problem = 0
    Phy event descriptors:
     Invalid word count: 0
     Running disparity error count: 0
     Loss of dword synchronization count: 0
     Phy reset problem count: 0

yeroslaviz commented 7 months ago

BTW, would the template also work with agent2 of zabbix?

It works if I'm not mistaken.

Would this mean I just need to change the path to the agent config in the mini_ipmi_lmsensors.py and mini_ipmi_smartctl.py scripts?

from:

agentConf_LINUX    = r'/etc/zabbix/zabbix_agentd.conf'

to

agentConf_LINUX    = r'/etc/zabbix/zabbix_agent2.conf'

and restart the services?
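
If the same script copies are deployed to hosts running either agent, the path could be chosen at run time instead of edited by hand. This is only a sketch under that assumption: `pick_agent_conf` and `AGENT_CONF_CANDIDATES` are hypothetical names, not part of the project's scripts, and the two paths are the ones quoted above.

```python
import os

# Candidate paths taken from this thread; adjust for your distribution.
AGENT_CONF_CANDIDATES = (
    r'/etc/zabbix/zabbix_agent2.conf',
    r'/etc/zabbix/zabbix_agentd.conf',
)


def pick_agent_conf(candidates=AGENT_CONF_CANDIDATES, exists=os.path.isfile):
    """Return the first candidate config path that exists, or None."""
    for path in candidates:
        if exists(path):
            return path
    return None


# None here means neither file was found on this host.
agentConf_LINUX = pick_agent_conf()
print(agentConf_LINUX)
```

The `exists` parameter is there only so the selection logic can be checked without touching the real filesystem.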

nobody43 commented 7 months ago

Ah, it's a SAS device. Not all fields are supported for them. Do you still have some items gathered in your Zabbix web interface?
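
To illustrate the SAS layout: the temperature in the output above appears as plain "Current Drive Temperature" and "Drive Trip Temperature" lines rather than as a SMART attribute table. A minimal sketch of extracting them follows; the regexes and the function name `parse_sas_temps` are assumptions written for this example, not the project's actual parsing code.

```python
import re

# Patterns written against the SAS output pasted in this thread.
SAS_TEMP_RE = re.compile(r'^Current Drive Temperature:\s+(\d+)\s+C', re.MULTILINE)
SAS_TRIP_RE = re.compile(r'^Drive Trip Temperature:\s+(\d+)\s+C', re.MULTILINE)


def parse_sas_temps(smartctl_output: str):
    """Return (current, trip) temperatures in C, or None for a missing field."""
    cur = SAS_TEMP_RE.search(smartctl_output)
    trip = SAS_TRIP_RE.search(smartctl_output)
    return (
        int(cur.group(1)) if cur else None,
        int(trip.group(1)) if trip else None,
    )


# Excerpt copied from the smartctl -x output above.
sample = """SMART Health Status: OK

Current Drive Temperature:     28 C
Drive Trip Temperature:        60 C
"""
current, trip = parse_sas_temps(sample)
print(current, trip)
```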

and restart the services?

That's correct.

Edit: You might need to also change binary path.

Regarding SAS, have you changed this setting?

yeroslaviz commented 7 months ago

Edit: You might need to also change binary path.

I'm not sure what you mean. I can execute the smartctl command:

# which smartctl
/usr/sbin/smartctl

# smartctl -h 
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-84-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

Usage: smartctl [options] device

============================================ SHOW INFORMATION OPTIONS =====

...

So do I need to change the path to the bin file?

Regarding SAS, have you changed this setting?

I set this parameter to True

It now works also with the agent2, at least for the CPU temperature. Do I need to change anything about the Raid in the scripts?

Any news about the EPYC's sensors?

thanks for all the help

nobody43 commented 7 months ago

Do I need to change anything about the Raid in the scripts?

I'm puzzled why it doesn't work automatically after setting isCheckSAS. Try providing manual configuration here: https://github.com/nobody43/zabbix-mini-IPMI/blob/32246a4ced7ff069b1f4eee749445f03f8ba8fd6/mini_ipmi_smartctl.py#L47 I will solve this inconsistency during future refactoring.

Any news about the EPYC's sensors?

Always occupied, sorry! Try this fix: https://github.com/nobody43/zabbix-mini-IPMI/blob/01c8b27e77545ce501c67fad15d2339a00782997/Linux/mini_ipmi_lmsensors.py#L41

yeroslaviz commented 7 months ago

It might be a good idea to split this issue into two, as it covers two different topics: the EPYC sensors and the RAID structure.

yeroslaviz commented 7 months ago

Do I need to change anything about the Raid in the scripts?

I'm puzzled why it doesn't work automatically after setting isCheckSAS. Try providing manual configuration here:

https://github.com/nobody43/zabbix-mini-IPMI/blob/32246a4ced7ff069b1f4eee749445f03f8ba8fd6/mini_ipmi_smartctl.py#L47

For the Raid manual configuration - Do you mean, I should add all the raid HDDs into the line: e.g.

diskDevsManual = ['/dev/sda -d sat+megaraid,0', '/dev/sda -d sat+megaraid,1', ... '/dev/sda -d sat+megaraid,4', '/dev/sda -d sat+megaraid,5', ... '/dev/sda -d sat+megaraid,23']

Like the output from the command smartctl --scan I gave you above?

I will solve this inconsistency during future refactoring.

Not sure what it means 🤔


When executing the command smartctl --scan on this server I see

# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/sdb -d scsi # /dev/sdb, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device
/dev/bus/0 -d megaraid,7 # /dev/bus/0 [megaraid_disk_07], SCSI device
/dev/bus/0 -d megaraid,8 # /dev/bus/0 [megaraid_disk_08], SCSI device
/dev/bus/0 -d megaraid,9 # /dev/bus/0 [megaraid_disk_09], SCSI device
/dev/bus/0 -d megaraid,10 # /dev/bus/0 [megaraid_disk_10], SCSI device
/dev/bus/0 -d megaraid,11 # /dev/bus/0 [megaraid_disk_11], SCSI device
/dev/bus/0 -d megaraid,12 # /dev/bus/0 [megaraid_disk_12], SCSI device
/dev/bus/0 -d megaraid,13 # /dev/bus/0 [megaraid_disk_13], SCSI device
/dev/bus/0 -d megaraid,14 # /dev/bus/0 [megaraid_disk_14], SCSI device
/dev/bus/0 -d megaraid,22 # /dev/bus/0 [megaraid_disk_22], SCSI device
/dev/bus/0 -d megaraid,23 # /dev/bus/0 [megaraid_disk_23], SCSI device

On this server I don't see any disk temperature readings, BUT

When executing the same command on the server with the EPYC processor I see

# smartctl --scan
/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
/dev/bus/0 -d megaraid,2 # /dev/bus/0 [megaraid_disk_02], SCSI device
/dev/bus/0 -d megaraid,3 # /dev/bus/0 [megaraid_disk_03], SCSI device
/dev/bus/0 -d megaraid,4 # /dev/bus/0 [megaraid_disk_04], SCSI device
/dev/bus/0 -d megaraid,5 # /dev/bus/0 [megaraid_disk_05], SCSI device
/dev/bus/0 -d megaraid,6 # /dev/bus/0 [megaraid_disk_06], SCSI device

Here, I can see the readings of the disks temperature.

yeroslaviz commented 7 months ago

Any news about the EPYC's sensors?

Always occupied, sorry! Try this fix:

https://github.com/nobody43/zabbix-mini-IPMI/blob/01c8b27e77545ce501c67fad15d2339a00782997/Linux/mini_ipmi_lmsensors.py#L41

This applies to a different server with an EPYC processor.

I added the line to the list of regular expressions:

CORES_REGEXPS = (
    (r'Core(?:\s+)?(\d+):\n\s+temp\d+_input:\s+(\d+)'),
    (r'Core(\d+)\s+Temp:\n\s+temp\d+_input:\s+(\d+)'),
    (r'Tdie:\n\s+temp(\d+)_input:\s+(\d+)'),
    (r'Tccd(\d+):\n\s+temp\d+_input:\s+(\d+)'),
    (r'k\d+temp-pci-\w+\nAdapter:\s+PCI\s+adapter\ntemp(\d+):\n\s+temp\d+_input:\s+(\d+)'),
)

Now I can see the readings from the CPU and disks here. really, thanks a lot for all the help.
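
To show what the Tccd pattern from the list above picks up, here is a small standalone sketch that applies it to one k10temp block from the `sensors -u` output posted earlier in this thread. It uses only that single pattern and plain `re.findall`; how the script itself combines the patterns or tells the two sockets apart is not shown here.

```python
import re

# The Tccd pattern from the list above, written as a raw string.
TCCD_RE = re.compile(r'Tccd(\d+):\n\s+temp\d+_input:\s+(\d+)')

# One k10temp block copied from the sensors -u output earlier in the thread.
sample = """k10temp-pci-00c3
Adapter: PCI adapter
Tctl:
  temp1_input: 24.750
Tccd1:
  temp3_input: 25.000
Tccd3:
  temp5_input: 23.000
Tccd5:
  temp7_input: 24.250
Tccd7:
  temp9_input: 22.750
"""

# Each match is (CCD number, integer part of the temperature).
matches = TCCD_RE.findall(sample)
for ccd, temp in matches:
    print(f"Tccd{ccd}: {temp} C")
```

Note that the second group captures only the digits before the decimal point, and that the Tctl line is not matched by this pattern.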

Is there a way to also see the temperature of the motherboard? What do I need to change for that?

nobody43 commented 7 months ago

For the Raid manual configuration - Do you mean, I should add all the raid HDDs into the line: e.g.

Try your output first, with fewer drives: diskDevsManual = ['/dev/bus/0 -d megaraid,0', '/dev/bus/0 -d megaraid,1']
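
Rather than typing the entries by hand, the list can be derived from the `smartctl --scan` output. The following is a sketch only: `scan_to_manual` is a hypothetical helper, not part of mini_ipmi_smartctl.py, and it assumes each scan line has the form `<device> -d <type> # comment` as in the output posted above.

```python
def scan_to_manual(scan_output: str, dev_type_prefix: str = "megaraid") -> list:
    """Build diskDevsManual-style entries from `smartctl --scan` output."""
    entries = []
    for line in scan_output.splitlines():
        spec = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not spec:
            continue
        parts = spec.split()
        # Expected shape: <device> -d <type>
        if len(parts) == 3 and parts[1] == "-d" and parts[2].startswith(dev_type_prefix):
            entries.append(spec)
    return entries


# First lines of the scan output posted in this thread.
scan = """/dev/sda -d scsi # /dev/sda, SCSI device
/dev/bus/0 -d megaraid,0 # /dev/bus/0 [megaraid_disk_00], SCSI device
/dev/bus/0 -d megaraid,1 # /dev/bus/0 [megaraid_disk_01], SCSI device
"""
diskDevsManual = scan_to_manual(scan)
print(diskDevsManual)
```

The resulting list can then be pasted into the diskDevsManual line of the script.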

Not sure what it means 🤔

A proper automatic fix won't happen soon.

Now I can see the readings from the CPU and disks here. really, thanks a lot for all the help.

Great! Congratulations!

Is there a way to also see the temperature of the motherboard?

Are you talking about bnxt_en-pci-*? Are you sure 70C is a reliable reading?

yeroslaviz commented 6 months ago

I don't know about the motherboard, but I wouldn't be surprised if it is correct. They get very hot sometimes.

I can now see the disk temperatures, though it is still the default diskDevsManual = [] setting. I can't say why it didn't work before. However, I'm not sure it shows the correct temperature, as it always stays at 29 degrees on both servers, with no fluctuations at all.

nobody43 commented 5 months ago

https://github.com/nobody43/zabbix-mini-IPMI/pull/83 Have you been able to try this?

I've looked into bnxt_en-pci-*: these are network adapters. It appears your sensors output does not have motherboard sensors. sensors-detect might help with that.

nobody43 commented 3 months ago

Hope that's solved.