munin-monitoring / contrib

Contributed stuff for munin (plugins, tools, etc...)
http://munin-monitoring.org

disk/nvme broken after linux-nvme update #1373

Closed: heeplr closed this issue 1 year ago

heeplr commented 1 year ago

Recently I have been getting lots of errors like this one: 2023/05/04 10:10:04 [ERROR] In RRD: Error updating /var/lib/munin/server/server-nvme_bytes-SN__dev_ng0n1_r-d.rrd: /var/lib/munin/server/server-nvme_bytes-SN__dev_ng0n1_r-d.rrd: not a simple signed integer: '83404309 (42.70 TB)'

I suppose the human-readable part in parentheses was added to the output recently and now the parsing fails?
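To illustrate the suspected failure, here is a minimal Python sketch (not the plugin's actual code; the helper names `is_simple_int` and `leading_int` are hypothetical). It checks whether a value is the kind of plain integer the RRD error message asks for, and shows one way to keep only the leading number when a parenthesised suffix is present.

```python
import re

# The error message asks for a "simple signed integer": an optional sign
# followed by digits only. This pattern is an approximation of that rule.
SIMPLE_INT = re.compile(r"[+-]?\d+")


def is_simple_int(value: str) -> bool:
    """Return True if the whole value is a plain, optionally signed integer."""
    return SIMPLE_INT.fullmatch(value.strip()) is not None


def leading_int(value: str) -> int:
    """Return the integer at the start of value, ignoring anything after it.

    Note: this does not handle thousands separators; a value like
    "1.234" would be cut off at the first separator.
    """
    match = re.match(r"\s*([+-]?\d+)", value)
    if match is None:
        raise ValueError(f"no leading integer in {value!r}")
    return int(match.group(1))


raw = "83404309 (42.70 TB)"  # value quoted in the error message
print(is_simple_int(raw))    # the suffix makes this fail the check
print(leading_int(raw))      # the number before the parenthesis
```

Stripping the suffix only works if the number itself is printed without separators, so a real fix would likely need to look at the full command output as well.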

Highlighting @kjetilho @usrflo and @ap-wtioit since they are the contributors. I hope that's ok.

ap-wtioit commented 1 year ago

@heeplr can you post the output of sudo nvme list and sudo nvme --version on your system? The nvme plugin parses the output of nvme list. We do not need to see the serial number column, I guess, as long as it contains only [A-Z0-9]+.

It should look something like this for sudo nvme list:

Node             SN                   Model                                    Namespace Usage                      Format           FW Rev  
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     REDACTED             KINGSTON SA2000M81000G                   1         780,76  GB /   1,00  TB    512   B +  0 B   S5Z42105
/dev/nvme1n1     REDACTED             Samsung SSD 970 EVO 250GB                1         213,71  MB / 250,06  GB    512   B +  0 B   1B2QEXE7
/dev/nvme2n1     REDACTED             Viper M.2 VPN100                         1         512,11  GB / 512,11  GB    512   B +  0 B   ECFM32.1

and sudo nvme --version

nvme version 1.5

heeplr commented 1 year ago

sure:

$ nvme list
Node                  Generic               SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- --------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          /dev/ng0n1            XXXXXXXXXX           TS128GMTE110S                            1         128.04  GB / 128.04  GB    512   B +  0 B   S1111B1L

$ nvme version
nvme version 2.3 (git 2.3)
libnvme version 1.3 (git 1.3)

ap-wtioit commented 1 year ago

Thanks. It seems that newer versions of nvme list print an additional column (Generic) that we need to handle.
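One way to cope with added columns is to read them by header name instead of by position. The Python sketch below is only an illustration under assumptions, not the plugin's code: `parse_nvme_list` is a hypothetical helper, and it assumes the dashed line under the header always marks the exact width of each column. The RRD file name in the error report contains `_dev_ng0n1` where a serial number would be expected, which would be consistent with a positional parser picking up the Generic column instead of SN.

```python
import re


def parse_nvme_list(text: str) -> list[dict[str, str]]:
    """Parse `nvme list` style output into one dict per device.

    Column boundaries are taken from the dashed ruler line, so the result
    is keyed by header name and does not depend on column order or count.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if len(lines) < 2:
        return []
    header, ruler, rows = lines[0], lines[1], lines[2:]
    # Each run of dashes gives the start and end offset of one column.
    spans = [m.span() for m in re.finditer(r"-+", ruler)]
    names = [header[start:end].strip() for start, end in spans]
    return [
        {name: row[start:end].strip() for name, (start, end) in zip(names, spans)}
        for row in rows
    ]


# First four columns of the newer output posted above; the remaining
# columns are left out here to keep the lines short.
SAMPLE = """\
Node                  Generic               SN                   Model
--------------------- --------------------- -------------------- ----------------------------------------
/dev/nvme0n1          /dev/ng0n1            XXXXXXXXXX           TS128GMTE110S
"""

for device in parse_nvme_list(SAMPLE):
    print(device["Node"], device["SN"])
```

Looking values up by name means the same code reads the older output without a Generic column as well, as long as the ruler line is present and the rows are aligned to it.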

May I also ask you to post the output of nvme smart-log /dev/nvme0n1? The plugin uses it later to read the bytes read and written, which is where the 83404309 (42.70 TB) would come from, I guess. For us, sudo nvme smart-log /dev/nvme0n1 looks like this:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 32 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 32%
data_units_read                     : 340.942.775
data_units_written                  : 668.066.040
host_read_commands                  : 3.967.315.592
host_write_commands                 : 5.752.338.519
controller_busy_time                : 80.133
power_cycles                        : 12
power_on_hours                      : 25.204
unsafe_shutdowns                    : 6
media_errors                        : 0
num_err_log_entries                 : 0
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

But I guess your values look something like the 83404309 (42.70 TB) in your log.
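Both number styles seen so far, the dotted `340.942.775` above and the `83404309 (42.70 TB)` from the error message, can be reduced to plain integers before they are handed to RRD. The following Python sketch shows one possible normalisation; it is not the plugin's code, and `parse_smart_log` is a hypothetical helper. It assumes that `.` and `,` between digit groups are thousands separators, which would misread a genuine decimal such as `1.500`. Labels are lowercased with spaces turned into underscores so that `data_units_read` and `Data Units Read` (the spelling in the newer output later in this thread) map to the same key.

```python
import re

# NVMe reports data units in thousands of 512-byte blocks,
# so one data unit corresponds to 512,000 bytes.
DATA_UNIT_BYTES = 512 * 1000


def parse_smart_log(text: str) -> dict[str, int]:
    """Collect the purely numeric fields of `nvme smart-log` output."""
    values: dict[str, int] = {}
    for line in text.splitlines():
        label, sep, raw = line.partition(":")
        if not sep:
            continue  # not a "label : value" line
        key = re.sub(r"\s+", "_", label.strip().lower())
        # Drop a human-readable suffix such as "(42.70 TB)".
        raw = raw.split("(", 1)[0].strip()
        # Assumption: "." or "," followed by exactly three digits is a
        # thousands separator, as in "340.942.775".
        digits = re.sub(r"(?<=\d)[.,](?=\d{3}\b)", "", raw)
        if re.fullmatch(r"\d+", digits):
            values[key] = int(digits)
        # Fields with units (for example "32 C" or "100%") are skipped.
    return values


older = "data_units_read                     : 340.942.775"
newer = "Data Units Read                         : 83404309 (42.70 TB)"

for sample in (older, newer):
    units = parse_smart_log(sample)["data_units_read"]
    print(units, units * DATA_UNIT_BYTES)
```

Multiplying by `DATA_UNIT_BYTES` gives the byte count, which for the second sample matches the 42.70 TB that nvme prints in parentheses.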

heeplr commented 1 year ago

Here you go:

$ nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 38°C (311 Kelvin)
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 11%
endurance group critical warning summary: 0
Data Units Read                         : 83404387 (42.70 TB)
Data Units Written                      : 26389074 (13.51 TB)
host_read_commands                      : 760160366
host_write_commands                     : 1223644607
controller_busy_time                    : 17651
power_cycles                            : 22
power_on_hours                          : 10032
unsafe_shutdowns                        : 11
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

heeplr commented 1 year ago

Is there an ETA for a fix?

kenyon commented 1 year ago

@heeplr there will be a fix when someone makes a pull request that fixes it.

ap-wtioit commented 1 year ago

Note: this is happening to us as well on a new Debian bookworm server (and an Ubuntu 23.04 workstation). I should be able to propose a fix soon (not promising anything).