prometheus / node_exporter

Exporter for machine metrics
https://prometheus.io/
Apache License 2.0

Unraid OS/md driver causes console error; error parsing mdstatus: error parsing mdstat "/proc/mdstat" #2642

Open codefaux opened 1 year ago

codefaux commented 1 year ago

Host operating system:

Linux Tower 5.19.17-Unraid #2 SMP PREEMPT_DYNAMIC Wed Nov 2 11:54:15 PDT 2022 x86_64 Intel(R) Xeon(R) CPU           X5680  @ 3.33GHz GenuineIntel GNU/Linux

FWIW, it's the Unraid 6.11.x series.

node_exporter version:

node_exporter, version 1.5.0 (branch: HEAD, revision: 1b48970ffcf5630534fb00bb0687d73c66d1c959)
  build user:       root@f7d6f307fe40
  build date:       20230226-16:34:47
  go version:       go1.19.1
  platform:         linux/amd64

node_exporter command line flags:

None

node_exporter log output:

Snipping all except this relevant line -- I'm pretty sure the rest isn't significant, since this seems to be a parsing issue with my system's proc output. Please correct me if I'm wrong and I'll paste the entire output.

ts=2023-03-25T22:09:10.616Z caller=collector.go:169 level=error msg="collector failed" name=mdadm duration_seconds=0.049607847 err="error parsing mdstatus: error parsing mdstat \"/proc/mdstat\": not enough fields in mdline (expected at least 3): sbName=/boot/config/super.dat"

Are you running node_exporter in Docker?

Nope, direct from terminal as foreground process for troubleshooting purposes

What did you do that exposed an error?

Installed the exporter via an OS plugin (Prometheus Node Exporter), noticed errors in the log output, then ran the executable directly to narrow down the replication steps.

What did you expect to see?

Fewer errors, probably zero

What did you see instead?

An entire error, every time Prometheus polled it

I'm trying to be humorous with the last two; ideally it doesn't just tick someone off.

Anyway, since it'll come up: I want to share mdstat, but it's full to the brim with in-warranty serial numbers and I've been cautioned against exposing those to the public, so I've removed the lines with valid disk data. I expect the one causing trouble is number 29, rdevStatus.29=DISK_NP_DSBL -- I did not edit any lines in 29's block. I don't know what that disk is; my OS (Unraid) manages md and related systems for me.

sbName=/boot/config/super.dat
sbVersion=2.9.13
sbCreated=1621491969
sbUpdated=1678709639
sbEvents=316
sbState=1
sbNumDisks=30
sbSynced=1678683601
sbSynced2=1678709639
sbSyncErrs=0
sbSyncExit=0
mdVersion=2.9.25
mdState=STARTED
mdNumDisks=29
mdNumDisabled=1
mdNumReplaced=0
mdNumInvalid=1
mdNumMissing=0
mdNumWrong=0
mdNumNew=0
mdSwapP=0
mdSwapQ=0
mdResyncAction=check P
mdResyncSize=15625879500
mdResyncCorr=1
mdResync=0
mdResyncPos=0
mdResyncDt=0
mdResyncDb=0
diskNumber.0=0
diskName.0=
diskSize.0=15625879500
diskState.0=7
rdevNumber.0=0
rdevStatus.0=DISK_OK
rdevName.0=sdo
rdevOffset.0=64
rdevSize.0=15625879500
rdevReads.0=4360982638
rdevWrites.0=2218682115
rdevNumErrors.0=0
diskNumber.1=1
diskName.1=md1
diskSize.1=2930266532
diskState.1=7
rdevNumber.1=1
rdevStatus.1=DISK_OK
rdevName.1=sdd
rdevOffset.1=64
rdevSize.1=2930266532
rdevReads.1=869989188
rdevWrites.1=9419909
rdevNumErrors.1=0
diskNumber.2=2
diskName.2=md2
diskSize.2=7814026532
diskState.2=7
rdevNumber.2=2
rdevStatus.2=DISK_OK
rdevName.2=sdae
rdevOffset.2=64
rdevSize.2=7814026532
rdevReads.2=2827769379
rdevWrites.2=104442749
rdevNumErrors.2=0
diskNumber.3=3
diskName.3=md3
diskSize.3=2930266532
diskState.3=7
rdevNumber.3=3
rdevStatus.3=DISK_OK
rdevName.3=sdb
rdevOffset.3=64
rdevSize.3=2930266532
rdevReads.3=851517873
rdevWrites.3=79011396
rdevNumErrors.3=0
diskNumber.4=4
diskName.4=md4
diskSize.4=2930266532
diskState.4=7
rdevNumber.4=4
rdevStatus.4=DISK_OK
rdevName.4=sdi
rdevOffset.4=64
rdevSize.4=2930266532
rdevReads.4=878982648
rdevWrites.4=94610
rdevNumErrors.4=0
diskNumber.5=5
diskName.5=md5
diskSize.5=2930266532
diskState.5=7
rdevNumber.5=5
rdevStatus.5=DISK_OK
rdevName.5=sdh
rdevOffset.5=64
rdevSize.5=2930266532
rdevReads.5=883408419
rdevWrites.5=105519
rdevNumErrors.5=0
diskNumber.6=6
diskName.6=md6
diskSize.6=7814026532
diskState.6=7
rdevNumber.6=6
rdevStatus.6=DISK_OK
rdevName.6=sdp
rdevOffset.6=64
rdevSize.6=7814026532
rdevReads.6=2855351764
rdevWrites.6=224479
rdevNumErrors.6=0
diskNumber.7=7
diskName.7=md7
diskSize.7=2930266532
diskState.7=7
rdevNumber.7=7
rdevStatus.7=DISK_OK
rdevName.7=sdf
rdevOffset.7=64
rdevSize.7=2930266532
rdevReads.7=877619351
rdevWrites.7=15662
rdevNumErrors.7=0
diskNumber.8=8
diskName.8=md8
diskSize.8=2930266532
diskState.8=7
rdevNumber.8=8
rdevStatus.8=DISK_OK
rdevName.8=sdl
rdevOffset.8=64
rdevSize.8=2930266532
rdevReads.8=878063909
rdevWrites.8=38031
rdevNumErrors.8=0
diskNumber.9=9
diskName.9=md9
diskSize.9=7814026532
diskState.9=7
rdevNumber.9=9
rdevStatus.9=DISK_OK
rdevName.9=sdk
rdevOffset.9=64
rdevSize.9=7814026532
rdevReads.9=2863940749
rdevWrites.9=122655
rdevNumErrors.9=0
diskNumber.10=10
diskName.10=md10
diskSize.10=15625879500
diskState.10=7
rdevNumber.10=10
rdevStatus.10=DISK_OK
rdevName.10=sdj
rdevOffset.10=64
rdevSize.10=15625879500
rdevReads.10=5181093302
rdevWrites.10=1921385118
rdevNumErrors.10=0
diskNumber.11=11
diskName.11=md11
diskSize.11=2930266532
diskState.11=7
rdevNumber.11=11
rdevStatus.11=DISK_OK
rdevName.11=sdab
rdevOffset.11=64
rdevSize.11=2930266532
rdevReads.11=886676933
rdevWrites.11=77065
rdevNumErrors.11=0
diskNumber.12=12
diskName.12=md12
diskSize.12=2930266532
diskState.12=7
rdevNumber.12=12
rdevStatus.12=DISK_OK
rdevName.12=sdaa
rdevOffset.12=64
rdevSize.12=2930266532
rdevReads.12=894882298
rdevWrites.12=93540
rdevNumErrors.12=0
diskNumber.13=13
diskName.13=md13
diskSize.13=2930266532
diskState.13=7
rdevNumber.13=13
rdevStatus.13=DISK_OK
rdevName.13=sdz
rdevOffset.13=64
rdevSize.13=2930266532
rdevReads.13=888452743
rdevWrites.13=44056
rdevNumErrors.13=0
diskNumber.14=14
diskName.14=md14
diskSize.14=2930266532
diskState.14=7
rdevNumber.14=14
rdevStatus.14=DISK_OK
rdevName.14=sdy
rdevOffset.14=64
rdevSize.14=2930266532
rdevReads.14=885679393
rdevWrites.14=84261
rdevNumErrors.14=0
diskNumber.15=15
diskName.15=md15
diskSize.15=2930266532
diskState.15=7
rdevNumber.15=15
rdevStatus.15=DISK_OK
rdevName.15=sdt
rdevOffset.15=64
rdevSize.15=2930266532
rdevReads.15=893711655
rdevWrites.15=75311
rdevNumErrors.15=0
diskNumber.16=16
diskName.16=md16
diskSize.16=2930266532
diskState.16=7
rdevNumber.16=16
rdevStatus.16=DISK_OK
rdevName.16=sdx
rdevOffset.16=64
rdevSize.16=2930266532
rdevReads.16=883541921
rdevWrites.16=57182
rdevNumErrors.16=0
diskNumber.17=17
diskName.17=md17
diskSize.17=15625879500
diskState.17=7
rdevNumber.17=17
rdevStatus.17=DISK_OK
rdevName.17=sdr
rdevOffset.17=64
rdevSize.17=15625879500
rdevReads.17=5632680065
rdevWrites.17=102142969
rdevNumErrors.17=0
diskNumber.18=18
diskName.18=md18
diskSize.18=7814026532
diskState.18=7
rdevNumber.18=18
rdevStatus.18=DISK_OK
rdevName.18=sdq
rdevOffset.18=64
rdevSize.18=7814026532
rdevReads.18=2858126032
rdevWrites.18=166982
rdevNumErrors.18=0
diskNumber.19=19
diskName.19=md19
diskSize.19=2930266532
diskState.19=7
rdevNumber.19=19
rdevStatus.19=DISK_OK
rdevName.19=sdw
rdevOffset.19=64
rdevSize.19=2930266532
rdevReads.19=879207839
rdevWrites.19=68807
rdevNumErrors.19=0
diskNumber.20=20
diskName.20=md20
diskSize.20=2930266532
diskState.20=7
rdevNumber.20=20
rdevStatus.20=DISK_OK
rdevName.20=sdv
rdevOffset.20=64
rdevSize.20=2930266532
rdevReads.20=877058994
rdevWrites.20=24740
rdevNumErrors.20=0
diskNumber.21=21
diskName.21=md21
diskSize.21=2930266532
diskState.21=7
rdevNumber.21=21
rdevStatus.21=DISK_OK
rdevName.21=sdu
rdevOffset.21=64
rdevSize.21=2930266532
rdevReads.21=881441426
rdevWrites.21=35913
rdevNumErrors.21=0
diskNumber.22=22
diskName.22=md22
diskSize.22=2930266532
diskState.22=7
rdevNumber.22=22
rdevStatus.22=DISK_OK
rdevName.22=sdm
rdevOffset.22=64
rdevSize.22=2930266532
rdevReads.22=877528002
rdevWrites.22=65843
rdevNumErrors.22=0
diskNumber.23=23
diskName.23=md23
diskSize.23=2930266532
diskState.23=7
rdevNumber.23=23
rdevStatus.23=DISK_OK
rdevName.23=sde
rdevOffset.23=64
rdevSize.23=2930266532
rdevReads.23=973091474
rdevWrites.23=106433
rdevNumErrors.23=0
diskNumber.24=24
diskName.24=md24
diskSize.24=5860522532
diskState.24=7
rdevNumber.24=24
rdevStatus.24=DISK_OK
rdevName.24=sdg
rdevOffset.24=64
rdevSize.24=5860522532
rdevReads.24=2100070271
rdevWrites.24=196029
rdevNumErrors.24=0
diskNumber.25=25
diskName.25=md25
diskSize.25=7814026532
diskState.25=7
rdevNumber.25=25
rdevStatus.25=DISK_OK
rdevName.25=sdn
rdevOffset.25=64
rdevSize.25=7814026532
rdevReads.25=2877810140
rdevWrites.25=133532
rdevNumErrors.25=0
diskNumber.26=26
diskName.26=md26
diskSize.26=5860522532
diskState.26=7
rdevNumber.26=26
rdevStatus.26=DISK_OK
rdevName.26=sdac
rdevOffset.26=64
rdevSize.26=5860522532
rdevReads.26=2017623077
rdevWrites.26=133883
rdevNumErrors.26=0
diskNumber.27=27
diskName.27=md27
diskSize.27=5860522532
diskState.27=7
rdevNumber.27=27
rdevStatus.27=DISK_OK
rdevName.27=sds
rdevOffset.27=64
rdevSize.27=5860522532
rdevReads.27=2069466478
rdevWrites.27=168701
rdevNumErrors.27=0
diskNumber.28=28
diskName.28=md28
diskSize.28=7814026532
diskState.28=7
rdevNumber.28=28
rdevStatus.28=DISK_OK
rdevName.28=sdad
rdevOffset.28=64
rdevSize.28=7814026532
rdevReads.28=2856075724
rdevWrites.28=146827
rdevNumErrors.28=0
diskNumber.29=29
diskName.29=
diskSize.29=0
diskState.29=4
diskId.29=
rdevNumber.29=29
rdevStatus.29=DISK_NP_DSBL
rdevName.29=
rdevOffset.29=0
rdevSize.29=0
rdevId.29=
rdevReads.29=0
rdevWrites.29=0
rdevNumErrors.29=0
discordianfish commented 1 year ago

hum.. your mdstat looks entirely different from what we expect: https://github.com/prometheus/procfs/blob/b4a1860af088340784210567cf7d5be61ff0d12b/testdata/fixtures.ttar#L2255

Is this really a snippet of /proc/mdstat?
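
For comparison, a conventional /proc/mdstat entry looks roughly like this (an illustrative example only; device names and sizes are arbitrary):

Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      8384448 blocks super 1.2 [2/2] [UU]

unused devices: <none>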

codefaux commented 1 year ago

Really, it actually is. I'm as surprised as you.

codefaux commented 1 year ago

Or at least I was... I've been informed that Unraid does not use the standard md module. I apologize; I only learned that literally just now.

There are many Unraid users, and many of them use this Prometheus Node Exporter.

I would LOVE to see support for Unraid's md module in the exporter so I can get proper metrics, but this is clearly not a "mainline use scenario" so I understand if the decision is made not to support it.

As of right now, the suggestion has been raised to simply disable the md collector via --no-collector.mdadm -- obviously a valid workaround, since the exporter otherwise functions fine in its current state.

Just so I can get a concrete answer; is there any intent to (eventually, no target date asked) support Unraid's md module, or should we consider the workaround permanent?

discordianfish commented 1 year ago

Interesting! I mean, I'm not opposed to supporting Unraid if someone submits clear PR(s) for this.

codefaux commented 1 year ago

Roger. I'll consider tackling it myself, but my work typically isn't up to the scale of what's going on here. We'll see how it goes; thanks for your time.
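
For what it's worth, here's a rough sketch of the kind of parsing that would be needed -- plain Go, not based on the actual procfs or node_exporter APIs, with field names taken from the dump above:

// Illustrative sketch only: read an Unraid-style mdstat (key=value pairs)
// into a global map plus a per-disk map, splitting keys like
// "rdevStatus.29" on the dot.
package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    f, err := os.Open("/proc/mdstat")
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer f.Close()

    global := map[string]string{}           // e.g. mdState=STARTED
    disks := map[string]map[string]string{} // disk index -> key -> value

    s := bufio.NewScanner(f)
    for s.Scan() {
        line := strings.TrimSpace(s.Text())
        if line == "" {
            continue
        }
        key, value, ok := strings.Cut(line, "=")
        if !ok {
            continue // not key=value; not an Unraid-style line
        }
        if name, idx, found := strings.Cut(key, "."); found {
            if disks[idx] == nil {
                disks[idx] = map[string]string{}
            }
            disks[idx][name] = value
        } else {
            global[key] = value
        }
    }
    if err := s.Err(); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }

    fmt.Println("mdState:", global["mdState"])
    for idx, d := range disks {
        fmt.Printf("disk %s: status=%s name=%s errors=%s\n",
            idx, d["rdevStatus"], d["rdevName"], d["rdevNumErrors"])
    }
}

Mapping these fields onto the existing mdadm collector metrics is the part I'm less sure about.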

Close or leave open?

SuperQ commented 1 year ago

We can leave this open for tracking.

dswarbrick commented 1 year ago

Are the contents of /sys/block/md*/md closer to those of the traditional md module? There was a GitHub issue floating around a few years ago about migrating the mdraid parser from /proc/mdstat to sysfs.
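
A quick way to check, sketched in Go for the sake of it (a plain ls would tell you the same thing):

package main

import (
    "fmt"
    "path/filepath"
)

func main() {
    // Look for the md subdirectory that the standard mdraid driver
    // exposes for each array, e.g. /sys/block/md0/md.
    matches, err := filepath.Glob("/sys/block/md*/md")
    if err != nil {
        fmt.Println(err)
        return
    }
    if len(matches) == 0 {
        fmt.Println("no /sys/block/md*/md entries found")
        return
    }
    for _, m := range matches {
        fmt.Println(m)
    }
}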

codefaux commented 1 year ago

I'm guessing no, since /sys/block/md*/md does not exist - unless I misunderstand.


dswarbrick commented 1 year ago

I guess it stands to reason that if they're not using the standard Linux mdraid module, the associated sysfs entries won't be there either.

It looks like you're not the first person to encounter this: https://forums.unraid.net/bug-reports/stable-releases/692-node-exporter-cant-parse-mdstat-to-get-disk-information-for-prometheus-r1638/