influxdata / telegraf

Agent for collecting, processing, aggregating, and writing metrics, logs, and other arbitrary data.
https://influxdata.com/telegraf
MIT License

E! [inputs.ipmi_sensor]: Error in plugin: failed to run command #5190

Closed lee-costa closed 3 years ago

lee-costa commented 5 years ago

Once in a while my server will ramp up its fans and the inputs.ipmi_sensor plugin will throw this error.

E! [inputs.ipmi_sensor]: Error in plugin: failed to run command /usr/bin/ipmitool -H 192.168.139.8 -U root -P REDACTED -I lanplus sdr: signal: killed -

I only know it's happening because my server ramps up its fans, almost as if it's restarting itself, but that's not the case.

My configuration:

# # Read metrics from the bare metal servers via IPMI
[[inputs.ipmi_sensor]]
  servers = [
    "root:REDACTED@lanplus(192.168.139.8)"
  ]

  ## Recommended: use metric 'interval' that is a multiple of 'timeout' to avoid
  ## gaps or overlap in pulled data
  interval = "30s"

  ## Timeout for the ipmitool command to complete
  timeout = "20s"

What I can't understand is whether the server is dropping the connection or inputs.ipmi_sensor is timing out.
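The comment in the configuration above recommends an interval that is a multiple of the timeout. A minimal Python sketch of that check follows; `to_seconds` and `interval_is_multiple_of_timeout` are hypothetical helper names, not part of Telegraf, and the parser only handles simple values such as "30s" or "2m".

```python
import re

# Seconds per supported unit suffix (assumption: only s, m, h are needed here).
_UNITS = {"s": 1, "m": 60, "h": 3600}


def to_seconds(duration: str) -> int:
    """Convert a simple duration string such as "30s" or "2m" to seconds."""
    match = re.fullmatch(r"(\d+)([smh])", duration.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {duration!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


def interval_is_multiple_of_timeout(interval: str, timeout: str) -> bool:
    """Return True when the interval is a whole multiple of the timeout."""
    timeout_s = to_seconds(timeout)
    if timeout_s == 0:
        raise ValueError("timeout must be greater than zero")
    return to_seconds(interval) % timeout_s == 0


# Values from the configuration above: 30 is not a multiple of 20.
print(interval_is_multiple_of_timeout("30s", "20s"))  # False
# A hypothetical alternative that satisfies the recommendation.
print(interval_is_multiple_of_timeout("40s", "20s"))  # True
```

With the posted values, the check reports that the interval of 30s is not a multiple of the 20s timeout.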

glinton commented 5 years ago

It was likely killed because the command was timing out. Are there multiple entries in the logs when that happens, or just the one? Is telegraf running on your server or a separate machine? It sounds like your server is killing the ipmi request (perhaps oom kill or other resource control) and you get the timeout.

Also, we should probably filter out the password from the logs and enhance that output to show if it was for sure a timeout.
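As a rough illustration of the password filtering idea, here is a minimal Python sketch that masks the value following ipmitool's `-P` flag before a command line is logged. It is not Telegraf's implementation (Telegraf is written in Go); `redact_password` is a hypothetical helper and "hunter2" is a made-up password.

```python
from typing import List, Sequence


def redact_password(args: Sequence[str], flag: str = "-P") -> List[str]:
    """Return a copy of args with the value after `flag` replaced by REDACTED."""
    redacted: List[str] = []
    hide_next = False
    for arg in args:
        if hide_next:
            # This argument is the password that followed the flag.
            redacted.append("REDACTED")
            hide_next = False
        else:
            redacted.append(arg)
            hide_next = arg == flag
    return redacted


command = [
    "/usr/bin/ipmitool", "-H", "192.168.139.8", "-U", "root",
    "-P", "hunter2", "-I", "lanplus", "sdr",
]
print(" ".join(redact_password(command)))
```

The original list is left unchanged, so the real password can still be passed to the subprocess while only the masked copy reaches the log.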

lee-costa commented 5 years ago

@glinton Yes, multiple entries appear when it happens. Telegraf is running on another machine.

I will try to dig further, but is this normal behavior? If it is indeed a timeout, is there anything I can do to stop it?

glinton commented 5 years ago

It doesn't sound like normal behavior, but I've only ever ipmi'd to remote servers and couldn't notice any fans spinning up. You could increase your timeout values and see if that reduces the errors. You could also ensure your ipmi firmware is up to date and check the server logs if possible.

lee-costa commented 5 years ago

Server logs don't show any clues, only IPMI login/log out session messages. I also noticed that the server hosting telegraf is also throwing this error a few times a day.

To help clarify my setup a bit, I currently have two servers:

Server A - iDRAC x.139.9

Server B - iDRAC x.139.8

Server A has a VM running telegraf and uses IPMI to gather from itself. The VM from server A also uses IPMI to gather from server B.

servers = [
    "root:REDACTED@lanplus(192.168.139.8)",
    "root:REDACTED@lanplus(192.168.139.9)"
]

I will try increasing my timeout (double it?) and see if it helps. Apologies for not being able to provide more helpful details as I don't have much experience with this.

lee-costa commented 5 years ago

@glinton So I decided to check for updates for my server's iDRAC a few days ago and sure enough there was one. I updated it, increased the timeout as recommended and everything was fine for a few days but this morning I was greeted with this message:

E! [inputs.ipmi_sensor]: Error in plugin: failed to run command /usr/bin/ipmitool -H 192.168.139.9 -U root -P REDACTED -I lanplus sdr: exit status 1 - Error: Unable to establish IPMI v2 / RMCP+ session

I decided to also increase the interval so now I have this:

 interval = "60s"
 timeout = "30s"

The weird part is the server throwing this error is the server hosting the VM running telegraf. The remote servers I think only threw this error once.

I will report back on whether increasing the interval helps, but I just wanted your thoughts on this.

glinton commented 5 years ago

:thinking: Maybe it's a network/routing issue, but I doubt it...

danielnelson commented 5 years ago

We should change the log message to always report a timeout error if Telegraf signaled ipmitool.
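The idea of reporting a timeout whenever the caller itself killed the process can be sketched as follows. This is a Python analogue under stated assumptions, not Telegraf's Go code: `run_with_timeout` is a hypothetical helper, and a short Python child process stands in for ipmitool.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(args: Sequence[str], timeout_s: float) -> str:
    """Run a command and describe how it ended, flagging our own timeouts."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        proc.communicate(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # We are the ones killing the process, so we know it was a timeout
        # rather than an external kill (for example by the OOM killer).
        proc.kill()
        proc.communicate()  # reap the child so it does not linger
        return f"timed out after {timeout_s}s and was killed"
    if proc.returncode != 0:
        return f"exit status {proc.returncode}"
    return "ok"


# Stand-ins for a slow, a failing, and a healthy command.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
healthy = [sys.executable, "-c", "pass"]

print(run_with_timeout(slow, 0.2))
print(run_with_timeout(failing, 10))
print(run_with_timeout(healthy, 10))
```

Because the timeout branch is only reached when the caller's own deadline expires, the message can state the cause directly instead of leaving the reader to interpret a bare "signal: killed".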

lee-costa commented 5 years ago

Changing the interval and timeout helped a bit. I only saw the message again today (6 days have passed).

Thanks, everyone, for the input.

lee-costa commented 5 years ago

@glinton I think I finally found the relevant errors in iDRAC:

| ID | Message |
| --- | --- |
| DIS002 | Auto Discovery feature disabled. |
| LOG007 | The previous log entry was repeated 1 times. |
| RAC0182 | The iDRAC firmware was rebooted with the following reason: watchdog. |
| USR0030 | Successfully logged in using root, from 192.168.139.105 and IPMI over LAN. |
| IPA0100 | The iDRAC IP Address changed from 0.0.0.0 to 192.168.139.9. |
| RAC0708 | Previous reboot was due to a firmware watchdog timeout. |

I am going to do a bit of research on these errors but wanted to update you on the fact that it may just be a firmware issue.

glinton commented 5 years ago

@junior466 were you ever able to confirm this was in fact a firmware issue or not? I believe we are still planning on improving the error log to show a timeout in any case.

lee-costa commented 5 years ago

@glinton My apologies, I never had a chance to update you on the issue. I ended up contacting Dell and they helped me troubleshoot the issue, since we were also seeing a watchdog error in the logs. We concluded that the interval was too short and was causing the timeout. We also updated the firmware on both servers (they were one version behind) just in case. I ended up with an interval of 120s and haven't seen the problem in quite some time. It's safe to assume this was causing my issue.

glinton commented 5 years ago

Excellent, thanks so much for the update!