networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
2.07k stars 350 forks source link

nut spamming syslog with Input/Output Errors #2072

Open xorbug opened 1 year ago

xorbug commented 1 year ago

Hi, as per the title, my syslog is getting massively spammed many times every few minutes like:

...
2023-09-21T11:33:32.769753+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:33:32.770158+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:33:32.770307+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:33:32.770842+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:33:32.771341+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:27.072224+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:27.074871+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:27.075033+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:27.075167+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:27.075295+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:27.075424+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:59.078712+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:59.079706+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:59.079898+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:59.080046+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:59.080319+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:40:59.081163+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:45:47.355782+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:45:47.358632+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:45:47.358878+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:45:47.359027+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
2023-09-21T11:45:47.359159+02:00 <hostname> usbhid-ups[11214]: nut_libusb_get_report: Input/Output Error
...

Maybe similar to #1983, #1986.

I couldn't relate this messages to any triggering event (yet?). Here are my ups details for reference, as reported by upsc:

battery.charge: 100
battery.charge.warning: 30
battery.runtime: 2811
battery.voltage: 13.5
device.mfr: Legrand  
device.model: Legrand UPS  
device.type: ups
driver.name: usbhid-ups
driver.parameter.bus: 001
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 2
driver.parameter.port: auto
driver.parameter.product: Legrand UPS
driver.parameter.productid: 0032
driver.parameter.synchronous: auto
driver.parameter.vendor: Legrand
driver.parameter.vendorid: 1CB0
driver.version: 2.8.0
driver.version.data: Legrand HID 0.2
driver.version.internal: 0.47
driver.version.usb: libusb-1.0.26 (API: 0x1000109)
input.voltage: 175.0
output.voltage: 2160.0
ups.delay.shutdown: 20
ups.delay.start: 30
ups.load: 0
ups.mfr: Legrand  
ups.model: Legrand UPS  
ups.productid: 0032
ups.realpower.nominal: 800
ups.status: OL CHRG
ups.vendorid: 1cb0

Thanks.

jimklimov commented 1 year ago

Well, it does look like those issues... could you check if a physical or software (usbreset) reconnection helps? Are there syslog or dmesg events that seem relevant just before it would start spamming?

jimklimov commented 1 year ago

At least this does not seem to be a spam flood that would bring the server down (by cpu load or storage), with bursts of a dozen messages every 5-7 minutes here. It is irregular, right?

xorbug commented 1 year ago

Well, it does look like those issues... could you check if a physical or software (usbreset) reconnection helps? Are there syslog or dmesg events that seem relevant just before it would start spamming?

I'm sure I tried this already without success, but here is a more thorough test.

in both cases can't see any relevant event in system logs.

At least this does not seem to be a spam flood that would bring the server down (by cpu load or storage), with bursts of a dozen messages every 5-7 minutes here. It is irregular, right?

It's irregularly constant and continuous, yes (like bursts of about 5 messages anytime between 30 sec. to 10 min. roughly, I ignore the exact values but you get the idea).

EDIT: To expand on the last point, in case it's useful for investigating the issue, I attach a file showing the bursts timestamp and the number of messages in them (obtained by journalctl --boot 0 | grep 'nut_libusb_get_report: Input/Output Error' | cut -d ' ' -f 3 | uniq -c, to which I also appended the time delta between two consecutive bursts to make that clearer). Still no apparent correlation to any triggering event tho. nut_libusb_get_report+delta.log

xorbug commented 1 year ago

Turns out that as long as this condition is ongoing and the syslog is getting spammed, the automatic shutdown sequence doesn't work. In fact upsd -c fsd only shuts down the server and not the ups, and that should be because:

# upsdrvctl shutdown
Network UPS Tools - UPS driver controller 2.8.0
Network UPS Tools - Generic HID driver 0.47 (2.8.0)
USB communication driver (libusb 1.0) 0.43
interrupt pipe disabled (add 'pollonly' flag to 'ups.conf' to get rid of this message)
Can't claim USB device [1cb0:0032]@0/0: Entity not found
Driver failed to start (exit status=1)

Then it only works after a restart of the nut stack, or an usbreset, even if in the latter case the logs are then spammed by:

usb 1-3: usbfs: process 861 (usbhid-ups) did not claim interface 0 before use

So my temporary workaround for this has been setting SHUTDOWNCMD to usbreset <device> && poweroff. This should be enough to make the call to upsdrvctl shutdown succeed during the last stages of the shutdown procedure. Probably not 100% reliable (it's somewhat of a race condition anyway before the issue starts again) but seems to work... better than nothing.

(Always take this with a grain of salt as I couldn't carry out extensive tests for obvious reasons, but I could confirm all this quite a bit of times in a row yes.)