networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
2.08k stars 351 forks source link

Spurious usbhid-ups ioctl error, fails to recover, dies. #414

Open svarshavchik opened 7 years ago

svarshavchik commented 7 years ago

With nut 2.7.4, I'm seeing usbhid-ups randomly crashing, requiring a restart.

I attached strace to the running process, when it is running, in order to capture what's going on. This is what I see occasionally happening:

ioctl(10, USBDEVFS_REAPURBNDELAY, 0x7ffd4d210fe0) = 0 timerfd_settime(9, 0, {it_interval={tv_sec=0, tv_nsec=0}, it_value={tv_sec=0, tv_nsec=0}}, NULL) = 0 ioctl(10, USBDEVFS_REAPURBNDELAY, 0x7ffd4d210fe0) = -1 EAGAIN (Resource temporarily unavailable) sendto(3, "<31>Apr 1 12:46:10 usbhid-ups[4983]: libusb_get_report: Input/output error", 75, MSG_NOSIGNAL, NULL, 0) = 75

An EAGAIN on this ioctl gets treated as an error. Execution continues:

select(13, [11 12], NULL, NULL, {tv_sec=1, tv_usec=901487}) = 0 (Timeout) recvmsg(4, {msg_namelen=128}, 0) = -1 EAGAIN (Resource temporarily unavailable) recvmsg(4, {msg_namelen=128}, 0) = -1 EAGAIN (Resource temporarily unavailable) open("/dev/bus/usb/002/004", O_RDWR) = -1 EACCES (Permission denied) open("/dev/bus/usb/002/003", O_RDWR) = -1 EACCES (Permission denied) open("/dev/bus/usb/002/001", O_RDWR) = -1 EACCES (Permission denied) open("/dev/bus/usb/008/001", O_RDWR) = -1 EACCES (Permission denied) open("/dev/bus/usb/007/001", O_RDWR) = -1 EACCES (Permission denied) open("/dev/bus/usb/006/002", O_RDWR) = 13 ioctl(13, USBDEVFS_GET_CAPABILITIES, 0x558676cb1dc4) = 0 write(8, "\1", 1) = 1

/dev/bus/usb/006/002 gets reopened. This is my UPS device. But, the subsequent initialization fails. There's a bunch ioctl on file descriptor 13, then:

ioctl(13, USBDEVFS_GETDRIVER, 0x7ffd4d20f830) = 0 ioctl(13, USBDEVFS_CLAIMINTERFACE, 0x7ffd4d20f90c) = -1 EBUSY (Device or resource busy) ioctl(13, USBDEVFS_GETDRIVER, 0x7ffd4d20f830) = 0 sendto(3, "<27>Apr 1 12:46:12 usbhid-ups[4983]: Can't claim USB device [09ae:3016]: No such file or directory", 99, MSG_NOSIGNAL, NULL, 0) = 99 write(8, "\1", 1) = 1 close(13) = 0 unlink("/var/run/nut/usbhid-ups-nutdev1.pid") = 0 close(11) = 0 unlink("/var/run/nut/usbhid-ups-nutdev1") = 0 close(12) = 0 exit_group(1) = ? +++ exited with 1 +++

From what I see, the original file descriptor 10 was never closed. The USB device was reopened as file descriptor 13, and the initialization failed, apparently because the USB device is already claimed by file descriptor 10.

The UPS is:

Bus 006 Device 002: ID 09ae:3016 Tripp Lite Device Descriptor: bLength 18 bDescriptorType 1 bcdUSB 1.10 bDeviceClass 0 bDeviceSubClass 0 bDeviceProtocol 0 bMaxPacketSize0 8 idVendor 0x09ae Tripp Lite idProduct 0x3016 bcdDevice 0.02 iManufacturer 3 Tripp Lite iProduct 1 TRIPP LITE UPS iSerial 5 2643FVLSM871900817 bNumConfigurations 1 Configuration Descriptor: bLength 9 bDescriptorType 2 wTotalLength 34 bNumInterfaces 1 bConfigurationValue 1 iConfiguration 0 bmAttributes 0xa0 (Bus Powered) Remote Wakeup MaxPower 100mA Interface Descriptor: bLength 9 bDescriptorType 4 bInterfaceNumber 0 bAlternateSetting 0 bNumEndpoints 1 bInterfaceClass 3 Human Interface Device bInterfaceSubClass 0 bInterfaceProtocol 0 iInterface 0 HID Device Descriptor: bLength 9 bDescriptorType 33 bcdHID 1.11 bCountryCode 0 Not supported bNumDescriptors 1 bDescriptorType 34 Report wDescriptorLength 878 Report Descriptors: UNAVAILABLE Endpoint Descriptor: bLength 7 bDescriptorType 5 bEndpointAddress 0x81 EP 1 IN bmAttributes 3 Transfer Type Interrupt Synch Type None Usage Type Data wMaxPacketSize 0x0008 1x 8 bytes bInterval 12 Device Status: 0x0000 (Bus Powered)

aquette commented 7 years ago

hi @svarshavchik I saw that you also posted on the NUT users list, mentioning that you also use XFCE's Power Manager. Is this a change of behavior? Or is it a new device? I'd like to have a driver debug trace if possible, since I'm yet failing to see how to address that from a driver perspective.

svarshavchik commented 7 years ago

This is a new device. I eventually squared away the other permissions-related issue, but could not resolve this issue -- with the driver terminating, requiring a restart.

I presume you want to look at the output of the driver when it is started with the -D option?

aquette commented 7 years ago

thanks for confirming that it's not a regression. I indeed want to have a look at the driver output with -DDDDD (debug level 5), limiting to an excerpt when this issue occurs please. Would you be able to also test the new libusb1.0 branch to see the resulting behavior? And possibly a patch (github branch) to test some fix?

svarshavchik commented 7 years ago

I'll start with trying to capture the driver output, first.

I'm running Fedora-packaged nut 2.7.4.It would actually be easier for me to apply a patch to the Fedora build; but I should be able to create a new build too, it'll just take a little bit of time.

svarshavchik commented 7 years ago

/usr/sbin/usbhid-ups -h

Network UPS Tools - Generic HID driver 0.41 (2.7.4) USB communication driver 0.33

...

/usr/sbin/usbhid-ups -DDDDD -a nutdev1

...

40.757119 hu_find_infoval: found !replacebatt (value: 0) 40.757125 process_boolean_info: !replacebatt 40.757134 Entering libusb_get_report 40.760539 Report[get]: (3 bytes) => 22 20 00 40.760557 PhyMax = 0, PhyMin = 0, LogMax = 1, LogMin = 0 40.760565 Unit = 00000000, UnitExp = 0 40.760573 Exponent = 0 40.760582 Path: UPS.PowerConverter.PresentStatus.VoltageOutOfRange, Type: Feature, ReportID: 0x22, Offset: 0, Size: 1, Value: 0 40.760591 hu_find_infoval: found !vrange (value: 0) 40.760646 process_boolean_info: !vrange 40.760657 Report[buf]: (3 bytes) => 22 20 00 40.760690 PhyMax = 0, PhyMin = 0, LogMax = 1, LogMin = 0 40.760698 Unit = 00000000, UnitExp = 0 40.760703 Exponent = 0 40.760714 Path: UPS.PowerConverter.PresentStatus.Buck, Type: Feature, ReportID: 0x22, Offset: 1, Size: 1, Value: 0 40.760722 hu_find_infoval: found !trim (value: 0) 40.760727 process_boolean_info: !trim 40.760733 Report[buf]: (3 bytes) => 22 20 00 40.760741 PhyMax = 0, PhyMin = 0, LogMax = 1, LogMin = 0 40.760749 Unit = 00000000, UnitExp = 0 40.760757 Exponent = 0 40.760767 Path: UPS.PowerConverter.PresentStatus.Boost, Type: Feature, ReportID: 0x22, Offset: 2, Size: 1, Value: 0 40.760774 hu_find_infoval: found !boost (value: 0) 40.760785 process_boolean_info: !boost 40.760794 Report[buf]: (3 bytes) => 22 20 00 40.760800 PhyMax = 0, PhyMin = 0, LogMax = 1, LogMin = 0 40.760808 Unit = 00000000, UnitExp = 0 40.760815 Exponent = 0 40.760822 Path: UPS.PowerConverter.PresentStatus.Overload, Type: Feature, ReportID: 0x22, Offset: 4, Size: 1, Value: 0 40.760830 hu_find_infoval: found !overload (value: 0) 40.760835 process_boolean_info: !overload 40.760841 Report[buf]: (3 bytes) => 22 20 00 40.760848 PhyMax = 0, PhyMin = 0, LogMax = 1, LogMin = 0 40.760856 Unit = 00000000, UnitExp = 0 40.760863 Exponent = 0 40.760869 Path: UPS.PowerConverter.PresentStatus.OverTemperature, Type: Feature, ReportID: 0x22, Offset: 6, Size: 1, Value: 0 40.760876 hu_find_infoval: found !overheat (value: 0) 40.760883 process_boolean_info: !overheat 40.760891 Report[buf]: (3 bytes) => 22 20 00 40.760900 PhyMax = 0, PhyMin = 0, LogMax = 1, LogMin = 0 40.760907 Unit = 00000000, UnitExp = 0 40.760914 Exponent = 0 40.760923 Path: UPS.PowerConverter.PresentStatus.InternalFailure, Type: Feature, ReportID: 0x22, Offset: 7, Size: 1, Value: 0 40.760932 hu_find_infoval: found !commfault (value: 0) 40.760939 process_boolean_info: !commfault 40.760950 Report[buf]: (3 bytes) => 22 20 00 40.760959 PhyMax = 0, PhyMin = 0, LogMax = 1, LogMin = 0 40.760966 Unit = 00000000, UnitExp = 0 40.760972 Exponent = 0 40.760981 Path: UPS.PowerConverter.PresentStatus.AwaitingPower, Type: Feature, ReportID: 0x22, Offset: 14, Size: 1, Value: 0 40.760989 hu_find_infoval: found !awaitingpower (value: 0) 40.760995 process_boolean_info: !awaitingpower 40.761003 Entering libusb_get_report 40.764551 Report[get]: (3 bytes) => 31 be 04 40.764569 PhyMax = 0, PhyMin = 0, LogMax = 65535, LogMin = 0 40.764579 Unit = 00f0d121, UnitExp = 1 40.764586 Exponent = -6 40.764597 Path: UPS.PowerSummary.Input.Voltage, Type: Feature, ReportID: 0x31, Offset: 0, Size: 16, Value: 0.001214 40.764609 Entering libusb_get_report 40.768542 libusb_get_report: Input/output error 40.768563 Can't retrieve Report 19: Input/output error 47.217845 send_to_one: PONG 50.711367 upsdrv_updateinfo... 50.711404 Got to reconnect!

50.711411 ================================================== 50.711416 = device has been disconnected, try to reconnect = 50.711421 ================================================== 50.711476 Checking device (04A9/26B5) (002/004) 50.711553 Failed to open device, skipping. (Permission denied) 50.711566 Checking device (043D/022C) (002/005) 50.711597 Failed to open device, skipping. (Permission denied) 50.711607 Checking device (1D6B/0002) (002/001) 50.711620 Failed to open device, skipping. (Permission denied) 50.711628 Checking device (1D6B/0001) (008/001) 50.711645 Failed to open device, skipping. (Permission denied) 50.711654 Checking device (1D6B/0001) (007/001) 50.711667 Failed to open device, skipping. (Permission denied) 50.711674 Checking device (09AE/3016) (006/002) 50.785591 - VendorID: 09ae 50.785620 - ProductID: 3016 50.785626 - Manufacturer: Tripp Lite 50.785631 - Product: TRIPP LITE UPS 50.785636 - Serial Number: 2643FVLSM871900817 50.785642 - Bus: 006 50.785646 - Device release number: 0002 50.785652 Trying to match device 50.785685 Device matches 50.785701 failed to claim USB device: Device or resource busy 50.785740 failed to detach kernel driver from USB device: No such file or directory 50.785761 failed to claim USB device: Device or resource busy 50.785771 failed to detach kernel driver from USB device: No such file or directory 50.785780 failed to claim USB device: Device or resource busy 50.785787 failed to detach kernel driver from USB device: No such file or directory 50.785795 failed to claim USB device: Device or resource busy 50.785805 failed to detach kernel driver from USB device: No such file or directory 50.785814 Can't claim USB device [09ae:3016]: No such file or directory 50.785825 upsdrv_cleanup...

aquette commented 7 years ago

Attached a basic test patch (renamed to .txt to please github) It simply tries to close the handle prior to reconnecting... just to get a first level of visibility

usbhid-closehandle.txt

svarshavchik commented 7 years ago

I got a mixed bag of results.

The patch didn't build because it appears to be for libusb1.0. After poking around the headers, I changed it to usb_close(). Additionally I had to remove the mge-shut driver from the makefile because it also uses usbhid-ups.c but doesn't link with libusb, so that also needed fixing. Then everything built.

But, the first two times I tried to start the driver, it failed to start for some strange reason. First time that happened, but this does not look like it's related to the patch, likely to be a separate issue. Fortunately I was logging everything, and I'm attaching log1.txt, which is all I have. The third time was the charm, and the driver started. This looks like, maybe, a different problem.

log1.txt

log2.txt is later on, the first time it reported a timeout like it did before, but with the patch it appeared to be able to recover, and continued plugging along.

log2.txt

Satisfied that it survived a restart, I turned off the debugging and let it run normally. I just checked on it, and I see that the process was restarted after a few hours.

Searching through syslog, it looks like it was restarted due to a segfault, see syslog.txt. However syslog.txt is all that I captured, but the segfault does appear to be related. It segfault shortly after it started reporting timeouts:

syslog.txt

Also, it looks like the driver had trouble starting up again, like it did for me initially, but it eventually got started.

aquette commented 7 years ago

hi @svarshavchik sorry for the faulty patch, I indeed made it on 2.7.4, but generated it from a master in-between testing / patching on libusb1.0 branch and the removal of the oldmge-shut...

as for your issues:

Attached is another patch, against gitmaster, hope it applies without too much issues: usbhid-improvements.txt

Would be good if you could test against the libusb1.0 branch too, to see how native new code behaves compared to the venerable libusb0 (or maybe libusb-compat, not sure, but we'll have the visibility). I can check to make a patch too, just tell me...

svarshavchik commented 7 years ago

This latest patch applied and built cleanly.

It's been running for ten hours without a segfault. With this patch, I'm still logging timeouts regularly:

1921.642157 libusb_get_report: Input/output error 1921.642205 Can't retrieve Report 02: Input/output error

and

2461.863408 libusb_get_report: Input/output error 2461.863429 Can't retrieve Report 22: Input/output error

I count 150 of these over the last ten hours, or about one every four minutes, on average. Although it always reconnects without issues, I can't say if this is the expected behavior, or not. I have no prior knowledge of libusb; but my strace showed the disconnects happening as a consequence of EAGAIN, which is something that looks like it can be retried immediately, instead of winding things down, and taking it from step 1.

I can try building a patched libusb1.0. I'll need some time for this. It was much simpler to just add a patch to the current Fedora build, than to create a new build from scratch.

svarshavchik commented 7 years ago

Yeah, I'll need some time to get the libusb-1.0 branch building...

Makefile.am: installing './INSTALL' clients/Makefile.am: installing './depcomp' configure.ac:1526: error: required file 'scripts/augeas/nutupsconf.aug.in' not found configure.ac:1526: error: required file 'scripts/devd/nut-usb.conf.in' not found configure.ac:1526: error: required file 'scripts/udev/nut-usbups.rules.in' not found parallel-tests: installing './test-driver' autoreconf: automake failed with exit status: 1

On that branch, configure.ac references files that are not in git.

aquette commented 7 years ago

You need to call the auto gen.should first which generates these files ;)

aquette commented 7 years ago

Complementary comments:

svarshavchik commented 7 years ago

Managed to build the libusb-1.0 branch on Fedora. I should have some useful data after I let it run for a while. In the meantime here's the version of the patch for this branch:

nut-2.7.4-usbhid-improvements.txt

There's also an unrelated makefile problem when building from a fresh git checkout, I'll put together a pull request for that separately. Fedora's patches for its RPM package also needed some work, I'll figure out what to do with them later...

gcomes commented 6 years ago

I also have this kind of problem (nut 2.7.4, usbhid-ups randomly crashing and requiring a restart). I'm using opensuse 42.3 and I fixed the problem using the usbhid-closehandle.txt patch with a small modification:

diff -Nru nut-2.7.4.orig/drivers/usbhid-ups.c nut-2.7.4/drivers/usbhid-ups.c
--- nut-2.7.4.orig/drivers/usbhid-ups.c 2017-04-04 16:23:07.066706595 +0200
+++ nut-2.7.4/drivers/usbhid-ups.c      2017-04-04 16:24:34.977580518 +0200
@@ -1356,6 +1356,9 @@
 {
        int ret;

+       /* Try to close the previous handle */
+       comm_driver->close(udev);
+
        upsdebugx(4, "==================================================");
        upsdebugx(4, "= device has been disconnected, try to reconnect =");
        upsdebugx(4, "==================================================");
llitz commented 3 years ago

@aquette - Any chance of this patch being merged? I was having hitting the same issue constantly after adding a second USB UPS to the same server and this has solved my issues.

jimklimov commented 3 years ago

Thanks for bringing the patch to attention, sadly it was not as noticeable as a "real" pull request. I hope to check it in a few days.

On Fri, Oct 8, 2021, 19:07 llitz @.***> wrote:

@aquette https://github.com/aquette - Any chance of this patch being merged? I was having hitting the same issue constantly after adding a second USB UPS to the same server and this has solved my issues.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/networkupstools/nut/issues/414#issuecomment-938877915, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAMPTFAGAOZSYO6545ZY36DUF4QNPANCNFSM4DGCF5CQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

opoplawski commented 2 years ago

Ping? I seem to be running into this as well.

jimklimov commented 2 years ago

Oh, right, thanks! PR posted, finally. Thanks for the nudge!

jimklimov commented 2 years ago

PR with prospect fix was merged. Did not help one system I monitor where it seems USB driver in the OS locked up and says nothing, and can't be reloaded, and the box is in the middle of some work so can't be rebooted to unwedge it either. But at least it logs trying to get vendor/product/serial 3 times now :)