networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
1.92k stars 345 forks source link

CPS CP1500PFCLCD frequent disconnects #1450

Open jsmith432 opened 2 years ago

jsmith432 commented 2 years ago

I have 2 identical CPS CP1500PFCLCD. They were working well connected directly to Windows, and the other to a QNAP NAS. I wanted to have more devices connected to them, so I got two raspberry pi and installed NUT server on them. from time to time, the USB seems to disconnect those devices, and both the Windows client and a webnut (that I am using to monitor availability) lose connection to the host.

Are there any specific settings that work best on these? Or any workaround for those frequent disconnects? It takes sometimes several hours between disconnects.

Example, At 21:27:09, Webnut seems that is not able to connect anymore, Down | 2022-06-18 21:27:09 | Request failed with status code 500

The only messages I see on the logs are after I disconnect and reconnect it manually: /var/log/messages

Jun 18 21:29:54 raspberrypi3 kernel: [123040.530774] usb 1-1.1.2: USB disconnect, device number 8 Jun 18 21:29:56 raspberrypi3 kernel: [123042.381860] usb 1-1.1.2: new full-speed USB device number 9 using dwc_otg Jun 18 21:29:56 raspberrypi3 kernel: [123042.537210] usb 1-1.1.2: New USB device found, idVendor=0764, idProduct=0501, bcdDevice= 0.01 Jun 18 21:29:56 raspberrypi3 kernel: [123042.537295] usb 1-1.1.2: New USB device strings: Mfr=3, Product=1, SerialNumber=2 Jun 18 21:29:56 raspberrypi3 kernel: [123042.537307] usb 1-1.1.2: Product: CP1500PFCLCD Jun 18 21:29:56 raspberrypi3 kernel: [123042.537316] usb 1-1.1.2: Manufacturer: CPS Jun 18 21:29:56 raspberrypi3 kernel: [123042.537327] usb 1-1.1.2: SerialNumber: 000000000000 Jun 18 21:29:56 raspberrypi3 kernel: [123042.555495] hid-generic 0003:0764:0501.0005: hiddev96,hidraw0: USB HID v1.10 Device [CPS CP1500PFCLCD] on usb-3f980000.usb-1.1.2/input0 Jun 18 21:29:56 raspberrypi3 mtp-probe: checking bus 1, device 9: "/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.1/1-1.1.2" Jun 18 21:29:56 raspberrypi3 mtp-probe: bus: 1, device: 9 was not an MTP device Jun 18 21:29:56 raspberrypi3 mtp-probe: checking bus 1, device 9: "/sys/devices/platform/soc/3f980000.usb/usb1/1-1/1-1.1/1-1.1.2" Jun 18 21:29:56 raspberrypi3 mtp-probe: bus: 1, device: 9 was not an MTP device

Any ideas on how to improve this?

jimklimov commented 2 years ago

Please check the NUT issue tracker for CPS label, I think the device disconnects are a frequently reported issue with this brand, up to the point of common ways to address it (e.g. so udev won't go crazy). IIRC there was a recent PR to delay USB reconnection that might be relevant for this issue - if it is a firmware reboot. On the other side, I think there were reports that these devices restarted connections intensively if an OS or NUT driver did not communicate quickly...

jsmith432 commented 2 years ago

Thanks for the tips. I used to ran these in my NAS (that uses NUT), but I have more equipment now that I decided to run it from a Pi, that would be the on until the end. I looked into what the NAS is doing with NUT and it looks like it keeps testing the USB driver and restart the services if there is a problem (although I didn't understand all the code there). I guess I need to figure out a similar workaround.

biergaizi commented 2 years ago

@jsmith432 Can you try using an external USB Hub with its own dedicated power supply? Single-board computers like the Raspberry Pi often have unstable USB ports due to power supply limitations. Though I think it's unlikely to be the problem here (since the UPS doesn't take power from the USB port), but it's still worthwhile to rule out this cause.

jsmith432 commented 2 years ago

@biergaizi thanks for the suggestion. I tried to connect a spare powered hub yesterday but 14 hours later it still got disconnected. Right now the workaround I have is a script that tails the log and restart upsdrvctl.

Hostiss commented 2 years ago

I can confirm the above issues. I have a dedicated PC terminal (HP T620 Plus to be exact) to just act as a NUT tool, what is interesting is that, for over a week i was monitoring other model of this brand (VP1600ELCD-FR) using the same hardware and cables, and i didn't have any disconnects or whatsoever. Now, only after switching to this model (i have CP1300EPFCLCD but it's from the same family), using the exact same terminal, my own usb cables, just after changing a couple of variable names and descriptions, i had my first disconnect after like 30 minutes or so. So i don't think it's something to do with for example unstable USB ports.

I ended up with the same approach as @jsmith432 was suggesting, i made a simply node-red flow which monitors syslog and after some keywords, restarts upsdrvctl, but there is definitely something wrong.

jlg89 commented 1 year ago

I'm running into this issue with a Maruson TAK-LV1K UPS using the nutdrv_qx driver. The USB connection resets every few seconds, and eventually the Pi reboots on its own. NUT seems to function normally, though.

ups.conf:

maxretry = 5
maxstartdelay = 90

[myups]
driver = nutdrv_qx
port = auto
desc = "Maruson UPS"
override.battery.packs = 18
runtimecal = 270,100,4680,13
chargetime = 36000
default.battery.voltage.high = 40.95
default.battery.voltage.low = 31.5
default.battery.voltage.nominal = 39.0

Spontaneous reconnect sequence in /var/log/syslog (i.e. the cable isn't being touched):

May  5 08:54:29 hoobs kernel: [ 1126.782465] usb 1-1.1: new low-speed USB device number 22 using xhci_hcd
May  5 08:54:29 hoobs kernel: [ 1126.924730] usb 1-1.1: New USB device found, idVendor=06da, idProduct=0201, bcdDevice= 0.01
May  5 08:54:29 hoobs kernel: [ 1126.924744] usb 1-1.1: New USB device strings: Mfr=3, Product=1, SerialNumber=0
May  5 08:54:29 hoobs kernel: [ 1126.924750] usb 1-1.1: Product: USB Card
May  5 08:54:29 hoobs kernel: [ 1126.924755] usb 1-1.1: Manufacturer: PPC
May  5 08:54:29 hoobs kernel: [ 1126.933743] hid-generic 0003:06DA:0201.0123: hiddev96,hidraw0: USB HID v1.11 Device [PPC USB Card] on usb-0000:01:00.0-1.1/input0
May  5 08:54:29 hoobs mtp-probe: checking bus 1, device 22: "/sys/devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb1/1-1/1-1.1"
May  5 08:54:29 hoobs mtp-probe: bus: 1, device: 22 was not an MTP device
May  5 08:54:29 hoobs mtp-probe: checking bus 1, device 22: "/sys/devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb1/1-1/1-1.1"
May  5 08:54:29 hoobs mtp-probe: bus: 1, device: 22 was not an MTP device
May  5 08:54:33 hoobs kernel: [ 1130.561932] usb 1-1.1: USB disconnect, device number 22
May  5 08:54:33 hoobs kernel: [ 1130.862538] usb 1-1.1: new low-speed USB device number 23 using xhci_hcd
May  5 08:54:33 hoobs kernel: [ 1131.004851] usb 1-1.1: New USB device found, idVendor=06da, idProduct=0201, bcdDevice= 0.01
May  5 08:54:33 hoobs kernel: [ 1131.004869] usb 1-1.1: New USB device strings: Mfr=3, Product=1, SerialNumber=0
May  5 08:54:33 hoobs kernel: [ 1131.004878] usb 1-1.1: Product: USB Card
May  5 08:54:33 hoobs kernel: [ 1131.004885] usb 1-1.1: Manufacturer: PPC
May  5 08:54:33 hoobs kernel: [ 1131.014488] hid-generic 0003:06DA:0201.0124: hiddev96,hidraw0: USB HID v1.11 Device [PPC USB Card] on usb-0000:01:00.0-1.1/input0
May  5 08:54:33 hoobs mtp-probe: checking bus 1, device 23: "/sys/devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb1/1-1/1-1.1"
May  5 08:54:33 hoobs mtp-probe: bus: 1, device: 23 was not an MTP device
May  5 08:54:33 hoobs mtp-probe: checking bus 1, device 23: "/sys/devices/platform/scb/fd500000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0/usb1/1-1/1-1.1"
May  5 08:54:33 hoobs mtp-probe: bus: 1, device: 23 was not an MTP device
May  5 08:54:37 hoobs kernel: [ 1134.658698] usb 1-1.1: USB disconnect, device number 23

I tried creating a rule for udev, in /etc/udev/rules.d/90-local.rules but it didn't have an effect:

SUBSYSTEMS=="usb", ENV{MODALIAS}=="usb:06da:0201", ENV{MODALIAS}="usb-card"

Interesting note: When I add "pollinterval = 10" to ups.conf, these reconnects then occur every 20 seconds instead of every 4 seconds (i.e. double the poll interval, for some reason). So it definitely has something to do with the way NUT is interacting with the USB.

jlg89 commented 1 year ago

FWIW I've tried a different USB cable, ran rpi-update and rpi-eeprom-update. No change. I'll try putting a powered hub in there and see if that helps.

jimklimov commented 1 year ago

Wildly guessing if mtp-probe induces problems? Can it be disabled for this ID (or completely for the test)?

jlg89 commented 1 year ago

I redid /etc/udev/rules.d/90-local.rules to:

ATTR{idVendor}=="06da", ATTR{idProduct}=="0201", ENV{MTP_NO_PROBE}="1"

...but I still got the "...was not an MTP device" messages in syslog. Finally found the way to disable MTP at https://superuser.com/a/1258799 and the MTP messages are gone, but that still didn't stop the USB reconnects.

jimklimov commented 1 year ago

Is there anything else on that USB bus/hub that may be causing the resets? is there a separate hub involved that may have a buggy port-reset implemented as a hub reset?

jimklimov commented 1 year ago

Is there anything else on that USB bus/hub that may be causing the resets?

Is there a separate hub involved that may have a buggy port-reset implemented as a hub-reset?

Is there a different firmware for the UPS?

I'd generally suspect power-saving, but 3-4 sec feels too soon for it to kick in...

jlg89 commented 1 year ago

The UPS is the only USB device connected to the Pi, and the resets occur at double the NUT polling rate. I think default is 2 seconds, if I read right, and the resets occur at 4 second intervals. When I set "pollinterval = 10" in ups.conf, resets then occur at 20 second intervals.

I just received a powered USB hub, I'll connect it between the Pi and the UPS this evening and see if that makes any difference. I doubt that it will, but it's worth a shot for the sake of troubleshooting.

jimklimov commented 1 year ago

The double polling rate correlation sounds interesting, although does not ring a bell exactly.

IIRC there were some (serial Megatec Qx? maybe huawei2000 or some other recently added?) devices that were a bit slow on controller side, and/or supported a different protocol dialect and did not handle deviations well, so missed part of the driver-emitted data stream, got confused and reset themselves. This at least feels consistent with effectively getting a reset every second (third?) data update. Not sure if that applies to USB packet-based communications though. However Qx as used here is a serial or serial-shim-over-USB protocol, and it has a ton of dialects that may be sensitive to contents of the ASCIIish strings in the protocol, so this may be a lead. And it may be a different issue from the hijacked original post after all :)

Possibly, running the driver with more debug verbosity (practical max is 6 or so) can give more hints to what is sent there and back, and what smells fishy.

You might have an easier time with debugging setup using a master-branch build (even if not installing over your packaged setup - see https://github.com/networkupstools/nut/wiki/Building-NUT-for-in%E2%80%90place-upgrades-or-non%E2%80%90disruptive-tests for example); more so if you'd have to add some in-place logging at interesting spots (with upsdebug_hex() probably).

jlg89 commented 1 year ago

I inserted a powered USB hub between the Pi and the UPS, no change at all. So I'll see what I can do with some debug logging.

jlg89 commented 1 year ago

I put "DEBUG_MIN 6" in ups.conf but I'm not seeing any additional log entries in syslog or kern.log. Am I looking in the wrong place?

jimklimov commented 1 year ago

Just in case: are you doing this with NUT 2.8.0 release or newer (e.g. custom build of current master)? The debug_min ability arrived that recently (is not in 2.7.4 or older, which is too often what packages still deliver)...

For those older NUT releases, you would have to edit their init-script to pass -DDDDDD among CLI arguments to the driver program, and in case of systemd service units - probably change the Type of unit since debug implies not forking (so staying fore-grounded).

Also since your issue is quick to reproduce, you might just stop the running driver daemon/unit, and run its program from the terminal, e.g.:

:; /lib/nut/nutdrv_qx -DDDDDD -a myUPS
jlg89 commented 1 year ago

There we go. The package hasn't been updated, it's 2.7.4. I killed all the NUT processes and ran nutdrv_qx as you suggested. Log file attached. I'm not sure what to look for in there. It looks like it takes the driver awhile to figure out how to talk to the UPS, and it always throws a "Device or resource busy" error even when it's able to communicate. You can see the 10-second polling happening...the reason it updates at twice the pollinterval is that it fails every other attempt. Correct me if I'm wrong, but it looks like something else is trying to access the USB device. Not sure what that would be.

Also, should we move this to its own thread, since as you noted this might be a separate issue from the OP's?

NUTlog.txt

jimklimov commented 1 year ago

Logs look confusing: the repetitive and numerous failed to claim USB device: could not claim interface 0: Device or resource busy and send: error sending control message: Broken pipe (-32), send: error sending control message: No such device (-19), qx_process_answer: short reply (...), Device reset handled, etc. after just about any piecemeal request like the basic Q1 despite often getting a valid reply like read: (124.4 105.6 120.2 016 60.0 2.32 00.0 00000001 suggests to me that the Qx protocols are generally supported here (replies seem valid) but end up resetting the controller/connection. It takes much more than a few seconds to even initialize the device state after 41 seconds.

Either that FW is buggy (possibly does not really support the basic Q1 protocol tried; maybe supports another and you'd benefit from checking with a matrix of subdriver and/or protocol values to find a better fit than auto-detection, or something else in the system tries to grab the device so there is a tug-and-pull to who owns the access exclusively currently (did the service-managed driver instance really stop?)

You might also try blazer_usb (or possibly blazer_ser over suitable port=/dev/ttyS*) as another implementation of Qx protocol support. It is supposed to be a subset of newer nutdrv_qx, but rumor has it they might differ in practice, especially in older codebase snapshots :\

jlg89 commented 1 year ago

The "right" thing to do is to upgrade to 2.8 so I'll be working with the latest code, but that's going to be a manual process, unless the RPi packager is working on it by some miracle.

jimklimov commented 1 year ago

Most RPi OSes I know are Debian-based, so plucking a dpkg file from sid/experimental repo may be an option.

The 2.8.0 release however did have some issues fixed since, and 2.8.1 is not tagged yet, so there may be a point in building an "in-place" replacement configuration (see wiki).