networkupstools / nut


Monitor multiple bcmxcp devices troubles #684

Open kyuferev opened 5 years ago

kyuferev commented 5 years ago

Hi there. Background: #597. I have a setup of one OrangePi One board with 3 USB hubs (each hub has its own PSU) connected to each other and 17 UPSes to monitor. Most of them are Eaton PW9120 6000i, plus some PW9130. The PW9120 units are connected to the USB hubs via RS232-USB cables based on the PL2303 chip and use the bcmxcp driver. The PW9130 units are connected via USB A to USB B cables and use the usbhid-ups driver. I use a custom shell script to regenerate ups.conf after each reboot, and I also have zabbix-agent installed with a bunch of user parameters. I have two major problems with this setup:

  1. The board itself hangs with continuous "CPU stalled" errors on an external display and recovers only after I disconnect and reconnect its PSU. The problem goes away if I reduce the number of connected UPSes to 10. Is there a limit on the number of simultaneously monitored UPSes? Or is it that the board itself can't handle the load (it's pretty cheap)?
  2. Right after a reboot, when my shell script has generated a correct ups.conf, everything works fine. But some hours later (sometimes 1 or 2 hours, sometimes more than 5) the bcmxcp driver starts to fail and logs
    bcmxcp[1277]: Communications with UPS lost: Error executing command
    bcmxcp[1277]: Short read from UPS
    upsd[1347]: Data for UPS [ups-6] is stale - check driver

    to syslog. It always starts with one UPS (not always the same one), but some time after the first failure other bcmxcp UPSes start to fail too. Sometimes these failures are accompanied by disconnects of /dev/ttyUSB* devices. In that case I also get "port N disabled by hub (EMI?), re-enabling..." and then "unable to enumerate USB device" errors in syslog. I must admit that the whole setup is in a harsh environment with a lot of electromagnetic interference. But some months ago, when I was just testing this kind of setup, everything was stable and I was able to acquire all the data I needed.

I've already tried swapping cables, hubs, the OrangePi board, and PSUs, with no success; the scenario is always the same. I've run out of ideas on how to fix this. What can I do to get more debug data? Why do the bcmxcp devices keep disconnecting?
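
One low-effort way to get more data out of the existing logs is to tally the driver errors per process and the stale notices per UPS, to see which units fail first and how often. The following is a minimal sketch, not part of NUT: the function name `tally` and both regular expressions are hypothetical, and the sample text is just the three syslog lines quoted above. It assumes the usual `program[pid]: message` syslog layout.

```python
import re
from collections import defaultdict

# Sample lines copied from the report above; in practice, read the syslog file instead.
SAMPLE = """\
bcmxcp[1277]: Communications with UPS lost: Error executing command
bcmxcp[1277]: Short read from UPS
upsd[1347]: Data for UPS [ups-6] is stale - check driver
"""

# Assumed layout: "program[pid]: message", optionally preceded by timestamp and host.
LINE_RE = re.compile(r"(?P<prog>[\w-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)")
STALE_RE = re.compile(r"Data for UPS \[(?P<ups>[^\]]+)\] is stale")


def tally(text):
    """Count driver messages per (program, pid) and stale notices per UPS name."""
    per_driver = defaultdict(lambda: defaultdict(int))
    stale = defaultdict(int)
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue  # not a syslog line in the assumed layout
        if m["prog"] == "upsd":
            s = STALE_RE.search(m["msg"])
            if s:
                stale[s["ups"]] += 1
        else:
            per_driver[(m["prog"], int(m["pid"]))][m["msg"]] += 1
    # Return plain dicts so the result is easy to print or compare.
    return {k: dict(v) for k, v in per_driver.items()}, dict(stale)


if __name__ == "__main__":
    drivers, stale = tally(SAMPLE)
    print(drivers)
    print(stale)
```

Running this over a full day of syslog and sorting by first occurrence would show whether failures cluster on one hub or spread across all of them.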

kyuferev commented 5 years ago

I've just caught another UPS getting disconnected with "Short read from UPS" errors in syslog. But this time the UPS came back after upsdrvctl start upsname, with two "Communications with UPS lost: Receive error (Requested only mode command): 4!!!" errors.

kyuferev commented 5 years ago

And a usbhid-ups device failed with "Can't connect to UPS [ups-8] (usbhid-ups-ups-8): No such file or directory". Output of upsc and upsdrvctl:

root@upsmon:~# upsc ups-8
Init SSL without certificate database
Error: Driver not connected
root@upsmon:~# upsdrvctl start ups-8
Network UPS Tools - UPS driver controller 2.7.2
Network UPS Tools - Generic HID driver 0.38 (2.7.2)
USB communication driver 0.32
No matching HID UPS found
Driver failed to start (exit status=1)
Network UPS Tools - Generic HID driver 0.38 (2.7.2)
USB communication driver 0.32
No matching HID UPS found
Driver failed to start (exit status=1)
Network UPS Tools - Generic HID driver 0.38 (2.7.2)
USB communication driver 0.32
No matching HID UPS found
Driver failed to start (exit status=1)
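
"No matching HID UPS found" suggests checking whether the kernel still lists the UPS on the USB bus at all. The sketch below is one possible check under stated assumptions: it reads the `idVendor` and `idProduct` files that Linux exposes under `/sys/bus/usb/devices`, the function name `usb_devices_by_vendor` is hypothetical, and the vendor ID must be taken from `lsusb` on a working system, since it is not given in this thread.

```python
from pathlib import Path


def usb_devices_by_vendor(vendor_id, sysfs_root="/sys/bus/usb/devices"):
    """Return (device name, product id) pairs for USB devices with the given vendor id.

    vendor_id is a 4-digit hex string as shown by lsusb; the caller must supply it.
    """
    found = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return found  # not Linux, or sysfs is not mounted
    for dev in sorted(root.iterdir()):
        try:
            vendor = (dev / "idVendor").read_text().strip().lower()
        except OSError:
            continue  # interface entries such as "1-1:1.0" carry no idVendor file
        if vendor != vendor_id.lower():
            continue
        try:
            product = (dev / "idProduct").read_text().strip().lower()
        except OSError:
            product = ""
        found.append((dev.name, product))
    return found
```

An empty result for the UPS vendor ID while the cable is plugged in would point at the hub or the cable rather than at the driver.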

aquette commented 5 years ago

Hi

  1. There is no specific limit on the NUT side, but the USB devices may be drawing too much power if your hubs are not self-powered.

  2. Try lowering pollinterval in ups.conf. The UPS may be flooded. Also check the battery test period. Look at MAXAGE in upsd.conf and possibly upsmon.conf.

Cheers

aquette commented 5 years ago

Btw, some hard reset of the units may be required to get back to a sane situation. Power down, unplug the power cable, count to 10, ...

kyuferev commented 5 years ago

> 1. There is no specific limit on the NUT side, but the USB devices may be drawing too much power if your hubs are not self-powered.

All hubs are self-powered, so that isn't the problem.

> 2. Try lowering pollinterval in ups.conf. The UPS may be flooded. Also check the battery test period. Look at MAXAGE in upsd.conf and possibly upsmon.conf.

But if I lower the pollinterval value, even more requests will be sent to the UPS, or am I getting it wrong? Right now I have pollinterval = 10 in ups.conf and MAXAGE 25 in upsd.conf. I've tried to fix the problem by modifying these parameters, but with no success. Should I increase MAXAGE even more?

I should also note that I have zabbix-agent installed on this board, and it is the one that collects data from NUT. There are a lot (100+) of custom parameters configured, and they are polled every 15 seconds.
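
If each of those parameters runs its own upsc call (an assumption, since the user parameter definitions are not shown here), the number of processes started on the board can be cut by calling upsc once per UPS and serving every parameter from the parsed result. A minimal sketch: `parse_upsc` and `read_ups` are hypothetical helper names, and the filter on dotted names assumes that NUT variables look like `ups.status` or `battery.charge`, so that banner lines are skipped.

```python
import subprocess


def parse_upsc(text):
    """Turn upsc output ("name: value" per line) into a dict of strings."""
    values = {}
    for line in text.splitlines():
        name, sep, value = line.partition(": ")
        # Assumption: real variable names are dotted; this skips banners and errors.
        if sep and "." in name and " " not in name:
            values[name.strip()] = value.strip()
    return values


def read_ups(ups_name):
    """Run upsc once for one UPS and return all of its variables."""
    result = subprocess.run(
        ["upsc", ups_name], capture_output=True, text=True, check=True
    )
    return parse_upsc(result.stdout)
```

For example, `read_ups("ups-6")` could be run once per polling cycle and its result written to a file that the individual Zabbix items read. Note that upsc queries upsd rather than the UPS itself, so this reduces load on the board, not serial traffic to the UPS.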

kyuferev commented 5 years ago

> Btw, some hard reset of the units may be required to get back to a sane situation. Power down, unplug the power cable, count to 10, ...

Unfortunately that's impossible because of the large number of servers connected to these UPSes.

kyuferev commented 5 years ago

The UPS got disconnected again. There are no errors in dmesg, and all /dev/ttyUSB devices are in place. In syslog:

bcmxcp[1450]: Communications with UPS lost: Error executing command
bcmxcp[1450]: Short read from UPS
upsd[1452]: Data for UPS [ups-12] is stale - check driver

No response from UPS:

~# upsdrvctl -DDD start ups-12
Network UPS Tools - UPS driver controller 2.7.2
   0.000000
   0.004203     Starting UPS: ups-12
   0.004715     3 remaining attempts
   0.005061     exec:  /lib/nut/bcmxcp -a ups-12
Network UPS Tools - BCMXCP UPS driver 0.28 (2.7.2)
RS-232 communication subdriver 0.20
No response from UPS on /dev/ttyUSB6 with baudrate 9600
Attempting to autodect baudrate
Can't connect to the UPS on port /dev/ttyUSB6!

  44.629455     Driver failed to start (exit status=1)
  49.629867     2 remaining attempts
  49.630052     exec:  /lib/nut/bcmxcp -a ups-12
Network UPS Tools - BCMXCP UPS driver 0.28 (2.7.2)
RS-232 communication subdriver 0.20
No response from UPS on /dev/ttyUSB6 with baudrate 9600
Attempting to autodect baudrate
Can't connect to the UPS on port /dev/ttyUSB6!

  94.259703     Driver failed to start (exit status=1)
  99.260083     1 remaining attempts
  99.260269     exec:  /lib/nut/bcmxcp -a ups-12
Network UPS Tools - BCMXCP UPS driver 0.28 (2.7.2)
RS-232 communication subdriver 0.20
No response from UPS on /dev/ttyUSB6 with baudrate 9600
Attempting to autodect baudrate
Can't connect to the UPS on port /dev/ttyUSB6!

 143.889160     Driver failed to start (exit status=1)
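
The timestamps in this debug output show how long each start attempt waits before giving up. The sketch below pairs every "exec:" line with the following "Driver failed to start" line and subtracts the timestamps. The function name `attempt_durations` is hypothetical, and the sample contains only the timestamped lines from the output above.

```python
import re

# Timestamped lines copied from the upsdrvctl -DDD output above.
DEBUG_OUTPUT = """\
   0.005061     exec:  /lib/nut/bcmxcp -a ups-12
  44.629455     Driver failed to start (exit status=1)
  49.630052     exec:  /lib/nut/bcmxcp -a ups-12
  94.259703     Driver failed to start (exit status=1)
  99.260269     exec:  /lib/nut/bcmxcp -a ups-12
 143.889160     Driver failed to start (exit status=1)
"""

STAMP_RE = re.compile(r"^\s*(\d+\.\d+)\s+(.*)$")


def attempt_durations(text):
    """Return the seconds between each driver exec and its failure message."""
    durations = []
    started = None
    for line in text.splitlines():
        m = STAMP_RE.match(line)
        if not m:
            continue  # driver banner or blank line without a timestamp
        stamp, msg = float(m.group(1)), m.group(2)
        if msg.startswith("exec:"):
            started = stamp
        elif msg.startswith("Driver failed to start") and started is not None:
            durations.append(round(stamp - started, 3))
            started = None
    return durations


if __name__ == "__main__":
    print(attempt_durations(DEBUG_OUTPUT))
```

Each of the three attempts runs for roughly 44.6 seconds before failing, so the driver is waiting on the port and timing out during baud rate autodetection rather than being rejected immediately.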