networkupstools / nut

The Network UPS Tools repository. UPS management protocol Informational RFC 9271 published by IETF at https://www.rfc-editor.org/info/rfc9271 Please star NUT on GitHub, this helps with sponsorships!
https://networkupstools.org/
Other
1.96k stars 348 forks source link

bcmxcp_usb/alarm_set: result was truncated #2248

Open desertwitch opened 9 months ago

desertwitch commented 9 months ago

I have a user with an ancient Powerware 9120 that's reportedly working with bcmxcp_usb on NUT 2.8.1, apart from the syslogs getting spammed with the following warnings (?):

Dec 23 11:21:27 FILESERVER bcmxcp_usb[11063]: alarm_set: result was truncated while setting or appending alarm_buf (limited to 256 bytes), with message: OUTGOING_MODEM_CALL_STARTED
Dec 23 11:21:29 FILESERVER bcmxcp_usb[11063]: alarm_set: result was truncated while setting or appending alarm_buf (limited to 256 bytes), with message: STARTUP_FAILURE_CHECK_EPO
Dec 23 11:21:29 FILESERVER bcmxcp_usb[11063]: alarm_set: result was truncated while setting or appending alarm_buf (limited to 256 bytes), with message: OUTGOING_MODEM_CALL_STARTED

Might be related to driver TO-DOs:

- Report UPS alarm status
BCM/XCP supports reporting a wide range of UPS alarm conditions.

- Report UPS statistics information
BCM/XCP supports reporting of UPS statistics data.

But wondering if those unhandled alarm states should be suppressed in the logs, or if the suspected core problem of the exceeded byte size of such alarms needs fixing and then it'd stay quiet anyhow (if unhandled).

Thanks a lot and wishing everyone happy holidays! Sent from mobile, please excuse the bad formatting.

desertwitch commented 9 months ago

User further reports these states are getting truncation warnings - UPS display itself showing no alarms at all:

SOFTWARE_INCOMPABILITY_DETECTED
IN_PARALLEL_OPERATION
CLOSE_BYPASS_BREAKER
AUTO_ON_COMMAND_EXECUTED
ON_DOUBLE_BOOST
UPS_CABINET_OVER_TEMP
TRANSFORMER_OVER_TEMP
AMBIENT_UNDER_TEMP
AMBIENT_OVER_TEMP
CABINET_DOOR_OPEN_VOLT_PRESENT
AUTO_SHUTDOWN_PENDING STAR

Which makes me wonder if it's just something cycling over all possible alarm states rather than those being actual reported states from the UPS. Problem does not occur on 2.7.4 version of NUT.

Will try to investigate some more once I'm back home from holidays.

desertwitch commented 8 months ago

Follow up: I couldn't really make sense of it myself code-wise, unfortunately!

goodness-from-me commented 4 months ago

I have the same issue on 2.8.2 with Poweware 3105 - the log is flooded with those messages. It is kind of a problem because the /var/log gets filled up very quickly on my home server. Otherwise works fine. The issue does not occur with 2.7.4.

jimklimov commented 4 months ago

The message itself comes from https://github.com/networkupstools/nut/blob/f67ca2f746a280357c4ad7dfa964bd998544f39e/drivers/dstate.c#L1668-L1671 and comes from PR #986 - previously the truncation was silently lost.

In a way the noise is good, to discover where we have problems like this and address them directly (less disruptive than exceptions or kernel panics which have similar ultimate goals).

In this driver, alarm values are assigned from a table:

:; grep alarm_set drivers/bcmxcp*
drivers/bcmxcp.c:                                       alarm_set(bcmxcp_alarm_map[iIndex].alarm_desc);

...A rather large table with rather long strings, so probably the device has many alarms raised at once (whether it is a mis-diag from the driver or an actual very broken device - I can not say quickly); or maybe the driver fails to call {device_,}alarm_init() and clear the alarm_buf between cycles:

jimklimov commented 4 months ago

At this code range we can see that the loop does init/set/commit rituals properly: https://github.com/networkupstools/nut/blob/f67ca2f746a280357c4ad7dfa964bd998544f39e/drivers/bcmxcp.c#L1660-L1698

It sets alarms based on bcmxcp_alarm_map[iIndex].alarm_block_index >= 0 and later also answer[bcmxcp_alarm_map[iIndex].alarm_block_index] > 0, and that index is set to either active or inactive in set_alarm_support_in_alarm_map() at https://github.com/networkupstools/nut/blob/f67ca2f746a280357c4ad7dfa964bd998544f39e/drivers/bcmxcp.c#L996-L1027

While the alarm_block_index value (negative or not) specifies if the alarm is supported at all, the answer[alarm_block_index] is supposedly populated by a read at the start of that block, and contains some sort of bit-map with raised alarms - as much as the device tells us about its current health (and/or as well as the driver can interpret that):

goodness-from-me commented 4 months ago

Just FYI I have the following alarms with my device:

REMOTE_GO_TO_BYPASS
BUILDING_ALARM_6
BUILDING_ALARM_3
STATIC_SWITCH_OVER_TEMP
RECTIFIER_FAILED
POWER_SUPPLY_OVER_TEMP
POWER_SUPPLY_12V_FAIL
PHASE_A_CURRENT_LIMIT
PHASE_C_CURRENT_LIMIT
BYPASS_NOT_AVAILABLE
BYPASS_BREAKER_OPEN
INV_BOARD_ACOV_INT_TEST_FAIL I

I believe at least some of them are either false-positives, old, or simply wrong alarms - the UPS itself operations normally, battery has been replaced recently and it's going ok. Though I bought it second hand, refurbished...

jimklimov commented 4 months ago

Is anyone testing this in position to custom-build NUT and augment that upslogx() in dstate.c block posted above to print also the current alarm_buf contents and/or the ret value which is apparently too large?

Alternately, when this situation happens - does it report one or instantly many alarming states that did not fit the buffer, and what is the committed buffer value seen e.g. in upsc (presumably)?