NDP hard to recover after a power outage

lwa-project / ng_digital_processor

The Next Generation Digital Processor for LWA North Arm

Apache License 2.0


NDP hard to recover after a power outage #33

Closed by jaycedowell 3 months ago

jaycedowell commented 6 months ago

It was a little rough getting North Arm back up after the power outage:

NAL didn't try to bring anything up.
The first NDP INI said it was successful but it looked like all of the DRX pipelines were dead/hung.
Restarting the DRX pipelines helped but snap01 wasn't sending any data.
Another NDP INI didn't fix the "no data from snap01" problem.
Power cycling the snaps and another NDP INI did get things working again.

Questions:

Why did the DRX pipelines not start up?
Why wasn't the DRX pipelines problem detected as an error? Did I not wait long enough?
Why wasn't the snap01 problem detected as an error?
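
For the last two questions, one possible approach is to compare per-source packet counters sampled a few seconds apart after an INI and report any source whose counter did not advance. The following is only a minimal sketch under that assumption: `find_stalled_sources`, the counter dictionaries, and the source names are hypothetical and are not taken from the NDP code base.

```python
from typing import Dict, List


def find_stalled_sources(before: Dict[str, int],
                         after: Dict[str, int],
                         min_packets: int = 1) -> List[str]:
    """Return the sources whose packet counter advanced by fewer than
    `min_packets` between two samples (hypothetical helper)."""
    stalled = []
    for name, start in before.items():
        end = after.get(name)
        # A source missing from the second sample, or one whose counter
        # went backwards (e.g. after a restart), is also reported.
        if end is None or end - start < min_packets:
            stalled.append(name)
    return sorted(stalled)


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    first = {"snap01": 100, "snap02": 100}
    second = {"snap01": 100, "snap02": 250}
    print(find_stalled_sources(first, second))
```

How the counters would actually be read, and how long to wait between samples, is left open here; waiting too briefly could report a pipeline that is merely still starting.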

jaycedowell commented 6 months ago

Something similar happened this morning (April 9) with ndp-drx-0 dying (snap01 seems ok, as does ndp-drx-1). NDP did go into error because of T-engine packet loss of 50%.

Update: Oh:

Apr 09 03:39:17 ndp1 ndp-drx-0[9513]: terminate called after throwing an instance of 'VerbsSend::Error'
Apr 09 03:39:17 ndp1 ndp-drx-0[9513]:   what():  Failed to determine remote hardware address: (0) Success
Apr 09 03:39:17 ndp1 systemd[1]: ndp-drx-0.service: Main process exited, code=killed, status=6/ABRT
Apr 09 03:39:17 ndp1 systemd[1]: ndp-drx-0.service: Failed with result 'signal'.

So this could be a startup condition where someone wasn't ready when -0 launched?
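
If it is a startup ordering problem, one generic mitigation is to poll a readiness check with a timeout before opening the sender, instead of failing on the first attempt. The sketch below illustrates that pattern only: `wait_until_ready` and the `check` callable are hypothetical, and what "ready" would mean for the remote end (for example, its hardware address being resolvable) is an assumption rather than something confirmed by the log.

```python
import time
from typing import Callable


def wait_until_ready(check: Callable[[], bool],
                     timeout: float = 30.0,
                     interval: float = 1.0,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse.

    Returns True if the check succeeded and False on timeout.
    `clock` and `sleep` are injectable so the loop can be tested
    without real delays.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


if __name__ == "__main__":
    attempts = []

    def fake_check() -> bool:
        # Stand-in for a real readiness probe; succeeds on the 3rd try.
        attempts.append(1)
        return len(attempts) >= 3

    ok = wait_until_ready(fake_check, timeout=5.0, interval=0.1)
    print("ready" if ok else "timed out", "after", len(attempts), "attempts")
```

A False return could then be reported as an explicit error instead of letting the process abort.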

GregBTaylor commented 6 months ago

On the morning of April 9th (around 12:20 am), there was a power outage at NA. At about 3 am I did an INI on ASP and then an INI on NDP. Both systems appeared to fully recover and reported no errors; however, the sky was blank.

jaycedowell commented 6 months ago

Did you happen to look at like_bmon.py to see what was unhappy inside NDP?

jaycedowell commented 6 months ago

6c3810e, 37b7ec8, and 887e0c2 are now active.

jaycedowell commented 6 months ago

@ctaylor-physics reports that NDP was recovered with a single INI on April 14.

jaycedowell commented 3 months ago

It seems better now if you ignore (1) how Orville sometimes doesn't automatically resume imaging (not really an NDP problem as far as I can tell) and (2) the startup packet loss thing (#30).