lwa-project / ng_digital_processor

The Next Generation Digital Processor for LWA North Arm
Apache License 2.0

NDP hard to recover after a power outage #33

Closed jaycedowell closed 3 months ago

jaycedowell commented 6 months ago

It was a little rough getting North Arm back up after the power outage:

Questions:

  1. Why did the DRX pipelines not start up?
  2. Why wasn't the DRX pipelines problem detected as an error? Did I not wait long enough?
  3. Why wasn't the snap01 problem detected as an error?
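
For questions 2 and 3, one way to catch a pipeline that never came up is to poll systemd for the unit state after an INI and wait a bounded time before declaring failure. This is only a minimal sketch, not how NDP actually does its error detection: the unit names are taken from the journal excerpt later in this thread, and `unit_is_active` and `wait_for_units` are hypothetical helpers made up for illustration. The only external call is `systemctl is-active --quiet`, which exits 0 when the unit is active.

```python
import subprocess
import time

# Unit names as they appear in the journal excerpt in this thread;
# the real list of pipelines to check is an assumption.
DRX_UNITS = ["ndp-drx-0", "ndp-drx-1"]


def unit_is_active(unit, run=subprocess.run):
    """Return True if `systemctl is-active --quiet` exits 0 for the unit."""
    result = run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0


def wait_for_units(units, timeout=120.0, interval=5.0,
                   is_active=unit_is_active,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll until every unit is active or the timeout expires.

    Returns the list of units that are still down (empty on success).
    """
    deadline = clock() + timeout
    while True:
        down = [u for u in units if not is_active(u)]
        if not down or clock() >= deadline:
            return down
        sleep(interval)


if __name__ == "__main__":
    still_down = wait_for_units(DRX_UNITS)
    if still_down:
        print("ERROR: units not running:", ", ".join(still_down))
    else:
        print("all DRX units active")
```

The timeout makes the "did I not wait long enough?" question explicit: anything still down after the deadline is reported instead of being silently ignored.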

jaycedowell commented 6 months ago

Something similar happened this morning (April 9) with ndp-drx-0 dying (snap01 seems OK, as does ndp-drx-1). NDP did go into error because of T-engine packet loss of 50%.

Update: Oh:

Apr 09 03:39:17 ndp1 ndp-drx-0[9513]: terminate called after throwing an instance of 'VerbsSend::Error'
Apr 09 03:39:17 ndp1 ndp-drx-0[9513]:   what():  Failed to determine remote hardware address: (0) Success
Apr 09 03:39:17 ndp1 systemd[1]: ndp-drx-0.service: Main process exited, code=killed, status=6/ABRT
Apr 09 03:39:17 ndp1 systemd[1]: ndp-drx-0.service: Failed with result 'signal'.

So this could be a startup condition where someone wasn't ready when -0 launched?
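
If it really is a startup race, one generic mitigation is to retry the step that fails while the other end is not ready yet, with a growing delay, and only give up after a few attempts. The sketch below is illustrative only: the actual failure is in the `VerbsSend` code path shown in the log, the fix that was eventually deployed may look nothing like this, and `retry_until_ready` and `open_sender` are hypothetical names.

```python
import time


def retry_until_ready(action, attempts=5, initial_delay=1.0, backoff=2.0,
                      retry_on=(Exception,), sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` is exhausted.

    Waits `initial_delay` seconds after the first failure and multiplies
    the wait by `backoff` after each later one. The last failure is
    re-raised so the caller (or systemd) still sees the error.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= backoff


if __name__ == "__main__":
    state = {"tries": 0}

    def open_sender():
        # Hypothetical stand-in for the step that failed in the log:
        # it fails twice, as if the remote side were not up yet.
        state["tries"] += 1
        if state["tries"] < 3:
            raise RuntimeError("Failed to determine remote hardware address")
        return "sender ready"

    print(retry_until_ready(open_sender, initial_delay=0.1))
```

A similar effect can be had outside the pipeline by letting the service manager restart the unit after a failure, at the cost of losing the distinction between "not ready yet" and "really broken".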

GregBTaylor commented 6 months ago

On the morning of April 9th (around 12:20 am), there was a power outage at NA. At about 3 am I did an INI on ASP and then an INI on NDP. Both systems appeared to fully recover and reported no errors; however, the sky was blank.

jaycedowell commented 6 months ago

Did you happen to look at like_bmon.py to see what was unhappy inside NDP?

jaycedowell commented 6 months ago

6c3810e, 37b7ec8, and 887e0c2 are now active.

jaycedowell commented 6 months ago

@ctaylor-physics reports that NDP was recovered with a single INI on April 14.

jaycedowell commented 3 months ago

It seems better now if you ignore (1) how Orville sometimes doesn't automatically resume imaging (not really an NDP problem as far as I can tell) and (2) the startup packet loss thing (#30).