Closed jaycedowell closed 3 months ago
Something similar happened this morning (April 9) with ndp-drx-0 dying (snap01 seems ok as does ndp-drx-1). NDP did go into error because of T-engine packet loss of 50%.
Update: Oh:
Apr 09 03:39:17 ndp1 ndp-drx-0[9513]: terminate called after throwing an instance of 'VerbsSend::Error'
Apr 09 03:39:17 ndp1 ndp-drx-0[9513]: what(): Failed to determine remote hardware address: (0) Success
Apr 09 03:39:17 ndp1 systemd[1]: ndp-drx-0.service: Main process exited, code=killed, status=6/ABRT
Apr 09 03:39:17 ndp1 systemd[1]: ndp-drx-0.service: Failed with result 'signal'.
So this could be a startup condition where someone wasn't ready when -0 launched?
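If it is a startup race, one generic mitigation is to retry the connection with a backoff instead of aborting on the first failure. Below is a minimal Python sketch of that idea only. It is not how ndp-drx or VerbsSend actually work: `retry_startup` and the `connect` callable are hypothetical names, and `RuntimeError` stands in for whatever error the real sender raises.

```python
import time


def retry_startup(connect, attempts=5, initial_delay=1.0, backoff=2.0,
                  sleep=time.sleep):
    """Call connect() until it succeeds or the attempts run out.

    connect is a hypothetical callable that raises RuntimeError while the
    remote end is not ready yet (for example, while its hardware address
    cannot be resolved). The delay grows by `backoff` after each failure.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except RuntimeError as exc:
            if attempt == attempts:
                # Out of attempts: let the last error propagate.
                raise
            print(f"attempt {attempt} failed ({exc}); retrying in {delay:.1f} s")
            sleep(delay)
            delay *= backoff
```

With the defaults this waits 1, 2, 4, and 8 seconds between the five attempts, which would cover a peer that comes up a few seconds late but still fails loudly if the peer never appears.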
On the morning of April 9th (around 12:20am), there was a power outage at NA. At about 3am I did an INI on ASP and then an INI on NDP. Both systems appeared to fully recover and reported no errors; however, the sky was blank.
Did you happen to look at like_bmon.py to see what was unhappy inside NDP?
6c3810e, 37b7ec8, and 887e0c2 are now active.
@ctaylor-physics reports that NDP was recovered with a single INI on April 14.
It seems better now if you ignore (1) how Orville sometimes doesn't automatically resume imaging (not really an NDP problem as far as I can tell) and (2) the startup packet loss thing (#30).
It was a little rough getting North Arm back up after the power outage:
NDP INI said it was successful but it looked like all of the DRX pipelines were dead/hung.
snap01 wasn't sending any data.
NDP INI didn't fix the "no data from snap01" problem.
NDP INI did get things working again.

Questions:
snap01 problem detected as an error?
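
One way a "no data" condition could be surfaced as an error is to compare per-source packet counters between two polls and flag any source whose counter did not advance. The Python sketch below illustrates only that check. `find_stalled` and the counter dictionaries are hypothetical; how NDP actually exposes per-board packet counts is not shown in this thread.

```python
def find_stalled(previous, current):
    """Return the sorted names of sources whose packet counter did not advance.

    previous and current are hypothetical {source_name: packets_received}
    snapshots taken one polling interval apart. A source missing from either
    snapshot is treated as having a count of zero.
    """
    names = set(previous) | set(current)
    return sorted(
        name for name in names
        if current.get(name, 0) <= previous.get(name, 0)
    )
```

For example, `find_stalled({"snap01": 100, "snap02": 100}, {"snap01": 100, "snap02": 250})` returns `["snap01"]`, which a monitor could then report as an error rather than as a successful INI.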