lwa-project / lwa_sv

The Advanced Digital Processor at LWA Sevilleta
Apache License 2.0

Make INI more robust #21

Open jaycedowell opened 1 year ago

jaycedowell commented 1 year ago

The Problem

The INI sequence can fail in a lot of ways. I recently looked at the failures over the last 21 adp-control.log files, and it looks like the two main (non-ROACH2 related) culprits are: 1) processing start time mismatches and 2) pipeline launch failures.

For (1) the problem looks to be that, sometimes, the head node ends up on the other side of a second boundary from what the ROACH2s report. For example, the head node records a time of 1.9998 s while the ROACH2s are at 2.0001 s. Casting both to int gives 1 and 2, so they do not match.
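The numbers in that example can be checked directly. This is only an illustration of the truncation effect using the two values quoted above; the variable names are made up for the example and are not taken from the ADP code.

```python
# Illustrative values from the description above (seconds).
t_head = 1.9998   # time recorded by the head node
t_roach = 2.0001  # time reported by the ROACH2s

# int() truncates toward zero, so the two values land on different seconds.
head_sec = int(t_head)
roach_sec = int(t_roach)
mismatch = head_sec != roach_sec

# The actual disagreement is only about 0.3 ms.
difference = abs(t_roach - t_head)

print(head_sec, roach_sec, mismatch)
print(f"difference = {difference * 1e3:.1f} ms")
```

So a sub-millisecond offset is enough to make the integer comparison fail whenever the two clocks straddle a second boundary.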

For (2) it's a little more complicated. Sometimes it is a TBN pipeline on some node that dies. Other times a node throws a BF_STATUS_DEVICE_ERROR on all pipelines and needs a reboot.

Ideas

1) Don't cast to int. Maybe compare the two times with some kind of small tolerance (fixed at 1 ms, or maybe pulled from the ntpq -p offset values).

2) After killing off old pipelines at the start of INI, run a simple check of python -c "import bifrost" to check for the BF_STATUS_DEVICE_ERROR condition. If that is found, reboot the problematic nodes (but only try to reboot once).
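A rough sketch of both ideas, under several assumptions: the function names (times_match, check_import, nodes_to_reboot) and the node names in the usage note are hypothetical and do not come from the ADP code; the check runs on the local machine, whereas in practice it would presumably have to be run on each node; and how BF_STATUS_DEVICE_ERROR actually surfaces from the import (exit code, stderr text, or both) is a guess, so the sketch treats either a non-zero exit or that string in the output as a failure.

```python
import subprocess
import sys


def times_match(t_head, t_roach, tol=1e-3):
    """Compare two start times (seconds) with a tolerance instead of int().

    tol defaults to the fixed 1 ms suggested above; a value derived from
    NTP offsets could be passed in instead.
    """
    return abs(t_head - t_roach) <= tol


def check_import(module="bifrost", python=sys.executable, timeout=30.0):
    """Run `python -c "import <module>"` and report whether it succeeded.

    Returns (ok, output). Assumption: a device error shows up as a non-zero
    exit code and/or the text BF_STATUS_DEVICE_ERROR in stdout/stderr.
    """
    try:
        proc = subprocess.run(
            [python, "-c", f"import {module}"],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "import check timed out"
    output = proc.stdout + proc.stderr
    ok = proc.returncode == 0 and "BF_STATUS_DEVICE_ERROR" not in output
    return ok, output


def nodes_to_reboot(failed_nodes, already_rebooted):
    """Pick failed nodes that have not been rebooted yet (reboot only once)."""
    return [node for node in failed_nodes if node not in already_rebooted]
```

With these helpers, times_match(1.9998, 2.0001) is True where the int comparison fails, and nodes_to_reboot(["adp1", "adp2"], {"adp1"}) returns only "adp2", so a node that already had its one reboot is not rebooted again.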