mmccrackan opened this issue 1 month ago
Yes, we have known about these for a while; they crashed the mapmaker many times and we had to add specific try/except statements to work around them. I'm not sure why this happens.
This issue goes all the way down our software stack to the streamer (https://github.com/simonsobs/smurf-streamer/issues/40), so I will assign @jlashner.
This is a data integrity issue, and might indicate deeper problems; thanks for creating an issue.
Questions I'd like to have answers to ASAP:

- How often does this happen?
- Do we detect any cases where the readout filter indeed seems to be misconfigured (not just reporting incorrectly)?

Whether or not this gets fixed upstream, we will probably want a mitigation for existing data (unless it's extremely rare). I don't really like the approach of "oh, just put in default filter params if they're not there"... that defeats all the care we've taken to record the state of the system. I would rather do this with an external (obsdb/obs_info) tag that confirms, for certain obs, that it's safe to jam in filter parameters.
I don't have a full quantitative response, but I did write a script to scan satp1 data to check for this problem, and though I didn't check the full archive, I couldn't find any examples of it.
Is it true this only happens for satp3?
My current theory is that this happens whenever satp3 uses `jackhammer hammer --soft` instead of doing a full hammer.
We have it configured so that we should receive a message containing the filter params whenever they are updated or queried (configured here).
When performing a full hammer, the setup function will set defaults for all hardware registers. I'm wondering if this is the only time we receive metadata about the filter parameters, in which case restarting the streamer without re-running the full setup would result in some missing metadata. Before, I was under the impression that rogue dumped all monitored parameters to us on initialization, but maybe it only happens when writing these defaults.
There are currently other problems with soft-hammering and setup that we are trying to understand, so we've already recommended that satp3 stop using that function. However, to address this more thoroughly, I think we can either make sure to set all register defaults even when soft-hammering, or make sure we are polling these registers regularly, say every 30 seconds or so.
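For the polling option, something as simple as a background loop would be enough. The sketch below is not real smurf-streamer code; `read_filter_registers` and `publish_metadata` are hypothetical stand-ins for whatever the streamer actually uses to read and emit the filter-parameter registers.

```python
import threading

def poll_filter_registers(read_filter_registers, publish_metadata,
                          interval=30.0, stop_event=None):
    """Periodically re-read the readout filter registers and republish them,
    so the metadata stream does not depend on a full hammer having run."""
    stop_event = stop_event or threading.Event()
    while not stop_event.is_set():
        # read_filter_registers() would return something like
        # {'a': [...], 'b': [...], 'enabled': ..., 'fscale': ...}
        publish_metadata('filter_params', read_filter_registers())
        stop_event.wait(interval)
```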
Just some statistics on how common the issue is: from the satp3 site-pipeline run I did, it occurred in 139 out of 1530 observations (roughly 9%). Counting each band and wafer separately, it happened 508 times.
There were none in the satp1 run.
I ended up running some tests on satp1 and determined that soft-hammering the SMuRF is indeed the problem. I put some notes about what I tested on my Confluence page here.
Unfortunately, I think we cannot trust the filter parameters for these observations to match those from other observations. After a soft-hammer, we don't run the function that sets all rogue vars based on a configuration file, so the filter param variables will be uninitialized, or initialized to something incorrect (I think the base defaults are set here).
Moving forward I think we should just remove the soft-hammer option from jackhammer.
Actually, I think those parameters in the _SmurfProcessor.py file match what we normally run with, so it might be OK to assume they're correct.
@mhasself can you elaborate on what you think constitutes "safe" and what info you want added to the obsdb (or a new external obsdb)? I think our only option when they're missing is to just assume they're the same as in the data that doesn't have them missing.
It would be an external obsdb. "Safe" means you're confident that you know how the filter is configured. So if @jlashner convinces us that in this case the Smurfs are running with some default parameters, and we know what those default parameters are, and we have looked at a few noise spectra to confirm the filter seems to be doing that ... then it's pretty safe. So we mark all events we know about as "assume_iir_params=1", and have code that turns that into some iir_params in the aman.
Hopefully this doesn't happen any more, but if it does we will want to know about it, and not just quietly patch everything over. Because the failure cause, in the future, could be different from what's happening here, and it might not be true that we can assume_iir_params=1.
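To make that concrete, here is a minimal sketch of the patch-up step, under some assumptions: `obs_tags` stands in for whatever the external obsdb lookup returns, `default_a`/`default_b` are the coefficients we have verified against noise spectra, and the `iir_params.a`/`.b` field names follow the description in this issue rather than a confirmed API.

```python
import numpy as np

def get_iir_params(aman, obs_tags, default_a=None, default_b=None):
    """Return (a, b) readout filter coefficients for this obs.

    Prefer the values recorded in aman.iir_params; fall back to the supplied
    defaults only when the external obsdb has tagged the obs with
    assume_iir_params=1. Otherwise raise, so missing metadata is never
    quietly patched over.
    """
    iir = getattr(aman, 'iir_params', None)
    a = getattr(iir, 'a', None) if iir is not None else None
    b = getattr(iir, 'b', None) if iir is not None else None
    if a is not None and b is not None:
        return np.asarray(a), np.asarray(b)
    if not obs_tags.get('assume_iir_params', 0):
        raise ValueError("iir_params missing and obs not tagged "
                         "assume_iir_params=1; refusing to guess.")
    return np.asarray(default_a), np.asarray(default_b)
```

The returned `a`/`b` could then be passed explicitly to `iir_filter`, which already accepts them as function inputs.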
I searched and didn't see any issues for this error specifically, but for some observations, the `iir_params` dictionary populated by `_load_book_detset` in `load_book.py` contains `None` for the IIR filter parameters `a` and `b`, such that the `fourier_filter` step fails with:

```
  File "/global/homes/m/mmccrack/so_home/repos/20240819_catchup/sotodlib/sotodlib/tod_ops/filters.py", line 522, in iir_filter
    B, A = np.polyval(b[::-1], z), np.polyval(a[::-1], z)
TypeError: 'NoneType' object is not subscriptable
```

The other parameters in the `iir_params` dictionary (`enabled` and `fscale`) are populated. There are checks in the `iir_filter` function in `filters.py` to see if `a` and `b` were passed as function inputs, and it defaults to those in the `tod` axis manager if they are not, though there are no checks to see if they are `None` in the axis manager dictionary itself.

An example observation where this occurs is:

```
obsid = 'obs_1717755016_satp3_1111111'
dets = {'wafer_slot': 'ws4', 'wafer.bandpass': 'f150'}
```
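For reference, a minimal check along these lines (assuming the obs has been loaded into an AxisManager as above, with the `iir_params` fields named as described in this issue) can flag affected observations before `fourier_filter` is ever called:

```python
def iir_params_missing(aman):
    """Return True if the loaded obs lacks usable IIR filter coefficients,
    i.e. the condition that makes iir_filter fail with
    TypeError: 'NoneType' object is not subscriptable.
    """
    iir = getattr(aman, 'iir_params', None)
    if iir is None:
        return True
    return getattr(iir, 'a', None) is None or getattr(iir, 'b', None) is None
```

Applied to the example observation above (ws4, f150), this should return True.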