simonsobs / sotodlib

Simons Observatory: Time-Ordered Data processing library.

`iir_params` keys `a` and `b` are not populated for some observations #969

Open · mmccrackan opened this issue 1 month ago

mmccrackan commented 1 month ago

I searched and didn't see any issues for this error specifically, but for some observations the `iir_params` dictionary populated by `_load_book_detset` in `load_book.py` contains `None` for the IIR filter parameters `a` and `b`, so the `fourier_filter` step fails with:

"/global/homes/m/mmccrack/so_home/repos/20240819_catchup/sotodlib/sotodlib/tod_ops/filters.py", line 522, in iir_filter B, A = np.polyval(b[::-1], z), np.polyval(a[::-1], z) TypeError: 'NoneType' object is not subscriptable

The other parameters in the `iir_params` dictionary (`enabled` and `fscale`) are populated. The `iir_filter` function in `filters.py` checks whether `a` and `b` were passed as function inputs and falls back to the values in the tod axis manager if not, but there is no check for whether the values in the axis manager dictionary are themselves `None`.
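A minimal sketch of the kind of guard that could close this gap (the signature, fallback logic, and `fscale` convention here are simplified from the description above, not the exact `filters.py` code):

```python
import numpy as np

def iir_filter(freqs, tod=None, a=None, b=None, fscale=1.0, invert=False):
    # Simplified version of the existing fallback: if a/b were not passed
    # in, pull them from the axis manager's iir_params.
    if a is None or b is None:
        iir_params = tod.iir_params
        a, b = iir_params['a'], iir_params['b']
        fscale = iir_params['fscale']
    # Proposed guard: fail with a clear message instead of hitting the
    # "'NoneType' object is not subscriptable" TypeError below.
    if a is None or b is None:
        raise ValueError(
            "iir_params['a'] / ['b'] are None for this observation; "
            "cannot build the readout filter transfer function.")
    z = np.exp(-2j * np.pi * freqs * fscale)  # sketch of the fscale convention
    B, A = np.polyval(b[::-1], z), np.polyval(a[::-1], z)
    trans = B / A
    return trans**-1 if invert else trans
```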

An example observation where this occurs is:

```python
obsid = 'obs_1717755016_satp3_1111111'
dets = {'wafer_slot': 'ws4', 'wafer.bandpass': 'f150'}
```
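For reference, roughly how one can reproduce this with the standard `Context` API (the `context.yaml` path is a placeholder, and the exact filter call may differ from your pipeline config):

```python
from sotodlib.core import Context
from sotodlib import tod_ops
from sotodlib.tod_ops import filters

ctx = Context('context.yaml')  # placeholder: path to your context file
aman = ctx.get_obs('obs_1717755016_satp3_1111111',
                   dets={'wafer_slot': 'ws4', 'wafer.bandpass': 'f150'})

print(aman.iir_params)  # on affected obs, 'a' and 'b' come back as None

# Raises TypeError: 'NoneType' object is not subscriptable (filters.py:522)
sig = tod_ops.fourier_filter(aman, filters.iir_filter(invert=True))
```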

chervias commented 1 month ago

Yes, we have known about these for a while; they crashed the mapmaker many times, and we had to put in specific try/except statements to avoid that. I'm not sure why this happens.

msilvafe commented 1 month ago

This issue goes all the way down our software stack to the streamer (https://github.com/simonsobs/smurf-streamer/issues/40) so I will assign @jlashner

mhasself commented 1 week ago

This is a data integrity issue, and might indicate deeper problems; thanks for creating an issue.

Questions I'd like to have answers to ASAP:

  • How often does this happen?
  • Do we detect any cases where the readout filter indeed seems to be misconfigured (not just reporting incorrectly)?

Whether or not this gets fixed upstream, we will probably want a mitigation for existing data (unless it's extremely rare). I don't really like the approach of "oh just put in default filter params if they're not there"... that defeats all the care we've taken to record the state of the system. I would rather do this with an external (obsdb/obs_info) tag that confirms, for certain obs, that it's safe to jam in filter parameters.

jlashner commented 1 week ago

I don't have a full quantitative response, but I did write a script to scan satp1 data to check for this problem, and though I didn't check the full archive, I couldn't find any examples of it.

Is it true this only happens for satp3?

My current theory is that this happens whenever satp3 uses `jackhammer hammer --soft` instead of doing a full hammer. We have it configured so that we should receive a message containing the filter params whenever they are updated or queried (configured here).

When performing a full hammer, the setup function will set defaults for all hardware registers. I'm wondering if this is the only time we receive metadata about the filter parameters, in which case restarting the streamer without re-running the full setup would result in some missing metadata. Before, I was under the impression that rogue dumped all monitored parameters to us on initialization, but maybe that only happens when these defaults are written.

There are currently other problems with soft-hammering and setup that we are trying to understand, so we've already recommended that satp3 stop using that function. However, to address this more thoroughly, I think we can either make sure to set all register defaults even when soft-hammering, or make sure we are polling these registers regularly, say every 30 seconds or so.
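To make the polling option concrete, a rough sketch (the `query_filter_registers` callable is a hypothetical stand-in for the streamer's actual rogue register reads):

```python
import threading

POLL_INTERVAL_S = 30  # cadence suggested above

def start_filter_param_poller(query_filter_registers, stop_event):
    """Re-query the filter registers on a timer. Per the config described
    above, a query should trigger a metadata message with the params, so
    regular polling bounds how long they can go unrecorded."""
    def loop():
        while not stop_event.is_set():
            query_filter_registers()  # hypothetical register-read hook
            stop_event.wait(POLL_INTERVAL_S)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t
```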

mmccrackan commented 1 week ago

Just some statistics on how common the issue is: from the satp3 site-pipeline run I did, it occurred in 139 out of 1530 observations (~9%). Counting different bands and wafers separately, it happened 508 times.

There were none in the satp1 run.

jlashner commented 1 week ago

I ended up running some tests on SATp1 and determined that soft-hammering the smurf is indeed the problem. I put some notes about what I tested on my confluence page here.

Unfortunately, I think we cannot trust the filter parameters for these observations to match those from other observations. After a soft-hammer, we don't run the function that sets all rogue vars based on a configuration file, so the filter param variables will be uninitialized, or initialized to something incorrect (I think the base defaults are set here).

Moving forward, I think we should just remove the soft-hammer option from jackhammer.

jlashner commented 1 week ago

Actually, I think the parameters in the `_SmurfProcessor.py` file match what we normally run with, so it might be OK to assume they're correct.

msilvafe commented 1 week ago

> This is a data integrity issue, and might indicate deeper problems; thanks for creating an issue.
>
> Questions I'd like to have answers to ASAP:
>
>   • How often does this happen?
>   • Do we detect any cases where the readout filter indeed seems to be misconfigured (not just reporting incorrectly)?
>
> Whether or not this gets fixed upstream, we will probably want a mitigation for existing data (unless it's extremely rare). I don't really like the approach of "oh just put in default filter params if they're not there"... that defeats all the care we've taken to record the state of the system. I would rather do this with an external (obsdb/obs_info) tag that confirms, for certain obs, that it's safe to jam in filter parameters.

@mhasself can you elaborate on what you think constitutes "safe" and what info you want added to the obsdb (or a new external obsdb)? I think our only option when they're missing is to just assume they're the same as in the data where they aren't missing.

mhasself commented 1 week ago

> @mhasself can you elaborate on what you think constitutes "safe" and what info you want added to the obsdb (or a new external obsdb)? I think our only option when they're missing is to just assume they're the same as in the data where they aren't missing.

It would be an external obsdb. "Safe" means you're confident that you know how the filter is configured. So if @jlashner convinces us that in this case the Smurfs are running with some default parameters, and we know what those default parameters are, and we have looked at a few noise spectra to confirm the filter seems to be doing that... then it's pretty safe. So we mark all obs we know about as `assume_iir_params=1`, and have code that turns that into some `iir_params` in the aman.
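A sketch of what that code could look like (the `assume_iir_params` tag, the `obs_info` lookup, and the default values are all hypothetical, just to make the proposal concrete; the defaults are deliberately left elided since they'd have to be confirmed first):

```python
# Hypothetical defaults; they'd have to be confirmed against
# _SmurfProcessor.py and checked against noise spectra before use.
ASSUMED_IIR_PARAMS = {'enabled': True, 'fscale': ..., 'a': ..., 'b': ...}

def maybe_patch_iir_params(aman, obs_info):
    """Fill in assumed filter params only for obs explicitly tagged safe.
    Treats iir_params as the dict-like object described in this issue."""
    if aman.iir_params['a'] is not None:
        return  # real parameters were recorded; nothing to do
    if obs_info.get('assume_iir_params') != 1:
        raise ValueError(
            "iir_params are missing and this obs is not tagged "
            "assume_iir_params=1; refusing to guess the filter config.")
    for key, val in ASSUMED_IIR_PARAMS.items():
        aman.iir_params[key] = val
```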

Hopefully this doesn't happen any more, but if it does, we will want to know about it, and not just quietly patch everything over. The failure cause, in the future, could be different from what's happening here, and it might not be true that we can set `assume_iir_params=1`.