NIRSpec BOTS data with NINT=3000 crashes TSO3

stscijgbot-jp commented 4 years ago

Issue JP-1505 was created on JIRA by James Muzerolle:

When processing a NIRSpec BOTS exposure containing 3000 integrations through calwebb_tso3, the pipeline crashed at the outlier detection step with the error message "combine: too many arrays". Is there a limit on the number of integrations the step can handle? I was able to successfully process a similar exposure with only 600 integrations.

stscijgbot-jp commented 4 years ago

Comment by David Davis on JIRA:

This is a limitation of stsci.image/src/_combinemodule.c. The module has a limit of 1800 input images

#define MAX_ARRAYS 1800

As we requested. (Bump number up as H. Bushouse requested).

What should we request for TSO observations? Is there a practical limit?

stscijgbot-jp commented 4 years ago

Comment by James Muzerolle on JIRA:

Yes, the number of integrations per exposure cannot exceed 65535.

stscijgbot-jp commented 4 years ago

Comment by Howard Bushouse on JIRA:

I'm wondering if another way to fix this (rather than bumping up MAX_ARRAYS to 65535) would be to limit the number of integrations used in the outlier_detection step to form the median image, which is then used to compare to each individual integration to look for outliers. I'm guessing that using "only" 1800 integrations would do a pretty good job of creating a useful median that has all forms of outliers rejected. We would then still compare all integrations to that median, so we would not be limiting the number of integrations that actually get outliers detected. We could potentially even use a smaller limit on the number images/integrations going into the median, just to speed things up. Say 1000, for a nice round number (or 1001, if it's easier for computing the median from an odd number of inputs).

This would require a modification to the outlier_detection code to only use some max number of images in the input stack when forming the median. We don't want to limit the input stack itself, because all integrations need to have the detection applied to them.

stscijgbot-jp commented 4 years ago

Comment by James Muzerolle on JIRA:

In principle, this solution sounds reasonable. Does the max_arrays limit also apply to the other instruments' TSO modes, and if so, are they happy with it?

stscijgbot-jp commented 4 years ago

Comment by Howard Bushouse on JIRA:

Yes, the limit would apply equally to all occasions where a stack of integrations/images is being run through outlier_detection, even in normal imaging cases where exposure-averaged images are being used (although I really doubt anyone will ever have >1000 individual image exposures to combine). I can't see any scientific case where forming a median from 10k or 30k sets of data is going to be much better than using only 1k, but I guess it's possible.

stscijgbot-jp commented 4 years ago

Comment by Jeff Valenti on JIRA:

Nestor Espinoza, Sarah Kendrew, Nikolay Nikolov - you may want to watch and perhaps comment on this ticket.

stscijgbot-jp commented 4 years ago

Comment by Sarah Kendrew on JIRA:

thanks for tagging me Jeff Valenti. I agree with Howard Bushouse that this seems like a reasonable solution. This would always be the first 1000 integrations of an exposure I presume?

stscijgbot-jp commented 4 years ago

Comment by Howard Bushouse on JIRA:

Taking the first 1000 integrations would certainly be the most straightforward approach and is probably what we'd implement at least to start with. I suppose that for really large numbers of integrations (e.g. tens of thousands) something a little more sophisticated could be implemented in which we do something like using every 10th or 30th (or whatever's appropriate) integration, just so the median is composed of data covering the whole time range of the exposure.

stscijgbot-jp commented 4 years ago

Comment by Nikolay Nikolov on JIRA:

Both imaging and spectroscopy TSO data sets are expected to be highly uniform in terms of PSF shape and location on the detector. Taking the sequence of the first (middle, last, etc.) 500-1000 images would be sufficient to build high SNR median combined image (MCI). However, the flux of the MCI would need to be scaled to the flux of each element of the time series (image_i) before taking the difference image_i - MCI. This is needed to take into account variability associated with each science case, e.g. exoplanet transits/eclipses/phase curves, supernova, etc. Otherwise, there will be residual flux in the difference image that could be detected and marked. The same would be expected even if one median combines the entire time series. I didn't see this consideration here (but might be missing something): https://jwst-pipeline.readthedocs.io/en/stable/jwst/outlier_detection/index.html#outlier-detection-step

stscijgbot-jp commented 4 years ago

Comment by Sarah Kendrew on JIRA:

good point Nikolay Nikolov - it would be good to check what the "statistical comparison" entails for TSOs so that any changes in flux between individual images and the MCI do not get flagged as "outliers". (Naively I would assume that the subtle flux changes associated with a typical TSO target fall below any sensible "outlier" threshold - but let's make sure). Can you comment on that Howard Bushouse?

stscijgbot-jp commented 4 years ago

Comment by Howard Bushouse on JIRA:

That thought had occurred to at least 1 or 2 of us back when we were adapting the regular imaging-based outlier detection to work with stacks of TSO data. Turns out that a "scaled" version of the comparison was implemented, which scales the MCI to match the overall flux of each individual image in order to avoid false positives due to changes in source flux. However, this was only implemented for the imaging mode of TSO (by doing simple aperture photometry on the source of interest in each image). It has not yet been implemented for spectroscopic modes of TSO. I guess some sort of overall flux measurement (similar to what the "white-light" pipeline step does by integrating spectra over wavelength) would need to be computed in order to do a similar kind of scaling operation in outlier detection.

stscijgbot-jp commented 4 years ago

Comment by Nikolay Nikolov on JIRA:

Fitting the MCI with a single scaling parameter for the maximal/average/median flux would work well. One limitation for this method could be the CRs affecting image_i (i.e. 2d spectrum), i.e. the image for which we would want to take the difference with the MCI. With many CRs (i.e. long integrations) the scaling could be biased toward higher values. One solution could be to simply use the median in such cases (or perhaps in all).

Another way would be to define a dynamic MCI from a set of say 300-400 sequential frames, that slide along the time series centered on image_i. In that way the entire subset would vary in flux with image_i, so the resulting MCI would also have flux close the the flux of image_i.

stscijgbot-jp commented 4 years ago

Comment by Howard Bushouse on JIRA:

Fixed by #5114

spacetelescope / jwst

NIRSpec BOTS data with NINT=3000 crashes TSO3 #5035