spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/

ramp_fit memory usage #2144

Open · hbushouse opened this issue 6 years ago

hbushouse commented 6 years ago

JIRA ticket https://jira.stsci.edu/browse/JP-323 reports problems processing a (rather large) NIRSpec BrightObj (TSO) exposure in calwebb_detector1, with the processing going on for hours and hours and eventually crashing.

The dataset in question is NRS_BRIGHTOBJ mode, using the SUB2048 (2048 x 32) subarray, NGROUPS=3 and NINTS=3000. The size of the level-1b (uncal) file is ~1 GB.
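As a quick sanity check on that number, the raw group data alone accounts for roughly that much (a back-of-the-envelope estimate, assuming 16-bit raw counts and ignoring FITS headers and other extensions):

```python
# Rough size of the raw data cube: NINTS x NGROUPS x NROWS x NCOLS x 2 bytes
nints, ngroups, nrows, ncols = 3000, 3, 32, 2048
raw_bytes = nints * ngroups * nrows * ncols * 2  # uint16 raw counts (assumed)
print(f"{raw_bytes / 1e9:.2f} GB")  # -> 1.18 GB, in line with the ~1 GB uncal file
```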

I did some trial runs of calwebb_detector1 processing on my system, which has 32 GB of RAM, and found some interesting behavior. The 2 steps with the longest processing times are, unsurprisingly, jump and ramp_fit, because there are 3000 integrations to process. The output of the jump step was saved to a _ramp product. Examination of the group_dq array in the _ramp file showed that many pixels, in all integrations, have group 3 flagged as an outlier, which means ramp_fit is left to fit only 2 usable groups for many pixels.
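For reference, a check along these lines can be run directly on the saved _ramp product. This is just a sketch, assuming the usual (nints, ngroups, nrows, ncols) ordering of groupdq and the JUMP_DET group flag from jwst.datamodels.dqflags:

```python
from jwst.datamodels import RampModel, dqflags

with RampModel('BOTS_ramp.fits') as model:
    jump = dqflags.group['JUMP_DET']
    # groupdq is (nints, ngroups, nrows, ncols); look at the last (3rd) group
    last_group_flagged = (model.groupdq[:, -1, :, :] & jump) != 0
    frac_per_int = last_group_flagged.mean(axis=(1, 2))
    print(f"fraction of pixels with group 3 flagged as a jump: "
          f"{frac_per_int.min():.2f}-{frac_per_int.max():.2f} per integration")
```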

The ramp_fit step now does its processing in 3 main loops or phases. I tracked the processing time and memory usage of each phase.

At the end of processing ramp_fit reported:

The number of pixels having insufficient data due to excessive CRs or saturation 1432
Number of pixels in 2D array: 65536
Shape of 2D image: (32, 2048)
Shape of data cube: (3, 32, 2048)
Buffer size (bytes): 30720000
Number of rows per buffer: 32
Number of groups per integration: 3
Number of integrations: 3000
The execution time in seconds: 26386.010621

The total execution time translates to 440 mins or 7.3 hours (!).

Processing did succeed, but it obviously took a very long time and nearly exhausted the RAM on my system. The steady increase in RAM usage during Phase 1 of ramp_fit is at least interesting, if not actually worrisome. Should that be happening? Are there arrays that are steadily built up during that phase? Or do we have a problem with memory not being freed properly at the end of the processing for each integration?

The dataset in question is available at: /grp/jwst/ssb/bushouse/jwst_data/NIRSpec/BrightObj/BOTS_uncal.fits. The BOTS_ramp.fits file is also there, which can be used as input directly to the ramp_fit step (to avoid having to redo all the upstream processing).
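To look into the Phase 1 memory growth, one option is to sample the process RSS while ramp_fit runs on that _ramp file. A minimal sketch (assumes psutil is available; the 5-second sampling interval is arbitrary):

```python
import threading
import time

import psutil
from jwst.ramp_fit import RampFitStep

samples, stop = [], threading.Event()

def sample_rss(interval=5.0):
    proc = psutil.Process()
    while not stop.is_set():
        samples.append(proc.memory_info().rss / 1e9)  # resident set size, GB
        time.sleep(interval)

monitor = threading.Thread(target=sample_rss, daemon=True)
monitor.start()
result = RampFitStep.call('BOTS_ramp.fits')  # run just the ramp_fit step
stop.set()
monitor.join()
print(f"peak RSS ~{max(samples):.1f} GB over {len(samples)} samples")
```

Plotting the samples against elapsed time would show whether usage climbs monotonically through Phase 1 or drops back between integrations, which should help distinguish an accumulating array from memory that simply isn't being freed per integration.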

hbushouse commented 6 years ago

There's a similar problem occurring for a NIRISS SOSS TSO exposure in the DMS test data cache. In the DMS test environment on the C-string the exposure "jw10003001001_03101_00001-seg002_nis" is failing in calwebb_tso1 processing during the ramp_fit step with the simple error "Killed" showing up in the processing log. The suspected problem is running out of memory, although I would've thought the test environment would have enough RAM to handle this.

The latest DMS run of this dataset can be found on the C-string in "/ifs/int/jwstc/store/doggett/tests/run273/", with the error log "/ifs/int/jwstc/owl/logs/doggett_jw10003001001_03101_00001-seg002_nis_1528446667.361178/ALOG_1528447835_level_2a_jw10003001001_03101_00001-seg002_nis.err."

Interestingly, "seg001" of this exposure, which is the same size as "seg002", succeeds in completing the ramp_fit step and calwebb_tso1 processing.