spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/

ramp_fit memory usage #2144

Open · hbushouse opened this issue 6 years ago

hbushouse commented 6 years ago

JIRA ticket https://jira.stsci.edu/browse/JP-323 reports problems processing a (rather large) NIRSpec BrightObj (TSO) exposure in calwebb_detector1, with the processing going on for hours and hours and eventually crashing.

The dataset in question is NRS_BRIGHTOBJ mode, using the SUB2048 (2048 x 32) subarray, NGROUPS=3 and NINTS=3000. The size of the level-1b (uncal) file is ~1 GB.
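As a quick sanity check on that number, the raw group data alone accounts for roughly that much (a back-of-the-envelope estimate, assuming 16-bit raw counts and ignoring FITS headers and other extensions):

```python
# Rough size of the raw data cube: NINTS x NGROUPS x NROWS x NCOLS x 2 bytes
nints, ngroups, nrows, ncols = 3000, 3, 32, 2048
raw_bytes = nints * ngroups * nrows * ncols * 2  # uint16 raw counts (assumed)
print(f"{raw_bytes / 1e9:.2f} GB")  # -> 1.18 GB, in line with the ~1 GB uncal file
```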

I did some trial runs of calwebb_detector1 processing on my system, which has 32 GB of RAM, and found some interesting behavior. The 2 steps with the longest processing times are, unsurprisingly, jump and ramp_fit, because there are 3000 integrations to process. The output of the jump step was saved to a _ramp product. Examination of the group_dq array in the _ramp file showed that many pixels, in all integrations, have group 3 flagged as an outlier, which means ramp_fit is left to fit only 2 usable groups for many pixels.
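For reference, a check along these lines can be run directly on the saved _ramp product. This is just a sketch, assuming the usual (nints, ngroups, nrows, ncols) ordering of groupdq and the JUMP_DET group flag from jwst.datamodels.dqflags:

```python
from jwst.datamodels import RampModel, dqflags

with RampModel('BOTS_ramp.fits') as model:
    jump = dqflags.group['JUMP_DET']
    # groupdq is (nints, ngroups, nrows, ncols); look at the last (3rd) group
    last_group_flagged = (model.groupdq[:, -1, :, :] & jump) != 0
    frac_per_int = last_group_flagged.mean(axis=(1, 2))
    print(f"fraction of pixels with group 3 flagged as a jump: "
          f"{frac_per_int.min():.2f}-{frac_per_int.max():.2f} per integration")
```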

The ramp_fit step now does its processing in 3 main loops or phases. I tracked the processing time and memory usage of each phase.

At the end of processing ramp_fit reported:

The number of pixels having insufficient data due to excessive CRs or saturation 1432
Number of pixels in 2D array: 65536
Shape of 2D image: (32, 2048)
Shape of data cube: (3, 32, 2048)
Buffer size (bytes): 30720000
Number of rows per buffer: 32
Number of groups per integration: 3
Number of integrations: 3000
The execution time in seconds: 26386.010621

The total execution time translates to 440 mins or 7.3 hours (!).

Processing did succeed, but it obviously took a very long time and nearly exhausted the RAM on my system. The steady increase in RAM usage during Phase 1 of ramp_fit is at least interesting, if not actually worrisome. Should that be happening? Are there arrays that are steadily built up during that phase? Or do we have a problem with memory not being freed properly at the end of the processing for each integration?

The dataset in question is available at: /grp/jwst/ssb/bushouse/jwst_data/NIRSpec/BrightObj/BOTS_uncal.fits. The BOTS_ramp.fits file is also there, which can be used as input directly to the ramp_fit step (to avoid having to redo all the upstream processing).
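To look into the Phase 1 memory growth, one option is to sample the process RSS while ramp_fit runs on that _ramp file. A minimal sketch (assumes psutil is available; the 5-second sampling interval is arbitrary):

```python
import threading
import time

import psutil
from jwst.ramp_fit import RampFitStep

samples, stop = [], threading.Event()

def sample_rss(interval=5.0):
    proc = psutil.Process()
    while not stop.is_set():
        samples.append(proc.memory_info().rss / 1e9)  # resident set size, GB
        time.sleep(interval)

monitor = threading.Thread(target=sample_rss, daemon=True)
monitor.start()
result = RampFitStep.call('BOTS_ramp.fits')  # run just the ramp_fit step
stop.set()
monitor.join()
print(f"peak RSS ~{max(samples):.1f} GB over {len(samples)} samples")
```

Plotting the samples against elapsed time would show whether usage climbs monotonically through Phase 1 or drops back between integrations, which should help distinguish an accumulating array from memory that simply isn't being freed per integration.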

hbushouse commented 6 years ago

There's a similar problem occurring for a NIRISS SOSS TSO exposure in the DMS test data cache. In the DMS test environment on the C-string the exposure "jw10003001001_03101_00001-seg002_nis" is failing in calwebb_tso1 processing during the ramp_fit step with the simple error "Killed" showing up in the processing log. The suspected problem is running out of memory, although I would've thought the test environment would have enough RAM to handle this.

The latest DMS run of this dataset can be found on the C-string in "/ifs/int/jwstc/store/doggett/tests/run273/", with the error log "/ifs/int/jwstc/owl/logs/doggett_jw10003001001_03101_00001-seg002_nis_1528446667.361178/ALOG_1528447835_level_2a_jw10003001001_03101_00001-seg002_nis.err."

Interestingly, "seg001" of this exposure, which is the same size as "seg002", succeeds in completing the ramp_fit step and calwebb_tso1 processing.