spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/

Extreme memory consumption in Image3Pipeline.resample #6154

Closed: stscijgbot-jp closed this issue 3 years ago

stscijgbot-jp commented 3 years ago

Issue JP-2152 was created on JIRA by Jonathan Eisenhamer:

In DMS, calwebb_image3 crashes with memory issues in resample; see attached log. The memory footprint at time of failure is ~65GB.

The command to use is

strun calwebb_image3 jw00624-o029_20210610t121017_image3_003_asn.json

When running on a local macOS laptop, the processing completes, but the following behavior is seen:

The implication is that something consumes a massive amount of memory after drizzle completes for each array. The presumption is that this happens at the failure line, resample.py:219.

jdavies-st commented 3 years ago

Yes, the WCSs should always be very close. Even different visits on HST have WCSs that are pretty close. Guard rails are good. I.e., we should not expect the WCS to be offset from the true WCS by more than a few arcseconds, right?

@stsci-hack I suspect you've come up against this issue in the HAP project. Have you implemented any such quality control checks when processing HST data in pipeline mode (no user intervention)?

stscijgbot-jp commented 1 year ago

Comment by Jonathan Eisenhamer on JIRA:

Did a bit more profiling. The offending section is resample.py:216-219. One interesting note is that if outlier_detection is not run, the memory increase is more modest, only up to 20GB from a 10GB baseline.
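For reference, a minimal sketch (not the profiling setup actually used here) of how this kind of memory attribution can be done with Python's built-in tracemalloc, assuming the association file from this ticket:

import tracemalloc
from jwst.pipeline import Image3Pipeline

tracemalloc.start()

# Association file from this ticket; substitute any failing association.
Image3Pipeline.call("jw00624-o029_20210610t121017_image3_003_asn.json")

current, peak = tracemalloc.get_traced_memory()
print(f"current: {current / 1e9:.1f} GB, peak: {peak / 1e9:.1f} GB")

# Attribute the allocations still alive at this point to source lines.
for stat in tracemalloc.take_snapshot().statistics("lineno")[:10]:
    print(stat)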

stscijgbot-jp commented 1 year ago

Comment by Jonathan Eisenhamer on JIRA:

Finished characterizing the problem. There are two issues:

First: A potential refactor of resample.py:216-219 is in order. The cube of output images is duplicated at least twice, by numpy.stack and numpy.reciprocal. This memory usage is open-ended because it depends on the number of inputs. Other duplications also occur, resulting in a factor-of-3 increase overall.
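To illustrate the kind of refactor meant here (a sketch only, not the pipeline's actual weighting code; the function name is made up), the per-input inverse variances can be accumulated in place, one output-sized array at a time, so peak memory no longer grows with the number of inputs the way a stacked cube does:

import numpy as np

def combined_variance(resampled_variances):
    # Sketch: sum 1/var over an iterable of per-input variance arrays
    # without stacking them into a single (n_inputs, ny, nx) cube.
    inv_var_sum = None
    for var in resampled_variances:
        inv = np.zeros_like(var)
        good = var > 0
        inv[good] = 1.0 / var[good]
        if inv_var_sum is None:
            inv_var_sum = inv
        else:
            inv_var_sum += inv        # in-place accumulation, no cube
    # Invert the summed weights; pixels with no coverage stay at 0.
    out = np.zeros_like(inv_var_sum)
    covered = inv_var_sum > 0
    out[covered] = 1.0 / inv_var_sum[covered]
    return out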

Second: Investigation is needed into why tweakreg affects the memory usage at this point in the calculation. The settings in that parameter reference file, currently jwst_nircam_pars-tweakregstep_0022.asdf, result in baseline memory usage a factor of 2 larger than with the coded defaults. The end result is the large, ~80GB, jump in memory usage.

Back-of-the-envelope stats, from the last drizzle to peak usage over the highlighted code:

stscijgbot-jp commented 1 year ago

Comment by James Davies [X] on JIRA:

As hinted above, there are two problems in this issue.

  1. The data arrays are garbage, which means when tweakreg is run, it finds "sources" in spurious places, matches them and ends up with ludicrous "tweaks" of 100 degree rotations, etc.

  2. This results in an output footprint of (4582, 29752) pixels, whereas the correct footprint is a 2x2 mosaic, i.e. (4635, 9937), when tweakreg is turned off (see the quick pixel count below).
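A quick pixel count for those two footprints (my own arithmetic, just to quantify the difference):

# Spurious footprint from the bogus tweaks vs. the correct 2x2 mosaic footprint:
bad = 4582 * 29752      # ~136.3 million output pixels
good = 4635 * 9937      # ~46.1 million output pixels
print(bad / good)       # ~3x more pixels per output-sized array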

That oversized footprint accounts for most of the memory usage in default DMS processing, where tweakreg is turned on and uses whatever parameter reference file the team provides. If I turn tweakreg off, so that the large spurious shifts and rotations of the WCS (from the "matched" sources) are not applied and the output image size is correct (much smaller), this association goes from 2.02GB to 27.95GB of memory on my laptop. That is the expected behavior.
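For anyone reproducing this, one way to run the association with tweakreg disabled is a step override; a sketch, assuming the association file from this ticket:

from jwst.pipeline import Image3Pipeline

# Skip tweakreg so the spurious "tweaks" cannot inflate the output footprint.
Image3Pipeline.call(
    "jw00624-o029_20210610t121017_image3_003_asn.json",
    steps={"tweakreg": {"skip": True}},
)

# Command-line equivalent:
#   strun calwebb_image3 jw00624-o029_20210610t121017_image3_003_asn.json --steps.tweakreg.skip=True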

So I feel like we need to put some guard rails on tweakwcs for automated pipeline processing, especially if we are expected to process garbage input.

Finally, even with tweakreg turned off and thus unable to generate a memory-eating output footprint, the memory usage when resampling the variance arrays is too high. This can be refactored in Python, and moving it back into the drizzle C code should also be investigated.

stscijgbot-jp commented 1 year ago

Comment by Howard Bushouse on JIRA:

A check for "ludicrously big" output frames (due to bad pointing values and bad tweakreg results) was put into the resample step some time ago. Perhaps we just need to lower the threshold at which it aborts. I also agree that some kind of sanity checks on the tweakreg results themselves are probably warranted. But if we implement checks like that, which might for example restrict rotations to less than 10-20 degrees, what happens when someone legitimately wants to align real images that are known to have very different footprints on the sky (e.g. taken in different observations)? In most cases, though, the WCS associated with each input should at least get the alignment close, and any remaining "tweaking" should be minor.
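To make that concrete, a guard of this kind is essentially a cap on the computed output pixel count before anything is allocated. A rough sketch only, with a made-up threshold rather than the value the resample step actually uses:

# Illustrative only: abort before allocating an absurdly large output frame.
MAX_OUTPUT_PIXELS = 5e8   # hypothetical threshold, not the step's real value

def check_output_frame(shape):
    ny, nx = shape
    npix = ny * nx
    if npix > MAX_OUTPUT_PIXELS:
        raise ValueError(
            f"Computed output frame {shape} has {npix:.2e} pixels; "
            "this usually indicates a bogus WCS/tweakreg solution."
        )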

stscijgbot-jp commented 1 year ago

Comment by James Davies [X] on JIRA:

Here's a screen shot of the resampled _i2d with tweakreg skipped.

[Attachment: Screen Shot 2021-07-21 at 3.14.22 PM.png]

You can see that there are no sources. Just pattern noise.

I think the fix for this is 2-pronged.

  1. Put guard rails on the maximum "tweak" allowed, say several arcseconds (see the sketch after this list).

  2. Reduce memory used in lines 216-219 as Jonathan mentions above.
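A sketch of what guard rail 1 could look like (my own illustration, not existing tweakreg code; the 3-arcsecond limit is arbitrary): compare the sky position of a reference pixel before and after the fitted tweak and reject the fit if the shift is implausibly large.

import astropy.units as u
from astropy.coordinates import SkyCoord

MAX_TWEAK = 3.0 * u.arcsec    # illustrative limit on how far a tweak may move the WCS

def tweak_is_sane(wcs_before, wcs_after, x, y):
    # gwcs objects are callable: pixel (x, y) -> world (ra, dec) in degrees.
    ra0, dec0 = wcs_before(x, y)
    ra1, dec1 = wcs_after(x, y)
    sep = SkyCoord(ra0, dec0, unit="deg").separation(SkyCoord(ra1, dec1, unit="deg"))
    return sep < MAX_TWEAK

Checking the image corners as well as the center would also catch the 100-degree rotations mentioned above.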

stscijgbot-jp commented 1 year ago

Comment by Hien Tran on JIRA:

During NO11 we saw this same error for the fileset jw00683-o001_20210720t154324_image3_002 

2021-07-24 04:52:12,111 - stpipe.Image3Pipeline.resample - INFO - Drizzling (2048, 2048) --> (8634, 19038)
----------------------------------------------------------------------
ERROR RUNNING STEP 'Image3Pipeline':
Unable to allocate 19.6 GiB for an array with shape (32, 8634,
19038) and data type float32
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/bin/strun", line 28, in <module>
step = Step.from_cmdline(sys.argv[1:])
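The reported allocation matches a single float32 cube of that oversized mosaic shape:

# 32 input planes drizzled onto the oversized (8634, 19038) mosaic, float32:
print(32 * 8634 * 19038 * 4 / 2**30)   # ~19.6 GiB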
stscijgbot-jp commented 1 year ago

Comment by Howard Bushouse on JIRA:

Fixed by #6252 and #6251

stscijgbot-jp commented 1 year ago

Comment by Brian Hayden on JIRA:

We saw an error during LRE5 that we thought was similar to this: jw00821013001_02102_00001_nrs2

https://pljwdmsweb.stsci.edu/owlgui/logs/directory_view?directory_path=/ifs/archive/ops/jwst/info/owl/logs/owlmgr_jw00821013001_02102_00001_nrs2_1628553937.767378&fileset_name=jw00821013001_02102_00001_nrs2

----------------------------------------------------------------------
ERROR RUNNING STEP 'DarkPipeline':
Unable to allocate 4.35 GiB for an array with shape (200, 1458176)
and data type complex128
----------------------------------------------------------------------

Documenting here to avoid making a whole new ticket.

stscijgbot-jp commented 1 year ago

Comment by James Davies [X] on JIRA:

The DarkPipeline error is unrelated to the issue in this ticket, as darks don't run through resample or level 3 pipelines.

I would open a new ticket (not in JP) for this, as it's a memory error in DMS and likely not an issue with the CAL software: allocating 4.35 GiB for an array with shape (200, 1458176) and data type complex128 is quite easy on my laptop and should be doable in DMS. So it's yet to be determined. The relevant traceback:

---------------------------------------------------------------------- 
ERROR RUNNING STEP 'DarkPipeline': 
Unable to allocate 4.35 GiB for an array with shape (200, 1458176) 
and data type complex128 
---------------------------------------------------------------------- 
Traceback (most recent call last): 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/bin/strun", line 28, in <module> 
step = Step.from_cmdline(sys.argv[1:]) 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/step.py", line 173, in from_cmdline 
return cmdline.step_from_cmdline(args) 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/cmdline.py", line 339, in step_from_cmdline 
step.run(*positional) 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/step.py", line 407, in run 
step_result = self.process(*args) 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/pipeline/calwebb_dark.py", line 84, in process 
input = self.refpix(input) 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/step.py", line 407, in run 
step_result = self.process(*args) 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/refpix/refpix_step.py", line 49, in process 
result = irs2_subtract_reference.correct_model(input_model, 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/refpix/irs2_subtract_reference.py", line 159, in correct_model 
data0 = subtract_reference(data0, alpha, beta, irs2_mask, 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/refpix/irs2_subtract_reference.py", line 717, in subtract_reference 
r0[k, :, :] = np.fft.ifft(r0f[k], axis=1) * normalization 
File "<__array_function__ internals>", line 5, in ifft 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/numpy/fft/_pocketfft.py", line 314, in ifft 
output = _raw_fft(a, n, axis, False, False, inv_norm) 
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/numpy/fft/_pocketfft.py", line 70, in _raw_fft 
r = pfi.execute(a, is_real, is_forward, fct) 
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 4.35 GiB for an array with shape (200, 1458176) and data type complex128 
2021222050059-E-ERROR-/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/pipeline/calwebb_dark.cfg FAILED (exit=1) on jw00821013001_02102_00001_nrs2_uncal.fits. 
2021222050059-D-DIAG-retry_index into retry interval array = 5 
2021222050059-E-ERROR-Memory Error likely. Waiting 3600 seconds before retry# 1... 
2021222060059-I-INFO-loopcount is 2
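For completeness, the reported 4.35 GiB is exactly one complex128 array of that shape, which backs up the point above that this allocation should be comfortably within reach of a DMS node:

# complex128 is 16 bytes per element:
print(200 * 1458176 * 16 / 2**30)   # ~4.35 GiB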