Closed: stscijgbot-jp closed this issue 3 years ago.
Yes, the WCSs should always be very close. Even different visits on HST have WCSs that are pretty close. Guard rails are good. I.e., we should not expect the WCS to be offset from the true WCS by more than several arcseconds, right?
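Such a guard rail could be sketched as a simple angular-separation check against the expected pointing. Everything below (the tolerance value, the function names) is an illustrative assumption, not existing pipeline code.

```python
import math

MAX_OFFSET_ARCSEC = 3.0  # assumed tolerance: "several arcsec"

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcsec between two (RA, Dec) points in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    dra = ra2 - ra1
    # Vincenty formula: numerically stable at small separations
    num = math.hypot(
        math.cos(dec2) * math.sin(dra),
        math.cos(dec1) * math.sin(dec2)
        - math.sin(dec1) * math.cos(dec2) * math.cos(dra),
    )
    den = (math.sin(dec1) * math.sin(dec2)
           + math.cos(dec1) * math.cos(dec2) * math.cos(dra))
    return math.degrees(math.atan2(num, den)) * 3600.0

def wcs_offset_ok(ra_obs, dec_obs, ra_ref, dec_ref, tol=MAX_OFFSET_ARCSEC):
    """True if the observed pointing is within `tol` arcsec of the reference."""
    return angular_sep_arcsec(ra_obs, dec_obs, ra_ref, dec_ref) <= tol
```

With this check, a ~1 arcsec pointing offset passes while an arcminute-scale offset would be flagged for review rather than silently tweaked.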
@stsci-hack I suspect you've come up against this issue in the HAP project. Have you implemented any such quality control checks when processing HST data in pipeline mode (no user intervention)?
Comment by Jonathan Eisenhamer on JIRA:
Did a bit more profiling. The offending section is resample.py:216-219. One interesting note: if outlier_detection is not run, the memory increase is more modest, only up to 20 GB from a 10 GB baseline.
Comment by Jonathan Eisenhamer on JIRA:
Finished characterizing the issue. There are two issues:
First: A potential refactor of resample.py:216-219 is in order. The cube of output images is duplicated at least twice, by numpy.stack and numpy.reciprocal. This memory usage is open-ended because it depends on the number of inputs. Other duplications also occur, resulting in a factor-of-3 increase overall.
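One way to avoid the cube-sized copies would be to accumulate the inverse-variance sum plane by plane instead of stacking all inputs and taking a reciprocal of the whole cube. This is a sketch of the idea under that assumption, not the actual resample.py code.

```python
import numpy as np

def inverse_variance_weighted_combine(var_arrays):
    """Combine per-exposure variance planes without building an input cube.

    Instead of np.stack(...) followed by np.reciprocal(...) (each of which
    materializes a full cube-sized copy), accumulate the sum of inverse
    variances one plane at a time, so peak memory stays near one output
    plane plus one input plane regardless of the number of inputs.
    """
    inv_sum = None
    for var in var_arrays:  # may be a lazy generator of 2-D planes
        with np.errstate(divide="ignore"):
            inv = np.where(var > 0, 1.0 / var, 0.0)
        if inv_sum is None:
            inv_sum = inv
        else:
            inv_sum += inv  # in-place: no new cube-sized allocation
    if inv_sum is None:
        raise ValueError("no input variance planes")
    with np.errstate(divide="ignore"):
        return np.where(inv_sum > 0, 1.0 / inv_sum, 0.0)
```

Because the accumulation is in-place, memory no longer grows with the number of input exposures.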
Second: Investigation into why tweakreg affects the memory usage at this point in the calculation. The settings in that parameter reference file, currently jwst_nircam_pars-tweakregstep_0022.asdf, result in baseline memory usage a factor of 2 larger than with the coded defaults. The end result is the large, ~80 GB, jump in memory usage.
Back-of-the-envelope stats, from the last drizzle to peak usage over the highlighted code:
Comment by James Davies [X] on JIRA:
As hinted above, there are two problems in this issue.
The data arrays are garbage, which means that when tweakreg is run, it finds "sources" in spurious places, matches them, and ends up with ludicrous "tweaks" of 100-degree rotations, etc.
This results in an output footprint of (4582, 29752) pixels, whereas the correct footprint, with tweakreg turned off, is a 2x2 mosaic, i.e. (4635, 9937).
This accounts for most of the memory usage in default DMS processing, where tweakreg is turned on and uses whatever parameter reference file is provided by the team. If I turn off tweakreg, so that large spurious shifts and rotations of the WCS are not applied and the output image size is correct (much smaller), this association uses 2.02 GB -> 27.95 GB of memory on my laptop. This is the expected behavior.
So I feel like we need to put some guard rails on tweakwcs for automated pipeline processing, especially if we are expected to process garbage input.
Finally, even with tweakreg turned off and unable to generate a memory-eating output footprint, the memory usage when resampling the variance arrays is too high. This can be refactored in Python, and then also investigated for moving back into the drizzle C code.
Comment by Howard Bushouse on JIRA:
A check for "ludicrously big" output frames (due to bad pointing values and bad tweakreg results) was put into the resample step some time ago. Perhaps we just need to lower the threshold at which it aborts. I also agree that some kind of sanity checks on the tweakreg results themselves are warranted. But if we implement checks like that, which might for example restrict rotations to less than 10-20 degrees, what happens when someone wants to legitimately align real images that are known to have very different footprints on the sky (e.g., taken from different observations)? In most cases, though, the WCS associated with each input should at least get the alignment close, and any remaining "tweaking" should be minor.
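A minimal sketch of such an output-size abort check is below. The ~100 Mpix cap is an assumed, illustrative threshold chosen so that the spurious (4582, 29752) footprint mentioned above would fail while the correct (4635, 9937) mosaic passes; the names and limits are not the resample step's actual API.

```python
MAX_OUTPUT_PIXELS = 100_000_000  # assumed cap, ~100 Mpix

def check_output_shape(shape, n_planes=1, itemsize=4,
                       max_pixels=MAX_OUTPUT_PIXELS):
    """Raise before allocating if the requested output frame is implausibly large.

    Returns the projected byte count for n_planes float32-like planes
    when the shape is acceptable.
    """
    npix = shape[0] * shape[1]
    if npix > max_pixels:
        raise ValueError(
            f"Output frame {shape} has {npix:,} pixels "
            f"(> {max_pixels:,}); WCS alignment is likely bad."
        )
    return n_planes * npix * itemsize
```

Checking before allocation turns an uncatchable multi-GiB MemoryError deep in drizzle into a clear, early diagnostic about the WCS.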
Comment by James Davies [X] on JIRA:
Here's a screenshot of the resampled _i2d with tweakreg skipped.
[Attachment: Screen Shot 2021-07-21 at 3.14.22 PM.png]
You can see that there are no sources. Just pattern noise.
I think the fix for this is two-pronged:
1. Put guard rails on the maximum "tweak" allowed, say several arcseconds.
2. Reduce memory used in lines 216-219, as Jonathan mentions above.
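The first prong might look like the following post-fit sanity check. The limit values and names are illustrative assumptions, not actual tweakreg parameters.

```python
# Hypothetical post-fit guard: reject a tweakreg solution whose fitted
# shift or rotation is implausibly large for a well-pointed observatory.
MAX_SHIFT_ARCSEC = 3.0   # assumed: "several arcsec"
MAX_ROTATION_DEG = 0.1   # assumed: real pointing errors are tiny rotations

def tweak_is_sane(shift_arcsec, rotation_deg,
                  max_shift=MAX_SHIFT_ARCSEC, max_rot=MAX_ROTATION_DEG):
    """True if the fitted WCS correction is within the guard-rail limits."""
    return abs(shift_arcsec) <= max_shift and abs(rotation_deg) <= max_rot
```

Under these limits, the ~100-degree "tweaks" produced by spuriously matched sources would be rejected and the original WCS kept, while genuine sub-arcsecond corrections pass through untouched.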
During NO11 we saw this same error for the fileset jw00683-o001_20210720t154324_image3_002:
2021-07-24 04:52:12,111 - stpipe.Image3Pipeline.resample - INFO - Drizzling (2048, 2048) --> (8634, 19038)
----------------------------------------------------------------------
ERROR RUNNING STEP 'Image3Pipeline':
Unable to allocate 19.6 GiB for an array with shape (32, 8634,
19038) and data type float32
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/bin/strun", line 28, in <module>
step = Step.from_cmdline(sys.argv[1:])
Comment by Howard Bushouse on JIRA:
Fixed by #6252 and #6251
Comment by Brian Hayden on JIRA:
We saw an error during LRE5 that we thought was similar to this: jw00821013001_02102_00001_nrs2
----------------------------------------------------------------------
ERROR RUNNING STEP 'DarkPipeline':
Unable to allocate 4.35 GiB for an array with shape (200, 1458176)
and data type complex128
----------------------------------------------------------------------
Documenting here to avoid making a whole new ticket.
Comment by James Davies [X] on JIRA:
The DarkPipeline error is unrelated to the issue in this ticket, as darks don't run through resample or the level 3 pipelines.
I would open a new ticket (not in JP) for this, as it's a memory error in DMS and likely not an issue with the CAL software: allocating 4.35 GiB for an array with shape (200, 1458176) and data type complex128 is quite easy on my laptop and should be doable in DMS. So, yet to be determined. The relevant traceback:
----------------------------------------------------------------------
ERROR RUNNING STEP 'DarkPipeline':
Unable to allocate 4.35 GiB for an array with shape (200, 1458176)
and data type complex128
----------------------------------------------------------------------
Traceback (most recent call last):
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/bin/strun", line 28, in <module>
step = Step.from_cmdline(sys.argv[1:])
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/step.py", line 173, in from_cmdline
return cmdline.step_from_cmdline(args)
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/cmdline.py", line 339, in step_from_cmdline
step.run(*positional)
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/step.py", line 407, in run
step_result = self.process(*args)
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/pipeline/calwebb_dark.py", line 84, in process
input = self.refpix(input)
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/stpipe/step.py", line 407, in run
step_result = self.process(*args)
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/refpix/refpix_step.py", line 49, in process
result = irs2_subtract_reference.correct_model(input_model,
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/refpix/irs2_subtract_reference.py", line 159, in correct_model
data0 = subtract_reference(data0, alpha, beta, irs2_mask,
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/refpix/irs2_subtract_reference.py", line 717, in subtract_reference
r0[k, :, :] = np.fft.ifft(r0f[k], axis=1) * normalization
File "<__array_function__ internals>", line 5, in ifft
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/numpy/fft/_pocketfft.py", line 314, in ifft
output = _raw_fft(a, n, axis, False, False, inv_norm)
File "/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/numpy/fft/_pocketfft.py", line 70, in _raw_fft
r = pfi.execute(a, is_real, is_forward, fct)
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 4.35 GiB for an array with shape (200, 1458176) and data type complex128
2021222050059-E-ERROR-/dms/local/jwst/pipeline/pkgs/miniconda3/envs/jwstdp-1.2.3.20210609-py38/lib/python3.8/site-packages/jwst/pipeline/calwebb_dark.cfg FAILED (exit=1) on jw00821013001_02102_00001_nrs2_uncal.fits.
2021222050059-D-DIAG-retry_index into retry interval array = 5
2021222050059-E-ERROR-Memory Error likely. Waiting 3600 seconds before retry# 1...
2021222060059-I-INFO-loopcount is 2
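For reference, the reported allocation size is easy to verify arithmetically from the array shape and dtype:

```python
# The failed allocation: shape (200, 1458176) in complex128,
# which is 16 bytes per element.
shape = (200, 1458176)
itemsize = 16  # bytes per complex128 element
gib = shape[0] * shape[1] * itemsize / 2**30
print(f"{gib:.2f} GiB")  # matches the 4.35 GiB in the traceback
```

So the error message is self-consistent; the question is why DMS could not provide ~4.4 GiB of headroom at that point.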
Issue JP-2152 was created on JIRA by Jonathan Eisenhamer:
In DMS, calwebb_image3 crashes with memory issues in resample; see attached log. The memory footprint at time of failure is ~65 GB. The command to use is:
strun calwebb_image3 jw00624-o029_20210610t121017_image3_003_asn.json
When running on a local macOS laptop, the processing completes, but the following behavior is seen: in resample, during the drizzling, as noted by the output, the memory usage does not change significantly. The implication is that something running after drizzle completes for each array consumes a massive amount of memory. The presumption is that this is at the failure line: resample.py:219.
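That presumption could be confirmed by localizing the allocation spike with the standard-library tracemalloc module, which attributes allocations to file and line. The workload below is a stand-in for the suspect section, not the pipeline code.

```python
import tracemalloc

tracemalloc.start()

# Stand-in for the suspect section (e.g. the variance handling around
# resample.py:219); here we simply hold ~50 MB of buffers alive.
data = [bytearray(10**6) for _ in range(50)]

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:5]:
    print(stat)  # top allocation sites, reported as file:line

peak = tracemalloc.get_traced_memory()[1]
print(f"peak traced: {peak / 2**20:.1f} MiB")
tracemalloc.stop()
```

Run around the resample step, the top `statistics("lineno")` entries would point directly at the offending lines rather than requiring back-of-the-envelope inference from overall process size.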