spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/
Other
570 stars 167 forks source link

resampling during outlier_detection generates unused products #8638

Closed stscijgbot-jp closed 3 weeks ago

stscijgbot-jp commented 4 months ago

Issue JP-3685 was created on JIRA by Brett Graham:

When resampling is performed during outlier detection, the underlying GWCSDrizzle generates a context array which is unused. This array can be quite large, for N 2d inputs of size H x W the context array will be of size H x W x ((N // 32) + 1). Investigate disabling this (and any other unnecessary intermediate products) to save on memory and computation during outlier detection.

stscijgbot-jp commented 1 month ago

Comment by Mihai Cara on JIRA:

The code that would allow this is pending in a couple of PRs in drizzle and jwst packages. However, there is an issue with models that creates context arrays regardless. Specifically, here is an example:

 

In [1]: from stdatamodels.jwst.datamodels import ImageModel
In [2]: m = ImageModel((5, 5))
In [3]: m.data
Out[3]:
array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]], dtype=float32)
In [4]: m.con
Out[4]:
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int32)
In [5]: m.con = None
In [6]: m.con
Out[6]:
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int32)
In [7]: del m.con
In [8]: m.con
Out[8]:
array([[0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0]], dtype=int32)

So, until data model are not modified in a way as to allow some arrays to be None, I am afraid these arrays will always re-created.

 

braingram commented 1 month ago

The con array is created when it accessed. This can be seen by looking at the instance attribute of the datamodel:

>> from stdatamodels.jwst.datamodels import ImageModel
>> m = ImageModel((5, 5))
>> 'con' in m.instance
False
>> m.con
>> 'con' in m.instance
True

There are more details in the documentation about the datamodels: https://stdatamodels.readthedocs.io/en/latest/jwst/datamodels/structure.html#the-structure-of-datamodels

mcara commented 1 month ago

Correct. But that means we cannot completely get rid of this memory: H x W x 1 is the minimum.

braingram commented 1 month ago

If we never assign to con it will never be allocated (making the minimum 0).

stscijgbot-jp commented 1 month ago

Comment by Ned Molter on JIRA:

I'm attaching memory profiling before (outlier_mihai.html) and after (outlier_mihai_2.html) the change that removed allocation of the context array.  It does appear that the context array is no longer allocated, so I think this ticket can be resolved once that PR is merged.

The new peak memory usage shows that sigma_clip is using a factor of ~4 more memory than the weight array input into it.  This does beg the question of whether that, too, can be optimized.  But I think this should be its own ticket, especially given that the offending line lives in stcal. https://github.com/spacetelescope/stcal/blob/dfe1d6d51f0c2a400dff918a52c779328ae19b7d/src/stcal/outlier_detection/utils.py#L80 

stscijgbot-jp commented 1 month ago

Comment by Ned Molter on JIRA:

Attaching two more flamegraphs here, showing the effect of my fix to AL-875 along with Mihai's PR.  The memory usage for on-disk=True decreased by 25% from ~4 GB to ~3 GB.

The file names are outlier_before_ondisk.html and outlier_AL875_mihai_ondisk.html for the before and after cases, respectively.

 

stscijgbot-jp commented 3 weeks ago

Comment by Melanie Clarke on JIRA:

Fixed by #8866