Ned Molter Can you give an example of some coron data that can't currently be processed through coron3? Have they never been able to be processed?
Comment by Ned Molter on JIRA:
Thanks to Hien Tran; see JWSTDMS-921, linked to this ticket.
Comment by Ned Molter on JIRA:
Sounds good. It looks like additional observations experiencing similar problems are listed in the comments of that ticket. Howard's comment that incorrect pointings might be making resample attempt to build a huge mosaic is probably beyond the scope of what the pipeline would handle, but it initially appears possible to avoid loading all science asn members into memory at once in calwebb_coron3 by using ModelLibrary, so that seems worth trying out.
Comment by Ned Molter on JIRA:
It appears that the reason for the high memory usage is that a 4-dimensional QuadModel (which gets saved as the psfalign product) is allocated with shape (n_science_integrations, n_psf_integrations, nrows, ncols). This gets very large: using just 2 of the 9 input PSF CubeModels, it was already taking up 500 GB of memory.
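For a rough sense of the scaling, here's an illustrative back-of-the-envelope (all shapes below are made-up stand-ins, not the actual dataset's values):

```python
import numpy as np

# Illustrative stand-in shapes, not the real dataset's values.
n_sci_ints = 1000      # science integrations
n_psf_ints = 400       # PSF integrations
nrows, ncols = 320, 320
itemsize = np.dtype(np.float32).itemsize   # 4 bytes per pixel

quad_gb = n_sci_ints * n_psf_ints * nrows * ncols * itemsize / 1e9
cube_gb = n_psf_ints * nrows * ncols * itemsize / 1e9

print(f"4-D QuadModel: {quad_gb:8.1f} GB")   # grows with n_sci * n_psf
print(f"3-D CubeModel: {cube_gb:8.3f} GB")   # one shifted-PSF cube
```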
At least at first blush, this appears fixable without fundamental changes to the processing algorithms: the QuadModel is populated with data while looping over its zeroth axis, and when it is accessed by klip, it is again iterated over the zeroth axis. It therefore appears possible to keep everything stored on disk as a library of CubeModels and load each one only when necessary, with a data volume of ~1 GB per model.
Modifying the pipeline to do this bookkeeping would be a bit tedious, and the psfalign intermediate product would be an association of CubeModels instead of a QuadModel, but it appears that substantial memory savings would be possible.
David Law does this seem worth pursuing further? Do you (or does anyone) know who I could ask questions about these algorithms, i.e., to make sure there's no fundamental reason the entire dataset needs to be in memory at once?
edit: I should say that the input PSF cubemodels are only 600 MB each, for a total of 4.5 GB. So it definitely seems like we had better be able to avoid allocating 500 GB here...
Comment by Jonathan Aguilar on JIRA:
Your proposed solution seems reasonable to me, if that ends up being the lowest-effort solution.
Speaking for MIRI (though these probably apply to NIRCam as well), there are several strategies one could use, though I haven't had time to test them. These are in my rough order of preference:
1. Use cal products, instead of calints products, as inputs to stage 3. As far as the alignment is concerned, this is equivalent to computing the mean for each exposure cube as described above. This represents a tradeoff in how the wings of the PSF are modeled: greater statistical diversity but lower SNR, vs. higher SNR but fewer samples. We would need to document this decision, and it will affect the shape of subsequent coron3 data products.
Personally, I think that Option 1 would be the best option, as it would improve the SNR on the PSF wings and speckles that are used to compute the relative offsets. However, it would only reduce the data usage during the psf_align step, not in later steps: you would still load the full 4-dimensional cube during the klip stage when the PSF model is computed. Only Option 4 gets around that problem.
Comment by David Law on JIRA:
Thanks for the input, Jonathan Aguilar. Even in the cases that don't have an Alpha-Cen level of memory usage, it sounds like there are some large performance improvements that can be realized here. I think the best thing to do at this point is to put this into the list of potential improvements to be prioritized for the build 11.2 work starting later this month (11.1 work is closing in the next 2 weeks).
Prior to that point, I'd like to tag Julien Girard and Marshall Perrin for their input. To what degree is this impacted by https://github.com/spacetelescope/jwst/pull/8643, which revised the coron3 alignment step to use just the first integration for alignment?
Comment by Marshall Perrin on JIRA:
I hadn't realized that there was already a PR (#8643) to revise the alignment step to use only the first integration; it's already merged to develop, in fact. I concur that it's directly relevant. I think it's effectively Jonathan's option #2 from his list of possible strategies.
But having not looked at the code in that PR at all myself I have to admit I'm not sure if it reduces the dimensionality sufficiently, for the reasons that Jonathan gives.
Comment by Ned Molter on JIRA:
I've been looking into this a bit more this afternoon, and I don't think the way the shift is applied is directly relevant to the main memory issue. It appears to me that the align_refs step loops over all science integrations, applying a PSF shift to each one. Each of those 3-D shifted PSF cubes is stored in memory at the same time, taking the form of a QuadModel of shape (science_integrations, psf_integrations, rows, cols). Then, in the klip step, this QuadModel is accessed only within a for loop that loops over the zeroth (science_integrations) axis. Those shifted PSFs are then discarded and only a 3-D psf-subtracted science cube is returned.
Assuming I didn't miss some way in which the computations are interrelated, there's no reason the full 4-D array needs to be held in memory: either each of the shifted PSF cubes could be stored on disk between align_refs and klip, or the two steps could be combined so that the shifted PSF cube need not be stored at all.
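A minimal sketch of that second idea, using numpy stand-ins for the real models and scipy's shift as a placeholder for the pipeline's actual alignment routine (purely illustrative):

```python
import numpy as np
from scipy.ndimage import shift  # stand-in for the pipeline's shift routine

# Illustrative stand-ins for the real science and PSF cubes.
n_sci, n_psf, nrows, ncols = 8, 4, 64, 64
sci_cube = np.random.rand(n_sci, nrows, ncols).astype(np.float32)
psf_cube = np.random.rand(n_psf, nrows, ncols).astype(np.float32)
offsets = np.zeros((n_sci, 2), dtype=np.float32)   # per-integration (dy, dx)

subtracted = np.empty_like(sci_cube)
for i in range(n_sci):
    # Shift the reference PSFs for this one integration: a 3-D cube,
    # never the full 4-D (n_sci, n_psf, nrows, ncols) array.
    shifted = np.stack([shift(plane, offsets[i]) for plane in psf_cube])
    # KLIP projection/subtraction would go here; a trivial stand-in:
    subtracted[i] = sci_cube[i] - shifted.mean(axis=0)
```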
Comment by Ned Molter on JIRA:
I agree that this should be targeted for Build 11.2, but I couldn't bring myself to pivot away from this right at the end of the day...
I made a local branch with a hack-y version of my proposed fix. I'm attaching memory profiling plots of running the MIRI regression test dataset through calwebb_coron3 on the master branch, then on my branch. The total size of the input data (PSFs + science) is just 60 MB. During the align_refs step the memory usage is ~550 MB on master and ~250 MB on my branch. Looking at the flamegraph, the 300 MB difference is due primarily to allocating 220 MB to this QuadModel. I would expect the size of this array to be ~quadratic with total integration time, since it scales linearly with both the number of science integrations and PSF integrations.
The test_miri_coron3.py regression test, which runs the same dataset, also passes on this branch.
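For anyone who wants to reproduce this kind of measurement, one possible approach (the plots above weren't necessarily produced this way; tracemalloc and the association file name here are assumptions):

```python
import tracemalloc
from jwst.pipeline import Coron3Pipeline

tracemalloc.start()
# Hypothetical association file; substitute the MIRI regression test asn.
Coron3Pipeline.call("jw01234-c1001_miri_coron3_asn.json")
current, peak = tracemalloc.get_traced_memory()
print(f"peak Python-level allocations: {peak / 1e6:.0f} MB")
tracemalloc.stop()
```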
Comment by Ned Molter on JIRA:
I now realize that #8643 is indeed relevant here: since that change, all of the planes of that 4-D PSF model are identical; that is, the same shift is applied to the same PSF for every science integration, so there's no need to save a different PSF per science integration at all (in memory or on disk). The psfalign product should therefore be a 3-D CubeModel instead of a 4-D QuadModel. The fact that it is still saved this way is in some ways a bug introduced by that PR, or at least an unintended consequence of it.
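If that's right, the redundancy should be easy to confirm on any existing psfalign product along these lines (the file name is hypothetical):

```python
import numpy as np
from stdatamodels.jwst import datamodels

# Hypothetical psfalign product; shape is (n_sci, n_psf, nrows, ncols).
with datamodels.open("jw01234_psfalign.fits") as quad:
    planes_identical = all(
        np.array_equal(quad.data[i], quad.data[0])
        for i in range(quad.data.shape[0])
    )
    print("every science-integration plane identical:", planes_identical)
```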
Jonathan Aguilar can you say more about "You would still load the full 4-dimensional cube during the klip stage when the PSF model is computed"? This is not true for the current pipeline version: all processing in the klip step is done inside a for loop over science integrations, starting on this line: https://github.com/spacetelescope/jwst/blob/c2d87f7a2c8c4e383b0b4c961d802205ca0b280f/jwst/coron/klip.py#L42
David Law Now that we're getting to the bottom of this and realizing that the memory usage is caused by hundreds of identical copies of a shifted PSF, do you think a fix for that still needs to be put off until 11.2? The fix would be straightforward and does not affect the output products, except that the psfalign intermediate product would be a CubeModel instead of a QuadModel. And, separately, should the work remain under this ticket, since it really has nothing to do with ModelLibrary?
Comment by David Law on JIRA:
Ned Molter Since it sounds like this is a straightforward and limited change rather than a general refactor of coron3, I think it's fine to look into for 11.1. Please go ahead and file a specific ticket for it. We can keep this ticket On Hold for any additional work that might happen in 11.2.
A followup on your comment about the klip stage: looking at the code I see that there's indeed a loop over science integrations, but this loop simply accesses a QuadModel that's been passed in containing the 4D stack of reference images. Presumably the full 4D cube is thus loaded all in one go?
Comment by Ned Molter on JIRA:
Thanks, I will file a separate ticket and leave this one on hold.
Yes, the entire QuadModel is loaded in one go, but in one of the first lines of the for loop the QuadModel is indexed on the zeroth axis, so the processing itself occurs on just the i-th frame of that model (and those frames are all identical).
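In other words, the access pattern is effectively the following (a simplified numpy stand-in, not the actual klip.py code):

```python
import numpy as np

# Illustrative stand-in shapes for the science cube and 4-D QuadModel.
n_sci, n_psf, nrows, ncols = 8, 4, 64, 64
sci_cube = np.zeros((n_sci, nrows, ncols), dtype=np.float32)
quad = np.zeros((n_sci, n_psf, nrows, ncols), dtype=np.float32)

for i in range(n_sci):        # loop over science integrations
    psf_refs = quad[i]        # only this 3-D slice is used per iteration...
    target = sci_cube[i]
    # ...and since quad[i] == quad[0] for every i after #8643, a single
    # 3-D cube would serve just as well as the 4-D QuadModel.
```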
Issue JP-3724 was created on JIRA by Ned Molter:
In some cases, the data volume of coronagraphic observations is large enough that they cannot be processed through calwebb_coron3. Loading all observations into memory at once should therefore be avoided if/where possible. ModelLibrary was purpose-built to handle cases like this, and its implementation in calwebb_image3 did yield substantial memory usage improvements. The same should be attempted for coronagraphy.
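For reference, a minimal sketch of the borrow/shelve access pattern that ModelLibrary provides (the association file name is hypothetical, and the on_disk behavior is assumed to match the calwebb_image3 usage):

```python
from jwst.datamodels import ModelLibrary

# Hypothetical coron3 association; on_disk=True keeps members on disk
# until they are explicitly borrowed.
library = ModelLibrary("jw01234-c1001_coron3_asn.json", on_disk=True)
with library:
    for i in range(len(library)):
        model = library.borrow(i)               # load one member
        # ... process this single model ...
        library.shelve(model, i, modify=False)  # release it from memory
```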