spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/
Other
561 stars 167 forks source link

JP-3690: Switch from ModelContainer to ModelLibrary for image3 pipeline #8683

Closed emolter closed 3 weeks ago

emolter commented 2 months ago

Resolves JP-3690 Resolves JP-3619 Resolves JP-3620 Resolves JP-3621, see JP-3707 for additional work on resample memory usage beyond the scope of this ticket. Resolves JP-3498 Work is under the epic JP-3602.

Fixes #8649 Fixes #8478 Fixes #8479 Fixes #8480 Fixes #8164

This PR is part of a larger effort to improve memory usage throughout the pipeline. Here, the AbstractModelLibrary class in stpipe is subclassed for JWST, and the pipeline steps that form the calwebb_image3 pipeline are updated to use ModelLibrary instead of ModelContainer. This facilitates ensuring that the pipeline step runs the same way whether models are loaded into memory or saved to disk.

When ModelLibrary is used with on_disk=True, memory usage is lower, both for individual pipeline steps and for calwebb_spec3 as a whole. An analysis of memory usage for each step can be found in the linked tickets JP-3619, JP-3620, and JP-3621. An analysis of the memory usage as a whole, when run on a large dataset, can be found in JP-3690.

Checklist for PR authors (skip items if you don't have permissions or they are not applicable)

codecov[bot] commented 2 months ago

Codecov Report

Attention: Patch coverage is 89.09396% with 65 lines in your changes missing coverage. Please review.

Project coverage is 60.77%. Comparing base (8381a26) to head (364cd60). Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
jwst/tweakreg/tweakreg_step.py 69.23% 20 Missing :warning:
jwst/pipeline/calwebb_image3.py 71.05% 11 Missing :warning:
jwst/assign_mtwcs/moving_target_wcs.py 82.35% 6 Missing :warning:
jwst/outlier_detection/spec.py 64.28% 5 Missing :warning:
jwst/resample/resample.py 96.40% 5 Missing :warning:
jwst/assign_mtwcs/assign_mtwcs_step.py 55.55% 4 Missing :warning:
jwst/datamodels/library.py 94.36% 4 Missing :warning:
jwst/outlier_detection/utils.py 96.10% 3 Missing :warning:
jwst/pipeline/calwebb_coron3.py 33.33% 2 Missing :warning:
jwst/pipeline/calwebb_spec3.py 0.00% 2 Missing :warning:
... and 2 more
Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #8683 +/- ## ========================================== + Coverage 60.63% 60.77% +0.14% ========================================== Files 372 373 +1 Lines 38372 38596 +224 ========================================== + Hits 23267 23458 +191 - Misses 15105 15138 +33 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

emolter commented 2 months ago

Initial regtests started here just to find (probably many) failures.

more regtests here for same reason, after lots of changes to memory allocation

new regtests after some bugfixes

another round of regtests fixing more bugs

emolter commented 1 month ago

Latest round of regression tests here. These unfortunately include 50 failures that are all related to the delivery of new MIRI reference files and other MIRI changes on the INS side. None of the failures from the previous regtest run are still present, so I'm pretty confident this PR is passing all regtests now and is ready for review

braingram commented 1 month ago

I've finished viewing and commenting on all the files in this PR. Overall I think the changes look great and the comments are mostly minor. Thanks for taking on this work! I'll follow up with comments but please let me know if there's more I can do to help to move this PR along.

emolter commented 1 month ago

new set of regtests here

braingram commented 1 month ago

I ran a 972 member association (~100GB input data) through calwebb_image3 using this PR (an "on disk" library was used as input and a slightly older commit 26e543627d9ed8e28af6475f0f3fe7f8ab7862cf). The pipeline succeeded and the recorded memory usage (using memray) was as follows:

Screenshot 2024-08-20 at 9 38 32 AM

I will add this information and a link to the memory profile to the jira ticket. The peak memory usage was 50GB and this is largely due to the context array generated during resample. Importantly for this PR, at no point does the pipeline load all input data into memory.

emolter commented 1 month ago

testing just test_nircam_image.py here, because it takes forever on my local machine

emolter commented 1 month ago

Most recent regtests for test_nircam_image only are finally passing after fixing file naming issues. The same fix should apply for the three MIRI image3 failures here. I'm going to wait on another full regtest run until after the next round of comments.

braingram commented 1 month ago

Thanks again for all your work on this. I went through the files and left a new set of comments. It looks very close to done.

emolter commented 1 month ago

New round of regression tests started here after responding to comments from Brett's second review.

edit: file name issues recurred, apparently the small bug setting pool and table names in ModelLibrary was not the cause. Need to put back in the manual setting of model.meta.filename. Starting another round here

emolter commented 1 month ago

Another round of regtests after all the nircam image and miri image failures are fixed is here. Fingers crossed that the only failures are the same ones that show up on the nightly runs

I believe that all the regression tests are unrelated

mcara commented 1 month ago

Why there is still usage for the ModelContainer? Is ModelLibrary not completely replacing ModelContainer? Is it only for backward compatibility to support ModelContainer inputs?

emolter commented 1 month ago

Why there is still usage for the ModelContainer? Is ModelLibrary not completely replacing ModelContainer? Is it only for backward compatibility to support ModelContainer inputs?

We decided to only replace container with library in the calwebb_image3 pipeline for now, where the memory performance gains are most important. There are other complications with replacing this for the spectroscopic modes too, e.g., the SourceModelContainer class that uses ModelContainer as a parent class needs to be replaced also

emolter commented 1 month ago

Another round of regression tests have been started here after fixes per Melanie's review.

Edit: lots of failing tests, but they match last night's nightly run