stscijgbot-jp opened 1 month ago
Comment by Jesse Doggett on JIRA:
I'm not sure we have the answers to your questions above. It might be good to have a meeting with Hien Tran and Eric Barron to make sure we understand the questions and how we might go about finding the answers.
Comment by Jesse Doggett on JIRA:
I started our godzilla test job. It has two associations, jw01568-o002_20240911t173952_image3_00001/2. All related files are available on tljwdmscsched2.stsci.edu. The input, output, and association files are in:
The _00001 association has completed and its log files are in:
The _00002 association just started. It was submitted last night, but we didn't have a large enough machine configured for it. We got one set up, and it's running now. Its log files are in:
Comment by Ned Molter on JIRA:
That would be great, Jesse; I'll send a message to all of you tomorrow to schedule something for early next week. And thanks for starting the test; I'll take a look tomorrow.
David, yes it would be different. Right now the default is in_memory=False, which means that (among other things) outlier detection computes the median piecewise, in sections, to save memory. I agree that the end-user experience is a concern. The other memory improvements that went along with implementing ModelLibrary will help somewhat, but for larger associations it is still certainly easy to go over 16 GB with in_memory=True. On the flip side, runtimes are faster when everything is in memory, so that option might be preferred when users have relatively small associations.

If, theoretically, we did want the parameter to default to False in ops but True for an end user, is that possible to do?
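For concreteness, here is a minimal sketch (not an official ops recipe) of how the same pipeline could be run with different in_memory settings in different contexts. It assumes the jwst package's Image3Pipeline accepts in_memory as a call-time parameter override; the association filename is illustrative.

```python
# Minimal sketch, assuming Image3Pipeline accepts in_memory as a parameter
# override at call time; the association filename below is illustrative.
from jwst.pipeline import Image3Pipeline

# End user with plenty of RAM: hold everything in memory for a faster run.
Image3Pipeline.call(
    "jw01568-o002_20240911t173952_image3_00001_asn.json",
    in_memory=True,
)

# Ops (or a memory-limited user): keep the on-disk behavior explicitly.
# The same override could live in a parameter file instead of the call.
Image3Pipeline.call(
    "jw01568-o002_20240911t173952_image3_00001_asn.json",
    in_memory=False,
)
```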
Comment by Ned Molter on JIRA:
A group of us, including Brett Graham, Tyler Pauly, Jesse Doggett, Eric Barron, and Hien Tran, just had a preliminary discussion about this. The take-aways were:
Have been following along on this thread and others, as well as a brief discussion with Brett G. As Ned mentions in the comment from 9/16, we will have to live with the fact that the model is likely going to underestimate memory for a fair number of datasets that now use more than they used to, and plan to retrain the model on data reprocessed with the new code changes and get that into the next release.
One thought I have (not sure how feasible this is, or whether it's even a priority): the only way around this would be to estimate the rate of increased memory usage incurred by the code change (in_memory=True). Then that rough estimate could be added on top of the model prediction to act as a buffer. In other words, the model predicts 80 GB for dataset X, and we can calculate (roughly) that dataset X is going to use 20-50 GB more than it used to because of in_memory=True, so we put it on the next biggest node instead. Whereas dataset Y is estimated at 25 GB, so adding 20-50 GB more to that still lands on the same node size (100 GB) as before. That would theoretically prevent a lot of memory fails/rescues from occurring during that interim period.
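A toy illustration of that buffer idea; the node sizes and the 20-50 GB buffer range are just the numbers from this comment, not real ops configuration.

```python
# Illustrative only: pick a node size from the ML model's memory prediction
# plus a rough buffer for the extra usage expected from in_memory=True.
# Node sizes and buffer range are hypothetical, taken from the comment above.
NODE_SIZES_GB = [8, 16, 32, 64, 100, 200, 400]

def choose_node(predicted_gb, buffer_high_gb=50):
    """Return the smallest node that fits prediction + worst-case buffer."""
    needed = predicted_gb + buffer_high_gb
    for size in NODE_SIZES_GB:
        if size >= needed:
            return size
    return NODE_SIZES_GB[-1]

# Dataset X: model says 80 GB -> 80 + 50 = 130 GB -> bumped to a 200 GB node.
print(choose_node(80))   # 200
# Dataset Y: model says 25 GB -> 25 + 50 = 75 GB -> still fits a 100 GB node.
print(choose_node(25))   # 100
```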
Comment by Ned Molter on JIRA:
Ru Kein, Tyler Pauly: I just attached a script to this ticket that attempts to estimate the memory usage of outlier detection and resample without running those steps. At present it is able to estimate the memory usage to within, say, +/-30% of the actual usage. The way it works is to figure out the size of the resampled array based on the s_regions of all the inputs plus a pixel scale, and then account for all allocations by hand, which is tractable because all the important ones are integer multiples of either the input data size or the output data size.
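To make the approach concrete, here is a rough sketch of the idea. It is not the attached script: the footprint handling is simplified (no projection or cos(Dec) correction), and the copy multipliers are placeholders rather than the script's actual bookkeeping.

```python
# Sketch of the estimation idea: derive the resampled output size from the
# combined footprint of the inputs plus a pixel scale, then model peak memory
# as a few whole copies of the input and output arrays.
import numpy as np

BYTES_PER_PIXEL = 4  # float32

def estimate_output_shape(s_regions, pixel_scale_deg):
    """Bounding-box (ny, nx) of all input footprints at the given pixel scale.

    s_regions: list of (N, 2) arrays of (RA, Dec) footprint vertices in
    degrees, e.g. parsed from each input's S_REGION keyword. Projection
    effects are ignored, so this is only an approximation.
    """
    vertices = np.vstack(s_regions)
    dra = vertices[:, 0].max() - vertices[:, 0].min()
    ddec = vertices[:, 1].max() - vertices[:, 1].min()
    return (int(np.ceil(ddec / pixel_scale_deg)),
            int(np.ceil(dra / pixel_scale_deg)))

def estimate_peak_memory_gb(input_shape, output_shape,
                            n_input_copies=2, n_output_copies=3):
    """Peak usage modeled as integer multiples of the two array sizes.

    The multipliers are placeholders; the real script tallies the actual
    allocations made by outlier detection and resample.
    """
    input_bytes = np.prod(input_shape) * BYTES_PER_PIXEL
    output_bytes = np.prod(output_shape) * BYTES_PER_PIXEL
    return (n_input_copies * input_bytes + n_output_copies * output_bytes) / 1e9
```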
This is definitely a work in progress, and it's also a moving target: some open pull requests will modify the memory usage of these steps. However, I think this is much more straightforward, and, I'm guessing, more accurate, than using a machine learning algorithm, and I'm hoping that after the Build 11.2 delivery these steps will be in a more stable state. Please let me know what you think, and whether this appears useful enough for us to discuss next steps.
I encourage you to try this out on your own datasets. I'm sure there are bugs and ways to improve the accuracy of the estimate.
(note: you will need to install specific branches of stcal and jwst to get this to run: instructions in the docstring at the top)
(note: you can disable the plot by setting save_plot="")
Ned Molter, are you able to provide a list of the datasets you tested this on, along with the estimated vs. actual memory usage?
Comment by Ned Molter on JIRA:
I've only tested this on one dataset so far, which is a subset of a large NIRCam mosaic. See the attached asn json file. [^small_asn.json]
The output from the script for in_memory=False mode was:
Estimated peak memory usage for OutlierDetectionStep: 4.987201224677027 GB
Estimated peak memory usage for ResampleStep: 11.490969524246395 GB
True peak memory usage for OutlierDetectionStep: 4.020512127317488 GB
True peak memory usage for ResampleStep: 10.445950175635517 GB
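For reference, those numbers correspond to overestimates of roughly 24% and 10%, comfortably inside the +/-30% bound mentioned above:

```python
# Quick check of estimate vs. measured usage using the numbers reported above.
estimated = {"OutlierDetectionStep": 4.987201224677027,
             "ResampleStep": 11.490969524246395}
measured = {"OutlierDetectionStep": 4.020512127317488,
            "ResampleStep": 10.445950175635517}
for step in estimated:
    err = (estimated[step] - measured[step]) / measured[step]
    print(f"{step}: overestimate by {err:+.0%}")
# OutlierDetectionStep: overestimate by +24%
# ResampleStep: overestimate by +10%
```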
Please note, it looks like we are going to change the API of git+https://github.com/emolter/stcal@AL-837 before it is merged to main. I will provide an updated script at that time. Sorry that this is still in flux, but I wanted to share my progress anyway so we can decide how to move forward.
Issue JP-3744 was created on JIRA by Ned Molter:
In a discussion with Jesse Doggett it was realized that ops typically prefers to process everything in memory when possible, even if it requires a high-memory node, because of potential issues with temporary file I/O.
In light of this, it may make sense to make in_memory=True the default for calwebb_image3, and in particular outlier detection. The effects of this would be:

- [...] set in_memory to False to allow these to be processed successfully.

Some questions we had for Hien Tran or other members of Team Coffee are:

- [...] the in_memory=False mode of calwebb_image3 relies on reading and writing lots of temporary files, so if there's no fast I/O this will cause long runtimes.