spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/
Other
570 stars 167 forks source link

Re-evaluate usages of ModelContainer in light of ModelLibrary #8724

Open stscijgbot-jp opened 2 months ago

stscijgbot-jp commented 2 months ago

Issue JP-3715 was created on JIRA by Ned Molter:

ModelLibrary is now implemented for all steps relevant to calwebb_image3, but the ModelContainer class is still used in many locations throughout the pipeline, especially in e.g. calwebb_spec3 and its steps.  The two classes are meant to do the same thing, i.e., manipulate associations, so now is the time to evaluate:

stscijgbot-jp commented 2 months ago

Comment by Ned Molter on JIRA:

I've run into a few issues I need clarity on in order to finish this.  Tagging Melanie Clarke Brett Graham Tyler Pauly Nadia Dencheva, not sure who else should be in this discussion.

stscijgbot-jp commented 2 months ago

Comment by Brett Graham on JIRA:

I'm not opposed to keeping ModelContainer if it's easier for users. It would be good to get some feedback on what users currently do with ModelContainer and if those operations would work easily with ModelLibrary before committing to keeping a container that could otherwise be removed from the pipeline.

If we do keep ModelContainer we should remove the confusing and dangerous options (save_open, return_open, etc) as those can result in silent overwriting of input files. I believe with your PR we can also remove get_sections and the metadata patching. However we would need to consider what we expect to happen if a user tries to feed a ModelContainer into (for example) image3. Do we treat it as a list of models? Do we read the association data? Do we repeat the metadata patching? We could disallow this and require the use to save an association and the models before calling the pipeline but as there is no way to save a ModelContainer that writes the association this seems more confusing and annoying than just using ModelLibrary.

I have not looked at the non-image3 pipelines. Although I did hear that coron3 associations have failed processing in ops due to resource limits on what they called "mecha-godzilla" nodes (with 1 TB ram and 1 TB swap).

Unfortunately, I don't know the history of SourceModelContainer.

stscijgbot-jp commented 2 months ago

Comment by Melanie Clarke on JIRA:

Stripping ModelContainer down to just behave like a simple list of models + metadata that can be called in a with context sounds like a good option to me.

It also sounds fine to me to disallow ModelContainer as input to image3, or anywhere else that a simple list of models isn't appropriate.  I think it's well understood that the input to stage 3 pipelines needs to be an association file. I agree that if complex metadata or memory handling is needed, it's better to just use a ModelLibrary.

stscijgbot-jp commented 2 months ago

Comment by Ned Molter on JIRA:

After our discussion today, it sounds like everyone was in rough agreement that we should at least try out slimming down ModelContainer instead of removing it entirely. Ticket JP-3721 has been created to track potential changes to ModelContainer.  

This ticket needs to be decreased in scope too.  It now covers replacing ModelContainer with ModelLibrary in calwebb_spec3 in places where this would be substantially beneficial.  We need to figure this out for individual steps on a case-by-case basis.

stscijgbot-jp commented 2 months ago

Comment by Mihai Cara on JIRA:

I do not see what's wrong with ModelContainer (besides some confusing parameters like save_open or something like that). I think trying to get it to work as a context manager will just complicate things and I thought we wanted it for simplicity of use.

Ideally, IMO, ModelContainer should return (or have an option to return) a list of dictionaries like this:


[{'file_name': 'file.fits', 'asn_meta': {'group_id': 1, 'sky_value': 2, ...}}, ...]```
stscijgbot-jp commented 2 months ago

Comment by Ned Molter on JIRA:

Mihai Cara It sounds like the information you want already mostly exists in the plaintext of the .json file.  If that is desired as a dict, it's accessible from the current ModelContainer (container.meta.asn_table) or from ModelLibrary (library.asn).

Please see the draft specifications I laid out in the ticket specific to slimming down ModelContainer at JP-3721.

I think if the desired behavior is to keep models on disk and only load them into memory when necessary, ModelLibrary should typically be used.  If you want to do it a different way for some reason, the filenames can be read from the json file by whatever means you wish.  My view is that the new-look ModelContainer should basically just be a list of datamodels in memory and some association metadata like the exposure types, along with convenience methods to e.g. split models by exposure type or group_id.  The justification for this is primarily that it's easy for users (and developers, when memory usage is not a concern) - there's no need to worry about opening/closing datamodels, and the container can be used just as a regular Python list would.

ModelContainer already works as a context manager in its current form because it inherits from DataModel.  The inheritance would be removed in my proposed re-work of ModelContainer, but context management is still desired because ModelContainer would still be the default way to open association-type data with datamodels.open().  The only thing the context management would do is to close all the models when exiting the with context.

Let's please continue this discussion on https://jira.stsci.edu/browse/JP-3721