Re-evaluate usages of ModelContainer in light of ModelLibrary

stscijgbot-jp commented 2 months ago

Issue JP-3715 was created on JIRA by Ned Molter:

ModelLibrary is now implemented for all steps relevant to calwebb_image3, but the ModelContainer class is still used in many locations throughout the pipeline, especially in calwebb_spec3 and its steps. The two classes are meant to do the same thing, i.e., manipulate associations, so now is the time to evaluate:

- What should ModelContainer look like now that ModelLibrary handles associations while being careful with memory usage? Should it be removed?
- Where in the pipeline can ModelContainer be replaced with simple lists of datamodels, and where is it useful to have the association metadata?
- Are there additional workflows/pipelines other than calwebb_image3 where saving memory by avoiding loading all models into memory would be useful, i.e., where ModelLibrary should be implemented?
- Can SourceModelContainer be removed from the pipeline and deprecated?

stscijgbot-jp commented 2 months ago

Comment by Ned Molter on JIRA:

I've run into a few issues I need clarity on in order to finish this. Tagging Melanie Clarke, Brett Graham, Tyler Pauly, and Nadia Dencheva; I'm not sure who else should be in this discussion.

1. With ModelContainer going away, what should be the default data structure into which we load association-type input? ModelLibrary is the only one that would hold the datamodels and their metadata, so it's an obvious choice, but Melanie was of the opinion (and I think there's something to this) that it would be annoying and confusing to users to have to work with a structure that doesn't act like a list the way ModelContainer did. She suggested loading the models into a simple list by default, but there are two potential issues with that, too. First, the asn metadata is lost; we can get around that by loading a list and a dict. Second, lists can't be used in a with context, meaning that the typical suggested workflow of `with dm.open(file_name) as model:` would no longer work with asn-type input. Thoughts?
2. Can someone give me a brief history of the SourceModelContainer? It appears that it must load all models into memory in order to be instantiated, and it's not obvious how to get around that for a ModelLibrary version, which means that a SourceModelLibrary would be just a fancy list of SlitModels that exists so that the syntax is the same when processing MultiSlitModels through the pipeline. Maybe that's okay, but I'm also wondering whether I should play with turning these into simple lists instead.
3. Related to both of the above, we have the option to process spectroscopic data through calwebb_spec3 by passing ModelLibrary between all the steps, or by simply passing a list of models (it used to be ModelContainer, of course). On one hand, the ModelLibrary may give us similar memory improvements, and the pipeline flow would be easier because outlier detection and resampling already expect ModelLibrary. On the other hand, lists might be easier for users, and most of the other steps could function just fine with one or more lists as input.
4. Another option that should be taken seriously is to retain ModelContainer and withdraw this ticket. ModelContainer does everything we would want for calwebb_spec3 already, and that pipeline is not running into memory usage issues the way calwebb_image3 is.
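The two concerns about plain lists raised in the first item can be illustrated with a minimal, standard-library-only sketch. Everything here is an assumption made for illustration: `FakeModel` is a hypothetical stand-in for a datamodel, and `load_as_list_and_dict` and `supports_with` are made-up helpers, not functions in jwst or stdatamodels. The sample association is reduced to a handful of keys.

```python
class FakeModel:
    """Hypothetical stand-in for a datamodel; not a real jwst class."""

    def __init__(self, filename):
        self.filename = filename


def load_as_list_and_dict(asn):
    """Return (models, asn_meta): a plain list plus the association metadata."""
    members = [m for product in asn["products"] for m in product["members"]]
    models = [FakeModel(m["expname"]) for m in members]
    # Keep every top-level key except the product/member tree as metadata.
    asn_meta = {key: value for key, value in asn.items() if key != "products"}
    return models, asn_meta


def supports_with(obj):
    """True if the object implements the context manager protocol."""
    return hasattr(obj, "__enter__") and hasattr(obj, "__exit__")


# Illustrative association content only.
asn = {
    "asn_type": "spec3",
    "products": [
        {
            "name": "product_a",
            "members": [
                {"expname": "a_cal.fits", "exptype": "science"},
                {"expname": "b_cal.fits", "exptype": "background"},
            ],
        }
    ],
}

models, asn_meta = load_as_list_and_dict(asn)
print([m.filename for m in models], asn_meta, supports_with(models))
```

The metadata survives only because it is carried in a second object, and the plain list still cannot be handed to a `with` statement.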

stscijgbot-jp commented 2 months ago

Comment by Brett Graham on JIRA:

I'm not opposed to keeping ModelContainer if it's easier for users. It would be good to get some feedback on what users currently do with ModelContainer and if those operations would work easily with ModelLibrary before committing to keeping a container that could otherwise be removed from the pipeline.

If we do keep ModelContainer, we should remove the confusing and dangerous options (save_open, return_open, etc.), as those can result in silent overwriting of input files. I believe with your PR we can also remove get_sections and the metadata patching. However, we would need to consider what we expect to happen if a user tries to feed a ModelContainer into (for example) image3. Do we treat it as a list of models? Do we read the association data? Do we repeat the metadata patching? We could disallow this and require the user to save an association and the models before calling the pipeline, but as there is no way to save a ModelContainer that writes the association, this seems more confusing and annoying than just using ModelLibrary.

I have not looked at the non-image3 pipelines, although I did hear that coron3 associations have failed processing in ops due to resource limits on what they called "mecha-godzilla" nodes (with 1 TB RAM and 1 TB swap).

Unfortunately, I don't know the history of SourceModelContainer.

stscijgbot-jp commented 2 months ago

Comment by Melanie Clarke on JIRA:

Stripping ModelContainer down to just behave like a simple list of models + metadata that can be called in a with context sounds like a good option to me.

It also sounds fine to me to disallow ModelContainer as input to image3, or anywhere else that a simple list of models isn't appropriate. I think it's well understood that the input to stage 3 pipelines needs to be an association file. I agree that if complex metadata or memory handling is needed, it's better to just use a ModelLibrary.

stscijgbot-jp commented 2 months ago

Comment by Ned Molter on JIRA:

After our discussion today, it sounds like everyone was in rough agreement that we should at least try out slimming down ModelContainer instead of removing it entirely. Ticket JP-3721 has been created to track potential changes to ModelContainer.

This ticket needs to be decreased in scope too. It now covers replacing ModelContainer with ModelLibrary in calwebb_spec3 in places where this would be substantially beneficial. We need to figure this out for individual steps on a case-by-case basis.

stscijgbot-jp commented 2 months ago

Comment by Mihai Cara on JIRA:

I do not see what's wrong with ModelContainer (besides some confusing parameters like save_open or something like that). I think trying to get it to work as a context manager will just complicate things and I thought we wanted it for simplicity of use.

Ideally, IMO, ModelContainer should return (or have an option to return) a list of dictionaries like this:


```
[{'file_name': 'file.fits', 'asn_meta': {'group_id': 1, 'sky_value': 2, ...}}, ...]
```
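One way such a list of dictionaries could be derived from an association that has already been parsed into a dict is sketched below. This is a rough illustration, not an existing ModelContainer feature: `flatten_association` is a hypothetical helper, and the sample association carries only the keys needed for the example (`group_id` and `sky_value` from the comment above are not assumed to be present in the file).

```python
def flatten_association(asn):
    """Flatten an association dict into one record per member (hypothetical helper)."""
    records = []
    for product in asn.get("products", []):
        for member in product.get("members", []):
            # Everything except the file name is treated as per-member metadata.
            meta = {key: value for key, value in member.items() if key != "expname"}
            records.append({"file_name": member["expname"], "asn_meta": meta})
    return records


# Illustrative association content only.
example_asn = {
    "asn_type": "image3",
    "products": [
        {
            "name": "mosaic",
            "members": [
                {"expname": "file1_cal.fits", "exptype": "science"},
                {"expname": "file2_cal.fits", "exptype": "science"},
            ],
        }
    ],
}

records = flatten_association(example_asn)
for record in records:
    print(record)
```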

stscijgbot-jp commented 2 months ago

Comment by Ned Molter on JIRA:

Mihai Cara, it sounds like the information you want already mostly exists in the plaintext of the .json file. If that is desired as a dict, it's accessible from the current ModelContainer (container.meta.asn_table) or from ModelLibrary (library.asn).

Please see the draft specifications I laid out in the ticket specific to slimming down ModelContainer at JP-3721.

I think if the desired behavior is to keep models on disk and only load them into memory when necessary, ModelLibrary should typically be used. If you want to do it a different way for some reason, the filenames can be read from the json file by whatever means you wish. My view is that the new-look ModelContainer should basically just be a list of datamodels in memory and some association metadata like the exposure types, along with convenience methods to e.g. split models by exposure type or group_id. The justification for this is primarily that it's easy for users (and developers, when memory usage is not a concern) - there's no need to worry about opening/closing datamodels, and the container can be used just as a regular Python list would.
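A very rough sketch of what such a list-like container with grouping helpers might look like follows. It is only an illustration under stated assumptions, not the JP-3721 design: `SlimContainer` and `split_by` are hypothetical names, and `SimpleNamespace` objects stand in for datamodels that are already in memory.

```python
from collections import defaultdict
from types import SimpleNamespace


class SlimContainer(list):
    """Hypothetical sketch: an in-memory list of models plus association metadata."""

    def __init__(self, models=(), asn_meta=None):
        super().__init__(models)
        self.asn_meta = dict(asn_meta or {})

    def split_by(self, attribute):
        """Group the contained models by the value of one attribute."""
        groups = defaultdict(list)
        for model in self:
            groups[getattr(model, attribute)].append(model)
        return dict(groups)


# Stand-ins for datamodels; the attribute names are assumptions for this example.
models = [
    SimpleNamespace(filename="a_cal.fits", exptype="science", group_id="g1"),
    SimpleNamespace(filename="b_cal.fits", exptype="science", group_id="g2"),
    SimpleNamespace(filename="c_cal.fits", exptype="background", group_id="g2"),
]

container = SlimContainer(models, asn_meta={"asn_type": "spec3"})
by_exptype = container.split_by("exptype")
by_group = container.split_by("group_id")

# The container still behaves like a regular Python list.
print(len(container), container[0].filename, sorted(by_exptype), sorted(by_group))
```

Subclassing `list` is the simplest way to get indexing, slicing, and iteration for free in a sketch; a real implementation might prefer composition or `collections.abc.Sequence` to control which list operations are allowed.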

ModelContainer already works as a context manager in its current form because it inherits from DataModel. The inheritance would be removed in my proposed re-work of ModelContainer, but context management is still desired because ModelContainer would still be the default way to open association-type data with datamodels.open(). The only thing the context management would do is to close all the models when exiting the with context.
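The close-on-exit behavior described here does not require inheriting from DataModel. As a minimal sketch, assuming a hypothetical `ClosingList` class and a `FakeModel` stand-in that only records whether `close()` was called:

```python
class FakeModel:
    """Hypothetical stand-in for a datamodel; it only tracks close() calls."""

    def __init__(self, filename):
        self.filename = filename
        self.closed = False

    def close(self):
        self.closed = True


class ClosingList(list):
    """Hypothetical list that closes every contained model when the with block ends."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        for model in self:
            model.close()
        # Returning False lets any exception raised inside the block propagate.
        return False


with ClosingList([FakeModel("a_cal.fits"), FakeModel("b_cal.fits")]) as models:
    open_inside = [not model.closed for model in models]

closed_after = [model.closed for model in models]
print(open_inside, closed_after)
```

Inside the block the models are still open; once the block exits, every model has been closed, which is the only thing the context management is meant to do in this proposal.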

Let's please continue this discussion on https://jira.stsci.edu/browse/JP-3721

spacetelescope / jwst

Re-evaluate usages of ModelContainer in light of ModelLibrary #8724