spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/
Other
565 stars 167 forks source link

Fix inconsistencies and problems in DrizProductModel and its use #4507

Closed stscijgbot closed 4 years ago

stscijgbot commented 4 years ago

Issue JP-1258 was created by Howard Bushouse:

We have several tickets open for problems that all trace back, in one way or another, to shortcomings or inconsistencies in the structure and content of the datamodels.DrizProductModel.

The most troublesome problem is the lack of certain meta data (keywords) not being defined in the DrizProductModel schema that are needed by some pipeline steps that use a DrizProductModel as input. These problems reported in JP-870 and JP-1144. A related ticket, JP-880, concerns meta data that is not or cannot be copied over from a DrizProductModel.

The schema for DrizProductModel also needs cleaning up, in order to remove ad hoc includes of sub-schemas like "lev3_prod.schema.yaml". The meta data contained in that schema should be reorganized to either move it to other schemas (e.g. core.schema) or create a more appropriate sub-schema that is relevant to all resampled products.

Related to this is simply the naming of DrizProductModel and the associated MultiProductModel. In particular, the name "MultiProductModel" is not very informative. Just what kind of "Product" is being referred to? We have lots of "products" created by the jwst pipeline. Both of these models should perhaps be renamed to something more obvious and suitable, such as "ResampledModel" and "MultiResampleModel" (other suggestions are welcome). And perhaps we may even need a model dedicated exclusively to resampled spectral data, because it typically requires lots of additional meta data (slit-related info) that is irrelevant to resampled image data.

stscijgbot commented 4 years ago

Comment by Jane Morrison: [~jdavies] [~bushouse] I am going to try and tackle this ticket. Before I start adding to the DrizProductModel to add slit name information (JP-870) I wondered if you thought first if we should create a DrizProduct for spectral data - SpecResample or something like that. Also is the only place the MultiProductModel is used is in spectral data ? Doing a grep of MultiProductModel in the jwst directory brings up resample_spec_step.py, extract_1d.py. I am just wondering how best to start this ticket. If we reorganize DrizProductModel -> Two models: ResampleModel & SpecResampleModel MultiProductModel -> MultiSpecResampleModel Something like that ?

philhodge commented 4 years ago

Couldn't we drop DrizProductModel and just use MultiProductModel? I would like to see fewer models, if possible. Also, I don't see why we would want separate models for spectral data and imaging data, if they both contain 2-D (or 3-D) arrays.

jdavies-st commented 4 years ago

The only difference between an ImageModel and a DrizProductModel is the former has

├─meta (dict)
│ └─model_type (str): ImageModel
├─data (ndarray): shape=(10, 10), dtype=float32
├─dq (ndarray): shape=(10, 10), dtype=uint32
└─err (ndarray): shape=(10, 10), dtype=float32

and the latter

├─meta (dict)
│ └─model_type (str): DrizProductModel
├─data (ndarray): shape=(10, 10), dtype=float32
├─con (ndarray): shape=(10, 10), dtype=int32
└─wht (ndarray): shape=(10, 10), dtype=float32

But there's no reason that con and wht couldn't be optional arrays in ImageModel. DrizProductModel also has some extra schema items from lev3_prod.schema.yaml, but again, most of these are not used, or not populated, and those that are could be in the core schema.

If that solution is acceptable, the same could be done for folding con and wht as optional arrays into MultiSlitModel.

Finally, there might not be a need for con and wht eventually. It would be much better to have wht converted into a useful err array and do the weighting by inverse variance (or some global, non-pixel-to-pixel approximation for now), which would then mean err and wht would be the same.

For resampled single detector images, the wht and ctx arrays are meaningless, and it would be much better to use the existing err and dq arrays in ImageModel. Just drizzle a mask formed from the dq array and the err array, which of course does not account for correlated noise, but would at least be a lower-bound placeholder until we figure out how to do this properly.

Anyway, just some thoughts.

jdavies-st commented 4 years ago

Also agree with Phil. I'd like to see fewer models if possible.

stscijgbot commented 4 years ago

Comment by Howard Bushouse: While I like the elegance of the proposal by [~jdavies] to just add con and wht as optional arrays to the existing ImageModel and MultiSlitModel, I (just my personal opinion) also like the idea of having at least one model that is explicitly for resampled data. That way it's obvious as soon as the data are loaded into the model that you're working with resampled data (just based on the model type). You don't have to check other indicators, such as whether the resample or resample_spec steps have been applied or whether the con and wht arrays exist.

In addition to making con and wht optional in ImageModel, you'd also then have to make err and dq optional, so that they don't get populated for resampled data. And right now I believe all the models are setup to always automatically inflate and populate the err and dq arrays, even if they don't exist in the input data. So we'd want to change that (no need to be carrying around extra unused baggage).

But the thought of optional attributes in general has made me realize that [~hodge] is probably right, that we don't necessarily need separate models for resampled imaging and spectroscopic data. They both use the same data arrays; the only difference is the meta data they carry around. The individual meta data attributes are optional, hence the ones dealing with slit-related info just wouldn't be used (would be set to None) for imaging data and would not result in any keywords in output files.

So how about something like a generic ResampModel (or MultiResampModel)? For multi-slit cases it does need to contain multiple slit instances, but does that mean we necessarily have to have the phrase Multi in the name? Couldn't we just build the generic ResampModel to allow for multiple instances and only populate 1 of them when we're working with an image or a single slit (like the NIRSpec BrightObj case that currently has problems with DrizProductModel)?

stscijgbot commented 4 years ago

Comment by Jane Morrison: I vote for a generic 'ResampModel' that can have multiple instances and only populate 1 when working with image or single slit data and all multiple instances for MultiSlit.

How about I copy the DrizProdModel to ResampModel, remove the lev3_prod.schema.yaml and remove the meta data that is not used in this schema and move the meta data that is used to a logical place - either core schema or maybe even in ResampModel. I am a little unsure of what to do with tweakreg_catalog and source_catalog in this schema. Should that go into the core schema ? If the generic ResampModel sounds reasonable I will first tackle the image case, then the FS case and then multi slit. After each rework I will submit PR for review and comment. Ok ?

stscijgbot commented 4 years ago

Comment by Jane Morrison: Looking at the MultIProduct.schema - there are many values (bunit_data, name, xstart) that are usually in a 'meta' in a schema. I am wondering since we want a generic ResampModel to work as a replacement for DriZProduct and MultiProduct if I should first create a ResampModel with data, wht, con and a meta containing a combination of what is in lev3_prod.schema and multiproduct.schema

stscijgbot commented 4 years ago

Comment by Howard Bushouse: Regarding the meta attributes in lev3_prod.schema, first do a global search of the code base to find out which ones are actually used (or not), so that we can get rid of the unused ones. I think some of them were put in there way back when the drizzle routine was first ported over to the jwst pipeline (from HST). So all of the keywords that are populated by the HST-based routine were just put in there in case we wanted to use them, and many haven't been used. I think pretty much all of the ones that are used can just be moved over to core. schema. We might want to create a new sub-section in the core schema for these level-3/combined product meta attributes.

stscijgbot commented 4 years ago

Comment by Jane Morrison: I have created a single MultiResampModel in the schema for this basically copied the drizproduct.schema and then added all the meta data from multiproduct and lev3_prod. Later I can pull some of this out to the core schema. But now I need to actually make multiresamp.py file - that defines the MultiResampModel. The idea is if it is for Imager data for FS there will be a single MultiResamp Model. If it multiple slits then there are multiple instances of MultiResampModel. I think this exactly how the IFUCube is done. There can be a container of IFUCubes. I have just gotten confused on the comments in the current multiprod.py (in data models) "This model has a special member 'products' that can be used to deal with each DrizProduct at a time. It behaves like a list.

Do we want it to be list or a ModelContainer for the multi slit case ?

stscijgbot commented 4 years ago

Comment by Jane Morrison: [~jdavies] [~bushouse] When I create the MultiResamp Data Model do I want to keep they way it was done in MultiProduct- with 'products' ? Or reconfigure how resample_spec uses this model and make a container of MultiResamp Data. Which then means I would need to change extract 1D to read it (but that should be easy).

Maybe I should push over what I have done so far to PR.

stscijgbot commented 4 years ago

Comment by Howard Bushouse: Regarding the list vs. ModelContainer question, I think it should be structured the same way MultiProductModel was structured, which I believe is also the same way that MultiSlitModel is structured. We want all data for one of these instances contained within a single data model, not a ModelContainer. ModelContainer is for a higher level of grouping, namely across multiple exposures.

Not sure what to call the multiple instances within the model. In MultiSlitModel they're called "slits". I guess we could at least start with "product" for now, just to get something to play with, but that name is too generic, IMHO, since everything is a product. If we define and use a ResampModel for cases where there's only one instance of data, which is the case for all resampled imaging data, and then use MultiResampModel for cases where there are multiple instances of data, the only time MultiResampModel should get used is for spectra, in which case the term "slits" is probably fine to use here (same as in MultiSlitModel). I'm thinking it would be best to have a dedicated ResampModel for the imaging case, where there's only one data instance, instead of having to modify all the existing imaging code to use the zeroth instance of a MultiResampModel.

philhodge commented 4 years ago

Why not call it "slits", so downstream code won't have to figure out what it's called.

stscijgbot commented 4 years ago

Comment by Jane Morrison: Just to be clear. We used to have a DrizProduct (which was for imager data) and a MutiProd model which was for for multiple slits. So I understood the original plan to merge these to into 1 data model. But I am not sure we have ended up with that idea. Instead it sounds like we are just renaming the DrizProduct to ResampModel (and removing the lev3_prod schema by either including it in ResampModel or core schema) ) And renaming the MultiProd model to MultResampleModel (we can change products to slits to make it clear). Does this summarize what we need ?

stscijgbot commented 4 years ago

Comment by Howard Bushouse: As nice as it would be to keep the number of data models to a minimum and hence just use MultiResampModel for all cases, I think it's also beneficial to have a simple model like ResampModel for those cases (i.e. imaging) where you don't need anything fancy and the code doesn't have to keep referring to model.slit[0] everywhere. So, in my opinion, I would like to see a ResampModel and a MultiResampModel. The sub-instances within MultiResampModel can be called "slits", as [~hodge] suggested. The name makes sense, because the MultiResampModel should only ever be used for spectroscopic data where you have slit instances.

stscijgbot commented 4 years ago

Comment by Jane Morrison: Howard please clarify when you say "just use 'MultiResampModel' for all cases " ........ seems to not track with "I would like to see a ResampModel and MultiResampModel".

If you want two models then I should ignoring the first statement ? So in all the image Resample code I will remove the DrizProd and any mention of multiprod and replace it with ResampModel. and then in resample_spec I will used the MultIResamModel. Correct me if I have that wrong. But somehow I think you just want the MultiResampModel to represent both. But if we do that then I don't know how to not have the 2D case have the same problem of referring to first array index: model.slit[0]

I might have to arrange a phone call on this one because I keep hitting a wall on how this is supposed to work

stscijgbot commented 4 years ago

Comment by Howard Bushouse: Sorry for not being more clear in my wording. What I meant to say is that it's theoretically possible to just use MultiResampModel for all cases, but I don't really like doing that, because we'd have to modify all the existing imaging code to use "model.slits[0]", instead of just a simple "model". So that's why I'm in favor of still having 2 separate data models:

"ResampModel", which is essentially just a reworked version of "DrizProductModel" that has the name changed and the meta data cleaned up and fixed (i.e. add all the slit-related keywords that are currently in MultiProductModel, because there are some cases where we need them), and will be a direct drop-in replacement for DrizProductModel in all code that currently uses DrizProductModel

"MultiResampModel", which is essentially just a reworked version of "MultiProductModel" that has the name changed, changes the name of the "products" instances to "slits", and again has any necessary meta data clean up done (I don't think there's too much meta data clean up necessary in this one - MultiProductModel already had pretty much everything we needed)

stscijgbot commented 4 years ago

Comment by Jane Morrison: I added a slitmeta.schema and updated the multiresamp and revamp schemas.

[~jdavies] could you check out PR4552 and see if I have set up the slitmeta.schema correctly in resamp.schema and multiresamp.schema. Also could you review the multiresamp.py data model I am just guess on how to set this up.

jdavies-st commented 4 years ago

DrizProductModel and MultiProductModel definitely cannot be merged into the same model. That is impossible given the different structures of the models. DrizProd is like an ImageModel MultiProd is like a MultiSlitModel.

As for your summary, yes that sounds good generally.

I think the reason we went with DrizProductModel is that the only difference between this model and the normal ImageModel is the lack of DQ and ERR and the addition of WHT and CTX. And that is solely because the drizzle algorithm cannot make DQ and ERR arrays correctly, and it outputs these others which are sorta useful. So while ResampledModel is more descriptive, it's not accurate. There's no reason for general resampled data, we can't use ImageModel as is. We ought not to be subclassing just to know that an image has been resampled. We can set attributes for that, and in fact already have an attribute set of something has gone through the resample step.

As for a name for the resampled multi-spec data, again, the reason we don't just use MultiSlitModel is because of drizzle algorithm not making DQ and ERR arrays. That's the only reason.

I don't know if these are good enough reasons to make new datamodels rather than just updating ImageModel and MultSlitModel with some optional WHT and CTX arrays.

stscijgbot commented 4 years ago

Comment by Jane Morrison: [~jdavies] [~hodge] [~bushouse] I updated the PR for this work, updating image.schema and slit data.schema.

I still have the resampmeta.schema and the slitmeta.schema. Resampmeta.schema was created from lev3_prod.schema and meta in drizproduct.schema. The data is set up in a 'meta' form which is how it was done in two original schemas. slitmeta.schema was pulled out of slit data.schema. The original data in slit data.schema was not in a 'meta' form. So if we leave it like this I will need to update the code and add meta. I wonder if we want to put the resampmeta and slitmeta into core schema instead. Added a section in core schema for both of them ? If we keep they way I have set up slitmeta.schema I am not sure if I set it up correctly in slit data.schema. Is it included before or after properties ?

stscijgbot commented 4 years ago

Comment by Jane Morrison: Updated PR to update image model and slit data model.

stscijgbot commented 4 years ago

Comment by Jonathan Eisenhamer: Note: At least make an announcement that models have disappeared. May consider putting in a deprecation.

stscijgbot commented 4 years ago

Comment by Jane Morrison: I am looking at how MIRIRampModel was set to Deprecated. It is done in ramp.py because probably before ramp.py included both the RampModel and MIRIRampModel. Since I have deleted the data models drizproduct and multiproduct can I put the deprecation warning in another datamodels or do I need to bring back drizproduct.py and multiproduct.py just to put the deprecation warning ?

stscijgbot commented 4 years ago

Comment by Jane Morrison: [~jdavies] I am looking at how MIRIRampModel was set to Deprecated. It is done in ramp.py because probably before ramp.py included both the RampModel and MIRIRampModel. Since I have deleted the data models drizproduct and multiproduct can I put the deprecation warning in another datamodels or do I need to bring back drizproduct.py and multiproduct.py just to put the deprecation warning ?

stscijgbot commented 4 years ago

Comment by Jonathan Eisenhamer: I would say bring them back completely, only because if someone is using them, they still can, but get the deprecation.

stscijgbot commented 4 years ago

Comment by Howard Bushouse: Fixed by [https://github.com/spacetelescope/jwst/pull/4552]