spacetelescope / jwst

Python library for science observations from the James Webb Space Telescope
https://jwst-pipeline.readthedocs.io/en/latest/
Other
564 stars 167 forks source link

allow reference file overrides to be data models in scripts #2246

Open stscijgbot opened 6 years ago

stscijgbot commented 6 years ago

Issue JP-345 was created by Misty Cracraft:

While writing a pytest for dq_init to use as an example for my team, I tried to call the pipeline step by passing in a science data model and a reference file data model as an override. I got the error below:

E jwst.stpipe.config_parser.ValidationError: Config parameter 'override_mask': the value "<MaskModel(1024, 1032) from jwst_miri_mask_0021.fits>" is of the wrong type.

After conferring with James Davies, we discovered that the override portion of the code is expecting a string that points to the reference file (path and filename) and it crashed when I tried to pass it a data model. So I had to read in my reference file, make modifications to it, write it back out as a file, and then read the modified file in as the file override. This seems like a lot of unnecessary work, and it would make it easier for the testers and future users if they could read in a data model for the reference file override as well.

 

 

 

jdavies-st commented 5 years ago

So there is a possiblity that this could be done via a custom validator in jwst.stpipe.config_parser. We already have custom validators for two file-like types:

    if validator is None:
        validator = Validator()
        validator.functions['input_file'] = _get_input_file_check(root_dir)
        validator.functions['output_file'] = _get_output_file_check(root_dir)

It would be straightforward to add a datamodel type that would be either a path to a file or an open datamodel in memory. One would just need to write an appropriate validator.

That would mean that

    override_mask = string(default=None)

configspec would become

    override_mask = datamodel(default=None)

and datamodel would be something that could be passed to datamodels.open().

stscijgbot commented 5 years ago

Comment by Todd Miller: [~cracraft]  we're starting to look at supporting this better but it may revolve around a larger stpipe refactoring.   Nevertheless,  can you share the code you have  to write now?  

stscijgbot commented 5 years ago

Comment by Misty Cracraft: At this point, I've re-written the unit tests that I was working on when I filed the ticket. We got around the problem by creating reference file models and feeding them into the individual portions of the code rather than the full step (dq_initialization rather than DQInitStep). I can write out a sample of what I was trying to do (as it will be useful for future tests), but I don't currently have any remaining examples of code that didn't work.

stscijgbot commented 5 years ago

Comment by Todd Miller: OK,  thanks.    If an example is easy and you have time,  please do.   Otherwise I think SCSB needs to come up with a recommended approach.  It's better though to know exactly what you were trying to do to make sure we address it or come up with something better.

 

jdavies-st commented 5 years ago

Here's a simple use case that currently fails. Does this represent roughly what you were trying to do?

In [1]: from jwst import datamodels

In [8]: from jwst.dq_init import DQInitStep

In [9]: input = datamodels.Level1bModel((1, 1, 2048, 2048))

In [10]: mask = datamodels.MaskModel((2048, 2048))

In [11]: mask
Out[11]: <MaskModel(2048, 2048)>

In [12]: result = DQInitStep.call(input, override_mask=mask)
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-12-62cacea1646b> in <module>
----> 1 result = DQInitStep.call(input, override_mask=mask)

~/dev/jwst/jwst/stpipe/step.py in call(cls, *args, **kwargs)
    529                 config, name=name, config_file=config_file)
    530         else:
--> 531             instance = cls(**kwargs)
    532         return instance.run(*args)
    533 

~/dev/jwst/jwst/stpipe/step.py in __init__(self, name, parent, config_file, _validate_kwds, **kws)
    286             spec = self.load_spec_file()
    287             kws = config_parser.config_from_dict(
--> 288                 kws, spec, root_dir=dirname(config_file or ''))
    289 
    290         if name is None:

~/dev/jwst/jwst/stpipe/config_parser.py in config_from_dict(d, spec, root_dir)
    194 
    195     if spec:
--> 196         validate(config, spec, root_dir=root_dir)
    197     else:
    198         config.walk(string_to_python_type)

~/dev/jwst/jwst/stpipe/config_parser.py in validate(config, spec, section, validator, root_dir)
    255 
    256         if len(messages):
--> 257             raise ValidationError('\n'.join(messages))
    258     finally:
    259         config.main.configspec = orig_configspec

ValidationError: Config parameter 'override_mask': the value "<MaskModel(2048, 2048)>" is of the wrong type.
stscijgbot commented 5 years ago

Comment by Todd Miller: [~cracraft]  

An example use case from James,  let me know if that's close enough.

In [8]: from jwst.dq_init import DQInitStep

In [9]: input = datamodels.Level1bModel((1, 1, 2048, 2048))

In [10]: mask = datamodels.MaskModel((2048, 2048))

In [11]: mask Out[11]: <MaskModel(2048, 2048)>

In [12]: result = DQInitStep.call(input, override_mask=mask)

ValidationError Traceback (most recent call last)

in ----> 1 result = DQInitStep.call(input, override_mask=mask) ~/dev/jwst/jwst/stpipe/step.py in call(cls, *args, **kwargs) 529 config, name=name, config_file=config_file) 530 else: --> 531 instance = cls(**kwargs) 532 return instance.run(*args) 533 ~/dev/jwst/jwst/stpipe/step.py in __init__(self, name, parent, config_file, _validate_kwds, **kws) 286 spec = self.load_spec_file() 287 kws = config_parser.config_from_dict( --> 288 kws, spec, root_dir=dirname(config_file or '')) 289 290 if name is None: ~/dev/jwst/jwst/stpipe/config_parser.py in config_from_dict(d, spec, root_dir) 194 195 if spec: --> 196 validate(config, spec, root_dir=root_dir) 197 else: 198 config.walk(string_to_python_type) ~/dev/jwst/jwst/stpipe/config_parser.py in validate(config, spec, section, validator, root_dir) 255 256 if len(messages): --> 257 raise ValidationError('\n'.join(messages)) 258 finally: 259 config.main.configspec = orig_configspec ValidationError: Config parameter 'override_mask': the value "" is of the wrong type.
stscijgbot commented 5 years ago

Comment by Misty Cracraft: Ok, here's a sample of my dq_init tests. This works if I use the modified line > outfile = do_dqinit(dm_ramp, ref_data) instead of > outfile = DQInitStep(dm_ramp, override_mask = ref_data). But this tests just the substep and not the full step that sets up the files.

And I just saw the test case from James. I get the same type of error when I try to run the code below with the DQInitStep being called.

 

{color:#000080}from {color}jwst.dq_init {color:#000080}import {color}DQInitStep {color:#000080}from {color}jwst.dq_init.dq_initialization {color:#000080}import {color}do_dqinit

def test_dqim():     # Check that PIXELDQ is initialized with the information from the reference file.     # test that a flagged value in the reference file flags the PIXELDQ array     # read in reference mask file_

    # size of integration     __     nints = 1     ngroups = 5     xsize = 1032     ysize = 1024

    # create raw input data for step     __     dm_ramp = make_rawramp(nints, ngroups, ysize, xsize)

    # create a MaskModel for the dq input mask

    __     dq, dq_def = make_maskmodel(ysize, xsize)

    # edit reference file with known bad pixel values

        dq[100, 100] = 2   # Dead pixel         dq[200, 100] = 4   # Hot pixel     _     dq[300, 100] = 8   # Unreliableslope         dq[400, 100] = 16  # RC         dq[500, 100] = 1   _# Do_notuse         dq[100, 200] = 3   # Dead pixel + do not use         dq[200, 200] = 5   # Hot pixel + do not use         dq[300, 200] = 9   # Unreliable slope + do not use         dq[400, 200] = 17  # RC + do not use

    # write mask model     __     ref_data = MaskModel(dq=dq, dq_def=dq_def)     ref_data.meta.instrument.name = 'MIRI'     **     ref_data.meta.subarray.xstart = 1     ref_data.meta.subarray.xsize = xsize     ref_data.meta.subarray.ystart = 1     ref_data.meta.subarray.ysize = ysize

    _# run dodqinit

    _#outfile = do_dqinit(dm_ramp, refdata)     _outfile = DQInitStep(dm_ramp, override_mask = ref_data)_     print(outfile.err.shape)     dqdata = outfile.pixeldq

    # assert that the pixels read back in match the mapping from ref data to science data     __     assert(dqdata[100, 100] == dqflags.pixel['DEAD'])     assert(dqdata[200, 100] == dqflags.pixel['HOT'])     assert(dqdata[300, 100] == dqflags.pixel['UNRELIABLE_SLOPE'])     assert(dqdata[400, 100] == dqflags.pixel['RC'])     assert(dqdata[500, 100] == dqflags.pixel['DO_NOT_USE'])     assert(dqdata[100, 200] == 1025)     assert(dqdata[200, 200] == 2049)     assert(dqdata[300, 200] == 16777217)     assert(dqdata[400, 200] == 16385)

 

 

def makemaskmodel(ysize, xsize):     # create a mask model for the dqinit step     _     csize = (ysize, xsize)     dq = np.zeros(csize, dtype=int)     # how do we define a dqdef extension?     __     mask = MaskModel()

    dqdef = [(0, 1, 'DO_NOT_USE', 'Bad Pixel do not use'),              (1, 2, 'DEAD', 'Dead Pixel'),              (2, 4, 'HOT', 'Hot pixel'),              (3, 8, 'UNRELIABLE_SLOPE', 'Large slope variance'),              (4, 16, 'RC', 'RC pixel'),              (5, 32, 'REFERENCE_PIXEL', 'Reference Pixel')]

    dq_def = np.array((dqdef), dtype=mask.dq_def.dtype)

    __     return dq, dq_def

 

def makerawramp(nints, ngroups, ysize, xsize):     # create the data and groupdq arrays_     __     csize = (nints, ngroups, ysize, xsize)     data = np.full(csize, 1.0)

    # create a JWST datamodel for MIRI data     __     dm_ramp = MIRIRampModel(data=data)     dm_ramp.meta.subarray.xstart = 1     dm_ramp.meta.subarray.xsize = xsize     dm_ramp.meta.subarray.ystart = 1     dm_ramp.meta.subarray.ysize = ysize

    return dm_ramp

stscijgbot commented 5 years ago

Comment by Todd Miller: Perfect.   Addressing this may take some time,  but now we know (on this ticket) what we should ideally support.    Thanks!

 

stscijgbot commented 5 years ago

Comment by Todd Miller: A possible approach for minimizing "collateral damage" occurred to me.

The basic issue for collateral damage is that the Step.get_reference_file() method returns a filepath,  not a model,  a key interface with almost every _step.py module.  Switching to returning an open model would therefore potentially affect every _step.py and can't be done easily for reftypes with no model.

One possible work around would be to invent a "file handle string" to use in place of a filename for model-based overrides.   The override model would be associated with the handle string in datamodels and when a constructor is run on the handle string,  datamodels returns an open shallow copy(?) of the original model,  much like when a constructor is called on an open model.

A handle might be something like:  memory://FlatModel-9c4d0201-3a6e-466a-82ca-c30d706eb27a.

With this implementation,  no temporary file or other file system footprint is nominally used/needed/supported.

James' original idea on github of using a config validator function for a filename or model for overrides would then both validate and create the datamodels association for the handle.   get_reference_file() would return the handle.

 

stscijgbot commented 5 years ago

Comment by Todd Miller: [discussed but rejected]

A possible approach for minimizing "collateral damage" occurred to me.

The basic issue for collateral damage is that the Step.get_reference_file() method returns a filepath,  not a model,  a key interface with almost every _step.py module.  Switching to returning an open model would therefore potentially affect every _step.py and can't be done easily for reftypes with no model.

One possible work around would be to invent a "file handle string" to use in place of a filename for model-based overrides.   The override model would be associated with the handle string in datamodels and when a constructor is run on the handle string,  datamodels returns an open shallow copy of the original model,  much like when a constructor is called on an open model.

A handle might be something like:  memory://FlatModel-9c4d0201-3a6e-466a-82ca-c30d706eb27a.

With this implementation,  no temporary file or other file system footprint is nominally used/needed/supported.

James' original idea on github of using a config validator function for a filename or model for overrides would then both validate and create the datamodels association for the handle.   get_reference_file() would return the handle.

{color:#FF0000}After discussing several things became clear:{color}

Unlike the way datamodels.open() returns a shallow copy of a model,  for override handles the datamodels would need to be augmented by a registry mapping the handle to a model.  While weak references might help with this,  there was general reservation about creating a registry due to the risk of memory leaks.

Carefully reviewing the control flow of several XXX_step.py modules revealed that the comparison of a model to 'N/A' should work as well as the comparison of a filepath string to 'N/A'.   So the change in method signature for get_reference_file(),  while less than ideal,  is not as toxic as it at first sounds.

The most common control flow is to pass the reference file path returned by get_reference_file() into a Model subclass constructor which also works using an existing model.

The changes to stpipe required to support override models should not break any code when override models are not being used.

Considering all these things,  it's very possible that limited changes to stpipe and/or the crds_client module can enable overrides with low risk for existing use patterns based on filepath strings.

 

 

jdavies-st commented 8 months ago

Initially resolved by #3514

But there's still a bug, as when Step.prefetch_references = True, which is the default, one gets:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/miniconda3/envs/jwst-dev/lib/python3.11/site-packages/stdatamodels/properties.py:350, in ObjectNode.__getattr__(self, attr)
    349 try:
--> 350     val = self._instance[attr]
    351 except KeyError:

KeyError: 'strip'

During handling of the above exception, another exception occurred:

AttributeError                            Traceback (most recent call last)
Cell In[15], line 1
----> 1 pipe("jw01345002001_14201_00003_nrcalong_uncal.fits")

File ~/dev/stpipe/src/stpipe/step.py:500, in Step.run(self, *args)
    498 else:
    499     if self.prefetch_references:
--> 500         self.prefetch(*args)
    501     try:
    502         step_result = self.process(*args)

File ~/dev/stpipe/src/stpipe/step.py:616, in Step.prefetch(self, *args)
    614 # prefetch truly occurs at the Pipeline (or subclass) level.
    615 if len(args) and len(self.reference_file_types) and not self.skip:
--> 616     self._precache_references(args[0])

File ~/dev/stpipe/src/stpipe/pipeline.py:269, in Pipeline._precache_references(self, input_file)
    265 try:
    266     with self.open_model(
    267         input_file, asn_n_members=1, asn_exptypes=["science"]
    268     ) as model:
--> 269         self._precache_references_opened(model)
    270 except (ValueError, TypeError, OSError):
    271     self.log.info("First argument %s does not appear to be a model", input_file)

File ~/dev/stpipe/src/stpipe/pipeline.py:290, in Pipeline._precache_references_opened(self, model_or_container)
    287         self._precache_references_opened(contained_model)
    288 else:
    289     # precache a single model object
--> 290     self._precache_references_impl(model_or_container)

File ~/dev/stpipe/src/stpipe/pipeline.py:328, in Pipeline._precache_references_impl(self, model)
    324 how = "Override" if reftype in ovr_refs else "Prefetch"
    325 self.log.info(
    326     "%s for %s reference file is '%s'.", how, reftype.upper(), refpath
    327 )
--> 328 crds_client.check_reference_open(refpath)

File ~/dev/stpipe/src/stpipe/crds_client.py:80, in check_reference_open(refpath)
     75 def check_reference_open(refpath):
     76     """Verify that `refpath` exists and is readable for the current user.
     77 
     78     Ignore reference path values of "N/A" or "" for checking.
     79     """
---> 80     if refpath != "N/A" and refpath.strip() != "":
     81         with open(refpath, "rb"):
     82             pass

File ~/miniconda3/envs/jwst-dev/lib/python3.11/site-packages/stdatamodels/properties.py:353, in ObjectNode.__getattr__(self, attr)
    351 except KeyError:
    352     if schema == {}:
--> 353         raise AttributeError("No attribute '{0}'".format(attr))
    355     val = _make_default(attr, schema, self._ctx)
    356     if val is not None:

AttributeError: No attribute 'strip'

Clearly prefetching for overrides that are open datamodels doesn't make sense. Prefetching any overrides perhaps doesn't make sense.

It is likely this was original fixed but has since been broken again, as the test that tests all of this calls the process() method instead of run(). That skips prefetch.

https://github.com/spacetelescope/jwst/blob/726f4ce553accde3c3e0367ecbc6e2986c7b2bb2/jwst/stpipe/tests/test_overrides.py#L86-L97

Of course this code now lives in stpipe, and will have to be fixed there.