Improve performance of `DataModel` reading for single (or simple) metadata access

braingram commented 4 months ago

There are use cases (like the jwst resample step) where loading a single keyword from many DataModel containing fits files may be useful. As the number of files might be very large and opening every model might exceed reasonable amounts of RAM it will be important to have a performant way to perform these simple keyword accesses.

Using meta.wcsinfo.s_region (contained in the ASDF extension) as an example there are a few ways this keyword can be read:

1) Using stdatamodels.jwst.datamodels.open:

with dm.open(fn) as m:
    return m.meta.wcsinfo.s_region

2) Using stdatamodels.asdf_in_fits:

with asdf_in_fits.open(fn) as af:
    return af['meta']['wcsinfo']['s_region']

3) Using astropy.io.fits and asdf.open:

with fits.open(fn) as ff:
    bs = io.BytesIO(ff['ASDF'].data.tobytes())
    with asdf.open(bs) as af:
        return af['meta']['wcsinfo']['s_region']

4) Using astropy.io.fits and asdf.util.load_yaml:

with fits.open(fn) as ff:
    bs = io.BytesIO(ff['ASDF'].data.tobytes())
    tree = asdf.util.load_yaml(bs)
    return tree['meta']['wcsinfo']['s_region']

Of the 4 options above 1-3 are similar in performance (using both a ImageModel and a larger IFUImageModel as test files). With performance being limited primarily by asdf.open (more on that below). 4 is much faster in both cases. See the below table for performance (run with cProfile so slightly slower than real).

	ImageModel	IFUImageModel
dm.open	0.245	3.945
asdf_in_fits	0.137	3.815
asdf.open	0.143	3.820
load_yaml	0.036	0.332

The table shows it's ~10x faster to use load_yaml as this skips:

validation
mapping fits based on the schema
loading arrays
conversion from asdf flavored yaml to custom python objects

Although all of the above is public API it seems worthwhile to investigate wrapping 4 as a helper function in stdatamodels (with sufficient documentation about how this is dangerous).

Below is a snakeviz generated graph of the call to dm.open for the IFUImageModel data file showing the bulk of the time spent in asdf.open:

nden commented 4 months ago

Although all of the above is public API it seems worthwhile to investigate wrapping 4 as a helper function in stdatamodels (with sufficient documentation about how this is dangerous).

I agree having this function would be very useful. What dangers do you foresee?

braingram commented 4 months ago

Thanks for taking a look at this issue. Most of the dangers are that it bypasses all of the stdatamodels and asdf machinery. This means:

no validation of the tree (although this was validated on write)
no checking that the tree matches the fits data (even via the FITS_HASH, so if a user modifies a fits header the modification will not be seen with load_yaml)
no filling of defaults or reference resolution (although I'm not sure either of these are used)
no conversion from tagged to custom objects (so many objects, like arrays, will come through as dictionaries)
no mapping of arrays to fits data

I think for the s_region use case all of the above are ok. Are there other places in jwst where this might be useful?

perrygreenfield commented 4 months ago

If one were concerned with validation, the option could be to do it the current way (though I would not make that the option for operations since the file should have been validated, and no sneaky, invalid updates should have been done). But I agree that an easy-to-use version of 4) should be provided. I can see mining of the files meta data is extremely useful, particularly out of operations and should be as efficient as possible.

braingram commented 3 months ago

ModelContainer in jwst also provides a models_grouped method that groups models based on a small set of metadata keywords: https://github.com/spacetelescope/jwst/blob/master/jwst/datamodels/container.py#L466 These keywords are all strings in the primary fits header. Related to this issue, it would be useful to have an efficient way to read these keywords without incurring the overhead of traversing the schema. These attributes are duplicated in the ASDF extension, however the values in the fits header should be preferred (as is already done for datamodels.open).

It would be great if the efficient metadata access code could also load multiple fits keywords.

spacetelescope / stdatamodels

Improve performance of `DataModel` reading for single (or simple) metadata access #284