spacetelescope / stdatamodels

https://stdatamodels.readthedocs.io
Other
5 stars 25 forks source link

link FITS_rec instances to hdu extensions on save #178

Closed braingram closed 1 year ago

braingram commented 1 year ago

DataModel converts recarrays to FITS_rec on assignment these were not correctly linked to FITS extensions on save resulting in duplication of FITS_rec data on save.

This adds a test for the duplication and some book keeping to allow linking the converted FITS_recs and extensions on save.

Astropy creates a 'view' of the FITS_rec when assigned to a new FITS extension which makes a simple 'is' check complicated (so the strategy used for ndarray and NDArrayType could not be used). Instead, the links between FITS_rec and extensions are tracked (via the FitsContext) which allows creation of the links without having re-find the hdu created from the FITS_rec. (The existing strategy is left in place for ndarray to maintain compatibility with existing ASDF-in-FITS behavior which does not require the same schema structure as the DataModels).

Resolves JP-3277

Regression tests running: https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FJWST-Developers-Pull-Requests/detail/JWST-Developers-Pull-Requests/776/pipeline/202/

Docs failures are unrelated (see: https://github.com/spacetelescope/stdatamodels/pull/180)

Checklist

braingram commented 1 year ago

I'm seeing a few failures in the regression tests: https://plwishmaster.stsci.edu:8081/blue/organizations/jenkins/RT%2FJWST-Developers-Pull-Requests/detail/JWST-Developers-Pull-Requests/776/tests

[stable-deps] test_nircam_tsgrism_stage2[calints] – jwst.regtest.test_nircam_tsgrism1m 14s
[stable-deps] test_nircam_tsgrism_stage2[extract_2d] – jwst.regtest.test_nircam_tsgrism1s
[stable-deps] test_nircam_tsgrism_stage2[flat_field] – jwst.regtest.test_nircam_tsgrism2s
[stable-deps] test_nircam_tsgrism_stage2[o012_crfints] – jwst.regtest.test_nircam_tsgrism<1s
[stable-deps] test_nircam_tsgrism_stage2[srctype] – jwst.regtest.test_nircam_tsgrism1s
[stable-deps] test_nircam_tsgrism_stage2[x1dints] – jwst.regtest.test_nircam_tsgrism<1s
[stable-deps] test_nircam_tsgrism_stage3_x1dints – jwst.regtest.test_nircam_tsgrism

@hbushouse @tapastro are these expected failures?

An example error is:

     Headers contain differences:
       Headers have different number of cards:
        a: 228
        b: 234
       Extra keyword 'R_DFLAT' in b: 'N/A'
       Extra keyword 'R_FFLAT' in b: 'N/A'
       Extra keyword 'R_SFLAT' in b: 'N/A'
       Inconsistent duplicates of keyword ''      :
        Occurs 41 time(s) in a, 44 times in (b)
       Keyword         [24] has different values:
          a> Nirspec FORE Model reference file information
           ? ^^^^^^^^ --------
          b> DFlat reference file information
           ? ^  ++
       Keyword         [25] has different values:
          a> Nirspec FPA Model reference file information
           ? -------- ^^^^^^^
          b> FFlat reference file information
           ?  ^ ++
       Keyword         [26] has different values:
          a> Gain reference file information
           ? ^ ^^
          b> SFlat reference file information
           ? ^^^ ^
       Keyword         [27] has different values:
          a> IFU fore reference file information
           ? ^ ^^^ ^
          b> Nirspec FORE Model reference file information
           ? ^^^^^^^^ ^^^^^ ^ +
       Keyword         [28] has different values:
          a> IFU post reference file information
          b> Nirspec FPA Model reference file information
       Keyword         [29] has different values:
          a> IFU slicer reference file information
           ? ^^^^^^ ^^^
          b> Gain reference file information
           ? ^^ ^
       Keyword         [30] has different values:
          a> Linearity reference file information
           ? ^^^ -----
          b> IFU fore reference file information
           ? ^^^^^^^
       Keyword         [31] has different values:
          a> Mask reference file information
           ? ^^ ^
          b> IFU post reference file information
           ? ^^^^^^ ^
       Keyword         [32] has different values:
          a> Nirspec MSA Model reference file information
          b> IFU slicer reference file information
       Keyword         [33] has different values:
          a> Nirspec OTE Model reference file information
          b> Linearity reference file information
       Keyword         [34] has different values:
          a> Photometric reference file information
           ? ^^^^^^^^^^^
          b> Mask reference file information
           ? ^^^^
       Keyword         [35] has different values:
          a> Read noise reference file information
          b> Nirspec MSA Model reference file information
       Keyword         [36] has different values:
          a> Regions reference file information
          b> Nirspec OTE Model reference file information
       Keyword         [37] has different values:
          a> Saturation reference file information
           ? ^^ ^ -- ^^
          b> Photometric reference file information
           ? ^^^ ^^^^  ^
       Keyword         [38] has different values:
          a> Spectral distortion reference file information
           ? ^^ --- --   ^^^^^^^
          b> Read noise reference file information
           ? ^   +++  ^
       Keyword         [39] has different values:
          a> Superbias reference file information
           ? ^^^ ^^ ^
          b> Regions reference file information
           ? ^ ^ ^^
       Keyword         [40] has different values:
          a> Wavelength Range reference file information
          b> Saturation reference file information
       Keyword         [41] has different values:
          a> Calibration step information
          b> Spectral distortion reference file information
hbushouse commented 1 year ago

The regtest errors are expected only in the sense that they've been happening for a while now, but we can't figure out why. But regardless, they have nothing to do with this PR, so you're good.

braingram commented 1 year ago

I think this should address JP-3277 but it probably makes sense to test this PR on one of the failures mentioned on the ticket. However, I don't know how to do that :) I'm happy to learn (if that makes sense) but would need to know how to find the data and run the failing step.

tapastro commented 1 year ago

I can give it a try. I'll install this PR version of stdatamodels and test it out. But if you want to try, feel free to install jwst in your environment, and you may be able to follow the repo Wiki for regression testing locally and use the tso regression test data to verify?

codecov[bot] commented 1 year ago

Codecov Report

Patch coverage: 94.44% and project coverage change: +0.04 :tada:

Comparison is base (065ec6e) 63.90% compared to head (9d30b03) 63.95%.

:exclamation: Current head 9d30b03 differs from pull request most recent head eb516d5. Consider uploading reports for the commit eb516d5 to get more accurate results

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #178 +/- ## ========================================== + Coverage 63.90% 63.95% +0.04% ========================================== Files 101 101 Lines 5555 5565 +10 ========================================== + Hits 3550 3559 +9 - Misses 2005 2006 +1 ``` | [Impacted Files](https://app.codecov.io/gh/spacetelescope/stdatamodels/pull/178?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spacetelescope) | Coverage Δ | | |---|---|---| | [src/stdatamodels/fits\_support.py](https://app.codecov.io/gh/spacetelescope/stdatamodels/pull/178?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=spacetelescope#diff-c3JjL3N0ZGF0YW1vZGVscy9maXRzX3N1cHBvcnQucHk=) | `90.27% <94.44%> (-0.01%)` | :arrow_down: |

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

braingram commented 1 year ago

I ran the test_niriss_soss_stage3_crfints regtest locally (this isn't a perfect fit for the runs mentioned in the ticket which appear to use NIRSPEC data but this test does run the 'calwebb_tso3' pipeline).

Looking at the data during test with and without the changes in this PR...

Without this PR, jw01091002001_03101_00001-seg001_nis_short_x1dints.fits, contains duplicated data (30 EXTRACT1D and 1 INT_TIMES extensions that are not referenced from the ASDF extension which contains 31 embedded binary blocks).

With this PR the same files contains the same EXTRACT1D and INT_TIMES extensions however these are linked in the ASDF extension (which contains no binary blocks).

hbushouse commented 1 year ago

@tapastro Have you had a chance to do independent testing of this fix?

tapastro commented 1 year ago

I ran a single spec2 association which used to produce a 440MB x1dints file, where the asdf extension was ~223 MB. This code produced a 250MB file with a 32MB asdf extension. Seems like it did a good job! I don't understand how to check if any of the arrays are linked or reproduced in the extension, though.

braingram commented 1 year ago

Thanks for checking! That definitely looks promising.

There's no 'official' way to check the links. What I've been doing is similar to what is in the test added with this PR: https://github.com/spacetelescope/stdatamodels/blob/fbd97cc7ab5a675d36e42db25218f8bffec8f494/tests/test_fits.py#L667-L677

In brief, when stdatamodels links an item in the ASDF tree to a fits extension it changes the data 'source' stored in the ASDF tree to a string like fits:EXTRACT1D,1. This differs from a non-linked array/table which will have a 'source' that is a number corresponding to the index of the ASDF binary block (in this case embedded in the ASDF extension).

So a normal array will appear something like this in the ASDF yaml.

lookup_table: !core/ndarray-1.0.0
  source: 0
  datatype: float32
  byteorder: big
  shape: [2048, 2048]

Whereas a linked array will appear something like:

dq: !core/ndarray-1.0.0
  source: fits:DQ,1
  datatype: uint32
  byteorder: big
  shape: [10, 256, 2048]

(note that the above 2 examples were both taken from the same '*_nis_short_calints.fits' file so a file can have both linked and non-linked arrays/tables, in this case the non-linked arrays are related to the wcs).

Let me know if more details would be helpful.

tapastro commented 1 year ago

Ah, in that case it's not quite going right - I see 6618 unlinked arrays, and doing a findall on "source:" shows there are 9928 total arrays in the file. I can move this to somewhere public for you to take a look, if you want.