Develop Database Schema

bourque commented 6 years ago

We need to develop a schema for the jwql database. I think a decent starting point is something like the schema I used for ACS Quicklook:

In this schema, we have a master table that keeps track of each rootname that is in the database and when it was ingested. The datasets table keeps track of which filetypes exist for a given rootname. Then there is a table for each detector/extension/filetype combination which is basically a dump of the headers (columns are header keys and values are header values).
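To make the layout described above concrete, here is a minimal sketch using Python's built-in `sqlite3`. The table names (`master`, `datasets`, `uncal_primary`), the column names, and the three header keywords are illustrative assumptions, not the actual ACS Quicklook or jwql definitions.

```python
import sqlite3


def create_schema(conn):
    """Create a toy version of the master/datasets/header-table layout."""
    conn.executescript(
        """
        -- One row per rootname, with the time it was ingested.
        CREATE TABLE master (
            id          INTEGER PRIMARY KEY,
            rootname    TEXT NOT NULL UNIQUE,
            ingest_date TEXT NOT NULL
        );
        -- Which filetypes exist for a given rootname (1 = present).
        CREATE TABLE datasets (
            id        INTEGER PRIMARY KEY,
            master_id INTEGER NOT NULL REFERENCES master(id),
            uncal     INTEGER NOT NULL DEFAULT 0,
            rate      INTEGER NOT NULL DEFAULT 0,
            cal       INTEGER NOT NULL DEFAULT 0
        );
        -- One header table per filetype/extension combination:
        -- columns are header keys, values are header values.
        CREATE TABLE uncal_primary (
            id        INTEGER PRIMARY KEY,
            master_id INTEGER NOT NULL REFERENCES master(id),
            TELESCOP  TEXT,
            INSTRUME  TEXT,
            DETECTOR  TEXT
        );
        """
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)

# The rootname below is made up for the example.
cur = conn.execute(
    "INSERT INTO master (rootname, ingest_date) VALUES (?, ?)",
    ("jw00001001001_01101_00001_nrca1", "2018-01-01T00:00:00"),
)
conn.execute("INSERT INTO datasets (master_id, uncal) VALUES (?, 1)", (cur.lastrowid,))
```

Keeping one header table per filetype/extension pair means a new keyword only touches one table, at the cost of many tables overall.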

To construct this for jwql, we will need to know the following for each instrument:

What are all of the possible filetypes and what purpose do they serve?
What is the data structure for each filetype (i.e. number of extensions, what purpose each extension serves, what datatype each one is)?
What are the header keywords for each filetype/extension combination?

gkanarek commented 6 years ago

How flexible is this? The JWST keywords, header info, filetypes, etc. are still in flux (nowhere near as stable as WFC3), so we need to be able to evolve the schema in response to these changes.

bourque commented 6 years ago

If we build this right, changes to the schema for the header tables should be as simple as updating a text file and adding/removing columns in the database. Changes to the data structure itself (i.e. new filetypes, new/different FITS extensions) would be a bit trickier because that would mean adding new tables and not just new columns.
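As a rough illustration of the "update a text file, then add columns" idea, the sketch below reads a keyword list and adds any missing columns to a header table. It assumes SQLite, a hypothetical one-keyword-per-line text format, and a hypothetical table name; removing columns and adding new tables are not covered.

```python
import sqlite3


def read_keywords(text):
    """Parse one keyword per line, ignoring blank lines and '#' comments."""
    keywords = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            keywords.append(line.upper())
    return keywords


def sync_columns(conn, table, keywords):
    """Add a TEXT column for every keyword the table does not have yet."""
    # Row index 1 of PRAGMA table_info is the column name.
    existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
    added = []
    for key in keywords:
        if key not in existing:
            # Quote the name: FITS keywords such as DATE-OBS contain hyphens.
            conn.execute(f'ALTER TABLE "{table}" ADD COLUMN "{key}" TEXT')
            added.append(key)
    return added


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE uncal_primary (id INTEGER PRIMARY KEY, "TELESCOP" TEXT)')

keyword_file = """
# keywords for the hypothetical uncal_primary table
TELESCOP
DETECTOR
DATE-OBS
"""
added = sync_columns(conn, "uncal_primary", read_keywords(keyword_file))
```

Running `sync_columns` a second time with the same list adds nothing, so the text file can be treated as the source of truth and re-applied safely.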

This brings up another question: How often should we anticipate changes to the header keywords/filetypes/FITS extensions after launch?

bhilbert4 commented 6 years ago

My guess is that header keyword changes after launch won't be too common, but I'm sure it will happen from time to time.

For what it's worth, I have a function that returns all of the header keywords for a requested reference file type. It does this by reading in the appropriate schema definition files that SSB has in the JWST Calibration Pipeline repo. I doubt it would be hard to update it to work on the data filetypes.
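This is not the function mentioned above, only a sketch of the general approach. It assumes a schema file has already been loaded into nested dicts/lists and that each header keyword is marked with a `fits_keyword` entry; the fragment below is made up for the example.

```python
def collect_fits_keywords(schema):
    """Recursively gather every 'fits_keyword' value in a nested schema."""
    found = []
    if isinstance(schema, dict):
        if "fits_keyword" in schema:
            found.append(schema["fits_keyword"])
        for value in schema.values():
            found.extend(collect_fits_keywords(value))
    elif isinstance(schema, list):
        for item in schema:
            found.extend(collect_fits_keywords(item))
    return found


# Hypothetical fragment, shaped like a schema loaded from a definition file.
example_schema = {
    "properties": {
        "meta": {
            "properties": {
                "telescope": {"type": "string", "fits_keyword": "TELESCOP"},
                "instrument": {
                    "properties": {
                        "name": {"type": "string", "fits_keyword": "INSTRUME"},
                        "detector": {"type": "string", "fits_keyword": "DETECTOR"},
                    }
                },
            }
        }
    }
}

keywords = collect_fits_keywords(example_schema)
```

The resulting list could feed directly into the column names of a header table.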

bhilbert4 commented 6 years ago

Filetypes that will be ingested into MAST:

_uncal.fits (raw)
_rate.fits, _rateints.fits (countrate images, level-2a)
_cal.fits, _calints.fits (flux calibrated, full WCS-added countrate images, level-2b)
_i2d.fits, _s2d.fits, _s3d.fits (resampled, both for individual exposures and combined)
_x1d.fits (extracted spectra, both for individual exposures and combined)

I'll put together more details on each soon.
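For reference, the suffix list above can be captured in a small lookup, with a helper that splits a filename into rootname and filetype. The example filename is made up, and the helper assumes the filetype is always the last underscore-separated piece before `.fits`.

```python
# Suffixes and descriptions taken from the list above.
FILETYPES = {
    "uncal": "raw",
    "rate": "countrate image (level-2a)",
    "rateints": "countrate images per integration (level-2a)",
    "cal": "flux calibrated countrate image (level-2b)",
    "calints": "flux calibrated countrate images per integration (level-2b)",
    "i2d": "resampled",
    "s2d": "resampled",
    "s3d": "resampled",
    "x1d": "extracted spectra",
}


def split_filename(filename):
    """Return (rootname, filetype) for a name like '<rootname>_<filetype>.fits'."""
    stem = filename[: -len(".fits")] if filename.endswith(".fits") else filename
    rootname, _, suffix = stem.rpartition("_")
    if suffix not in FILETYPES:
        raise ValueError(f"unrecognized filetype suffix in {filename!r}")
    return rootname, suffix


rootname, filetype = split_filename("jw00001001001_01101_00001_nrca1_rate.fits")
```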

bhilbert4 commented 6 years ago

Data structures:

JWST jargon

frame = one readout of the detector
group = made from single frame or (onboard) average of multiple frames
integration = multiple groups, with detector resets before and after (equivalent to single HST file).
exposure = multiple nominally-identical integrations packaged into the same file (like packing multiple HST raw ramps into a single file).

_uncal.fits - raw, uncalibrated file

No.  Name       Ver  Type         Cards  Dimensions           Format
  0  PRIMARY      1  PrimaryHDU      89  ()
  1  SCI          1  ImageHDU        25  (2048, 2048, 10, 1)  int16 (rescales to uint16)
  2  ZEROFRAME    1  ImageHDU        11  (2048, 2048, 1)      int16 (rescales to uint16)
  3  GROUP        1  BinTableHDU     35  10R x 13C            [I, I, I, J, I, 26A, I, I, I, I, 36A, D, D]

SCI extension contains the detector data. 4 dimensions (detector y, detector x, groups per integration, integrations)
ZEROFRAME extension contains the 0th frame that goes with each integration. For some readout patterns, each group will be the average of N frames. This averaging is done on board JWST. The 0th frame is saved to this separate extension for cases where the initial read is needed for slope fitting. 3 dimensions (detector y, detector x, integrations)
GROUP extension is a binary table that contains detailed timing information about the exposure. The table contains 13 columns, and one row for each M milliseconds of the exposure.
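One thing to watch when turning these listings into code: assuming the listing comes from astropy's `HDUList.info()`, the dimensions are printed in FITS axis order, which is the reverse of the numpy array shape. A small sketch of that bookkeeping (the axis labels are my own naming):

```python
# Axis labels in numpy order, slowest-varying axis first.
UNCAL_SCI_AXES = ("integrations", "groups", "y", "x")
ZEROFRAME_AXES = ("integrations", "y", "x")


def label_axes(fits_dimensions, axis_names):
    """Reverse FITS-order dimensions into numpy order and label each axis."""
    shape = tuple(reversed(fits_dimensions))
    if len(shape) != len(axis_names):
        raise ValueError("number of dimensions does not match the axis labels")
    return dict(zip(axis_names, shape))


# Dimensions copied from the _uncal.fits listing above.
sci = label_axes((2048, 2048, 10, 1), UNCAL_SCI_AXES)
zeroframe = label_axes((2048, 2048, 1), ZEROFRAME_AXES)
```

For the example file this gives 1 integration of 10 groups, with a single zero frame for that integration.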

GROUP (13 columns x 1 rows):

 Col# Name (Units)       Format
   1 integration_number   I
   2 group_number         I
   3 end_day              I
   4 end_milliseconds     J
   5 end_submilliseconds  I
   6 group_end_time       26A
   7 number_of_columns    I
   8 number_of_rows       I
   9 number_of_gaps       I
  10 completion_code_numb I
  11 completion_code_text 36A
  12 bary_end_time (MJD)  D
  13 helio_end_time (MJD) D
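If the GROUP table were ever mirrored in the database, the FITS binary-table format codes map onto SQL types fairly directly (`I` and `J` are 16-bit and 32-bit integers, `D` is a double, `nA` is an n-character string). The sketch below assumes SQLite and a hypothetical table name `uncal_group`; column names are kept exactly as printed in the listing, including the clipped tenth name.

```python
import re
import sqlite3

# (name, format) pairs copied from the GROUP listing above.
GROUP_COLUMNS = [
    ("integration_number", "I"),
    ("group_number", "I"),
    ("end_day", "I"),
    ("end_milliseconds", "J"),
    ("end_submilliseconds", "I"),
    ("group_end_time", "26A"),
    ("number_of_columns", "I"),
    ("number_of_rows", "I"),
    ("number_of_gaps", "I"),
    ("completion_code_numb", "I"),
    ("completion_code_text", "36A"),
    ("bary_end_time", "D"),
    ("helio_end_time", "D"),
]

# Only the four format codes that appear in this table are handled.
SQL_TYPES = {"I": "INTEGER", "J": "INTEGER", "D": "REAL", "A": "TEXT"}


def sql_type(tform):
    """Translate a format code such as 'I' or '26A' into an SQLite type."""
    match = re.fullmatch(r"(\d*)([A-Z])", tform)
    if match is None or match.group(2) not in SQL_TYPES:
        raise ValueError(f"unsupported format code: {tform!r}")
    return SQL_TYPES[match.group(2)]


def create_table_sql(table, columns):
    """Build a CREATE TABLE statement from (name, format) pairs."""
    body = ", ".join(f'"{name}" {sql_type(tform)}' for name, tform in columns)
    return f'CREATE TABLE "{table}" ({body})'


conn = sqlite3.connect(":memory:")
conn.execute(create_table_sql("uncal_group", GROUP_COLUMNS))
```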

_rate.fits - countrate images (equivalent to HST flt)

This is the output of the Level 2A pipeline, which includes basic calibrations (superbias subtraction, linearity correction, slope fitting). For an exposure that contains a single integration, the *_rate.fits file contains the slope image created by line-fitting to the groups of the integration. For an exposure that contains multiple integrations, the *_rate.fits file contains the mean slope image from all integrations. In this case, the pipeline also outputs a *_rateints.fits file, which contains the separate slope images from all of the integrations. In a *_rateints.fits file, extensions 1-5 therefore have one more dimension (integrations) than shown below.
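To illustrate the relationship between the two products for a single pixel, here is a toy sketch: an ordinary least-squares line fit per integration (the "rateints" values), then the plain mean of those slopes (the "rate" value). The real pipeline's ramp fitting does considerably more than this, and the times and counts below are made up.

```python
def fit_slope(times, counts):
    """Ordinary least-squares slope of counts versus time."""
    n = len(times)
    mean_t = sum(times) / n
    mean_c = sum(counts) / n
    numerator = sum((t - mean_t) * (c - mean_c) for t, c in zip(times, counts))
    denominator = sum((t - mean_t) ** 2 for t in times)
    return numerator / denominator


def rate_products(times, integrations):
    """Return (mean slope, per-integration slopes) for one pixel."""
    rateints = [fit_slope(times, ramp) for ramp in integrations]
    rate = sum(rateints) / len(rateints)
    return rate, rateints


# Two integrations of four groups each, for one pixel.
group_times = [0.0, 10.0, 20.0, 30.0]
ramps = [
    [100.0, 120.0, 140.0, 160.0],  # 2 counts per unit time
    [100.0, 130.0, 160.0, 190.0],  # 3 counts per unit time
]
rate, rateints = rate_products(group_times, ramps)
```

With a single integration the two outputs carry the same value, which matches the single-integration case described above.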

No.  Name         Ver  Type        Cards  Dimensions    Format
  0  PRIMARY        1  PrimaryHDU    159  ()
  1  SCI            1  ImageHDU       29  (2048, 2048)  float32
  2  ERR            1  ImageHDU       10  (2048, 2048)  float32
  3  DQ             1  ImageHDU       11  (2048, 2048)  int32 (rescales to uint32)
  4  VAR_POISSON    1  ImageHDU        9  (2048, 2048)  float32
  5  VAR_RNOISE     1  ImageHDU        9  (2048, 2048)  float32
  6  ASDF           1  ImageHDU        7  (3889,)       uint8

SCI extension - slope images. 2-dimensional (detector y, detector x)
ERR extension - errors on the slope values. 2-dimensional (detector y, detector x)
DQ extension - data quality array. 2-dimensional (detector y, detector x)
VAR_POISSON - contribution to the variance on the slopes due to Poisson noise. 2-dimensional (detector y, detector x)
VAR_RNOISE - contribution to the variance on the slopes due to readnoise. 2-dimensional (detector y, detector x)
ASDF - Contains distortion correction model information

*_cal.fits - Calibrated file

Output from the level 2b pipeline. Flux calibration applied, flat field applied, distortion solution added. As with the *_rate.fits and *_rateints.fits files above, there is a *_cal.fits file (containing the averaged image if there is more than one integration per exposure, or the single image if there is a single integration) and a *_calints.fits file (containing the individual calibrated images if there are multiple integrations per exposure).

No.  Name         Ver  Type        Cards  Dimensions    Format
  0  PRIMARY        1  PrimaryHDU    250  ()
  1  SCI            1  ImageHDU       32  (2048, 2048)  float32
  2  ERR            1  ImageHDU       10  (2048, 2048)  float32
  3  DQ             1  ImageHDU       11  (2048, 2048)  int32 (rescales to uint32)
  4  AREA           1  ImageHDU        9  (2048, 2048)  float32
  5  VAR_POISSON    1  ImageHDU        9  (2048, 2048)  float32
  6  VAR_RNOISE     1  ImageHDU        9  (2048, 2048)  float32
  7  ASDF           1  ImageHDU        7  (13515,)      uint8

Extensions are the same as in the case of the *_rate.fits image, plus the AREA extension, which is a 2D image containing the pixel area map.
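Since each filetype gets its own set of header tables, it may help to record which extensions each one is expected to carry and flag files that deviate. This sketch uses the extension names from the listings in this thread; the function name and the idea of checking at ingest time are assumptions.

```python
# Extension names copied from the HDU listings in this thread.
RATE_EXTENSIONS = ("SCI", "ERR", "DQ", "VAR_POISSON", "VAR_RNOISE", "ASDF")

EXPECTED_EXTENSIONS = {
    "uncal": ("SCI", "ZEROFRAME", "GROUP"),
    "rate": RATE_EXTENSIONS,
    # Same as rate, plus the pixel area map.
    "cal": ("SCI", "ERR", "DQ", "AREA", "VAR_POISSON", "VAR_RNOISE", "ASDF"),
}


def missing_extensions(filetype, present):
    """List expected extension names that are absent from a file."""
    present = set(present)
    return [name for name in EXPECTED_EXTENSIONS[filetype] if name not in present]


# A rate-style extension list checked against the cal expectations.
missing = missing_extensions("cal", RATE_EXTENSIONS)
```

Here `missing` contains only `AREA`, reflecting the one extension that distinguishes the cal listing from the rate listing.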

bourque commented 6 years ago

Thanks @bhilbert4 this is very helpful!

bhilbert4 commented 6 years ago

JDox page on filetypes and formats: https://jwst-docs-stage.stsci.edu/pages/viewpage.action?spaceKey=JDAT&title=File+Naming+Conventions+and+Data+Products

bhilbert4 commented 6 years ago

_i2d.fits file format - Identical to the _cal.fits format above.

No.  Name         Ver  Type        Cards  Dimensions    Format
  0  PRIMARY        1  PrimaryHDU    230  ()
  1  SCI            1  ImageHDU       46  (2048, 2048)  float32
  2  ERR            1  ImageHDU       10  (2048, 2048)  float32
  3  DQ             1  ImageHDU       11  (2048, 2048)  int32 (rescales to uint32)
  4  AREA           1  ImageHDU        9  (2048, 2048)  float32
  5  VAR_POISSON    1  ImageHDU        9  (2048, 2048)  float32
  6  VAR_RNOISE     1  ImageHDU        9  (2048, 2048)  float32
  7  ASDF           1  ImageHDU        7  (13749,)      uint8

SaOgaz commented 6 years ago

Just to confirm, based on Tom Donaldson's confluence page and what @bhilbert4 has said here: if you have (for example) a _rate.fits and a _uncal.fits that both correspond to the same original image, is everything in the * of the filename identical?

cracraft commented 6 years ago

My understanding is that the rest of the filename will be consistent between the two.

bhilbert4 commented 6 years ago

Yes, that's correct
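Given that confirmation, files can be grouped by everything before the final suffix, which is what a rootname-keyed master table would rely on. A minimal sketch, with made-up filenames and a hypothetical function name:

```python
from collections import defaultdict


def group_by_rootname(filenames):
    """Map each rootname to the set of filetype suffixes found for it."""
    groups = defaultdict(set)
    for name in filenames:
        stem = name[: -len(".fits")] if name.endswith(".fits") else name
        # The filetype is assumed to be the last underscore-separated piece.
        rootname, _, suffix = stem.rpartition("_")
        groups[rootname].add(suffix)
    return dict(groups)


files = [
    "jw00001001001_01101_00001_nrca1_uncal.fits",
    "jw00001001001_01101_00001_nrca1_rate.fits",
    "jw00001001001_01101_00002_nrca1_uncal.fits",
]
groups = group_by_rootname(files)
```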

bourque commented 6 years ago

Per @SaraOgaz

Jonathon pointed me to this doc page for the pipeline where there’s a whole section about the associations: https://jwst-pipeline.readthedocs.io/en/latest/jwst/associations/index.html

bourque commented 6 years ago

Now that we have decided to use the MAST API, this is no longer needed.
