rweigel / cdawmeta

BSD 3-Clause "New" or "Revised" License
About

This package is an interface to CDAWeb's metadata.

It was originally developed for upgrading the metadata from CDAWeb's HAPI servers; the existing server only includes the minimum required metadata.

As discussed in the SPASE section, the code has been extended to remedy issues with existing SPASE metadata for CDAWeb datasets.

The code reads and combines metadata from

As discussed in the SPASE section, an attempt to use existing SPASE records was abandoned.

The code uses requests-cache so re-downloads of any of the above metadata are only made if the HTTP headers indicate it is needed. When any metadata are downloaded, a diff is stored.
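The source does not say how the stored diff is computed. As a minimal sketch, assuming the metadata are JSON-like dictionaries, a diff between a cached copy and a fresh download could be produced with Python's standard `difflib`. The function name `metadata_diff` and the sample records are hypothetical, and the requests-cache step itself is not shown.

```python
import difflib
import json


def metadata_diff(old: dict, new: dict, name: str = "metadata.json") -> str:
    """Return a unified diff of two metadata dicts, or '' if they are equal."""
    # Serialize with sorted keys so key order alone never produces a diff.
    old_lines = json.dumps(old, indent=2, sort_keys=True).splitlines()
    new_lines = json.dumps(new, indent=2, sort_keys=True).splitlines()
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile=f"{name} (cached)",
        tofile=f"{name} (downloaded)",
        lineterm="",
    )
    return "\n".join(diff)


# Hypothetical records: the downloaded copy has one more variable.
old = {"id": "AC_H0_MFI", "variables": ["Epoch", "Magnitude"]}
new = {"id": "AC_H0_MFI", "variables": ["Epoch", "Magnitude", "BGSEc"]}
print(metadata_diff(old, new))
```

An empty return value means nothing changed, so a caller could skip writing a diff file in that case.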

The output is

  1. HAPI metadata, which is based on all.xml, Master CDF, and orig_data metadata. (The cadence of the measurements for each dataset is determined by sampling the last CDF file associated with the dataset, as identified by the CDASR orig_data endpoint, and computing a histogram of the differences between consecutive time steps.)

  2. SQL and MongoDB databases available with a search interface:

Also, not yet available in table form is
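The cadence estimate described under item 1 of the output list (a histogram of time-step differences) can be sketched as follows. This is only an illustration under stated assumptions: it takes a plain list of sorted `datetime` values rather than reading a CDF file, and the function name `estimate_cadence` and the sample timestamps are hypothetical, not taken from the package.

```python
from collections import Counter
from datetime import datetime, timedelta


def estimate_cadence(times):
    """Return (most common time step, histogram of steps) for sorted datetimes."""
    steps = [t1 - t0 for t0, t1 in zip(times, times[1:])]
    if not steps:
        raise ValueError("need at least two timestamps")
    histogram = Counter(steps)
    # The most frequent difference is taken as the nominal cadence.
    cadence, _count = histogram.most_common(1)[0]
    return cadence, histogram


start = datetime(2000, 1, 1)
# 16-second samples with one missing record, to mimic a data gap.
offsets = [0, 16, 32, 48, 80, 96]
times = [start + timedelta(seconds=s) for s in offsets]

cadence, histogram = estimate_cadence(times)
print(cadence.total_seconds())
```

Using the most common difference rather than the mean keeps occasional gaps from skewing the estimate.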

SPASE

Our initial attempt was to generate HAPI metadata with SPASE records. Several issues were encountered:

  1. Only about 40% of CDAWeb datasets had SPASE records. ~5 years later, there is ~60% coverage. As a result, we had to use metadata from all.xml, Master CDF, and orig_data.

    The implication is that CDAWeb SPASE records cannot be used for one of the intended purposes of SPASE, which is to provide a structured representation of CDAWeb metadata; we needed to duplicate much of the effort that went into creating CDAWeb SPASE records in order to create a complete set of HAPI metadata.

  2. The SPASE metadata is not updated frequently. There are instances where variables have been added to CDAWeb datasets but the SPASE records do not include them. In some instances, SPASE records are missing variables even for datasets that have not changed.

    The implication is that a scientist who executes a search backed by SPASE records may erroneously conclude that variables or datasets are not available.

  3. We considered using SPASE Units for variables when they were available; CDAWeb Master metadata has a UNITS attribute, but no consistent convention is followed and in some cases, UNITS are not scientific units but a label. This effort stopped when we noticed instances where the SPASE Units were wrong. (See also A dump of the unique Master UNITS to SPASE Units pairs).

    The implication is that a scientist using SPASE Units to label their plots risks the plots being incorrect.

    In addition, although there is more consistency in the strings used for SPASE Units, SPASE does not require the use of a standard for the syntax (such as VOUnits, udunits2, or QUDT). HAPI has an option to state the standard used for unit strings so that a validator can check and units-aware software can use it to make automatic unit conversions.
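A dump of unique Master UNITS to SPASE Units pairs, as mentioned in item 3, could be assembled along the following lines. This is a sketch only: the record layout (`master_units`, `spase_units`), the function name `unique_unit_pairs`, and the sample values are all assumptions made for illustration, not the package's data model or real CDAWeb content.

```python
from collections import Counter


def unique_unit_pairs(variables):
    """Count each distinct (Master UNITS, SPASE Units) pair."""
    return Counter((v["master_units"], v["spase_units"]) for v in variables)


# Hypothetical variables; real records would come from CDF Masters and SPASE.
variables = [
    {"name": "B", "master_units": "nT", "spase_units": "nT"},
    {"name": "V", "master_units": "km/s", "spase_units": "km/s"},
    {"name": "N", "master_units": "#/cc", "spase_units": "cm^-3"},
    {"name": "B2", "master_units": "nT", "spase_units": "nT"},
]

pairs = unique_unit_pairs(variables)
for (master, spase), count in sorted(pairs.items()):
    print(f"{master!r} -> {spase!r}: {count}")
```

Listing the pairs side by side makes mismatches between the two sources easy to spot by eye.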

Other issues not related to the generation of HAPI metadata were also noticed.

  1. The AccessInformation nodes are structured in a way that is misleading and clarification is needed.

    1. For example, ACE/Ephemeris/PT12M indicates that the Format for all four AccessURLs is CDF, which is not correct. The CDAWeb access URL has other format options. Also, HAPI is not listed. In other SPASE records where it is listed, only Text is given as the format, but Binary and JSON are also available. The SSCWeb access URL does not provide CDF.

    2. The names of the parameters at SSCWeb are not the same as those at CDAWeb; also, more parameters are available from CDAWeb.

    3. There are SPASE records with the ACE Science Center listed as an AccessURL (e.g., ACE/MAG/L2/PT16S). The variable names used in the ACE Science Center files and metadata differ from those listed in the Parameter node. This is confusing.

  2. CDAWeb datasets may have variables with different DEPEND_0s, and the DEPEND_0s may have different cadences. For example, VOYAGER1_10S_MAG has two DEPEND_0s

    SPASE records can only have a cadence that applies to all parameters, but the cadence listed in the SPASE record for VOYAGER1_10S_MAG, PT9.6S, is not correct for the parameters with a DEPEND_0 of Epoch. This is inaccurate, misleading, and potentially confusing for the user. The fact that some parameters have a different cadence should be noted in the SPASE record.

  3. Dubious information has been created. For example,

Although HAPI has an additionalMetadata attribute, we are reluctant to reference existing SPASE records due to these issues (primarily 2. and 3.). We conclude that it makes more sense to link to less extensive but correct metadata (for example, CDF Master metadata or documentation on the CDAWeb website) than to more extensive SPASE metadata that is confusing (see 4.) or incomplete and in some cases incorrect (see items 2.-3.).

After encountering these issues, we realized that all of these problems could be solved with some additions to the existing CDAWeb-to-HAPI metadata code.

Comments

An example of decoupling: Initially we had used spase_DatasetResourceID in CDF Masters to find SPASE records associated with a CDAWeb dataset. We were later told that not all available SPASE records are listed in CDF Masters. So we had to switch over to drawing directly from the hpde.io repository.

Running

git clone https://github.com/rweigel/cdawmeta.git
cd cdawmeta
make hapi

See the comments in Makefile for additional execution options.

The Makefile command above executes the following

python cdaweb.py         # creates data/cdaweb.json
python hapi/hapi-new.py  # creates data/hapi/hapi-new.json and
                         # data/info/*.json using data/catalog-all.json

Compare

To compare the new HAPI metadata with Nand's, first execute cdaweb.py, then

python hapi/hapi-nl.py   # creates data/hapi/catalog-all.nl.json using requests to
                         # https://cdaweb.gsfc.nasa.gov/hapi/{catalog,info}
make compare [--include='PATTERN'] # creates data/hapi/compare.log
# PATTERN is a dataset ID regular expression, e.g., '^AC_'
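To illustrate how a PATTERN such as '^AC_' narrows the set of dataset IDs, here is a minimal sketch. It assumes the pattern is applied as a Python regular expression with `re.search`, which the source does not confirm; the function name `select_datasets` is hypothetical.

```python
import re


def select_datasets(dataset_ids, pattern):
    """Return the dataset IDs that match the regular expression `pattern`."""
    regex = re.compile(pattern)
    return [d for d in dataset_ids if regex.search(d)]


ids = ["AC_H0_MFI", "AC_OR_SSC", "VOYAGER1_10S_MAG"]
selected = select_datasets(ids, "^AC_")
print(selected)
```

Because of the `^` anchor, only IDs that begin with `AC_` are kept; without it, the pattern could match anywhere in the ID.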

Browse and Search

See table/README.md for browsing and searching metadata from a web interface.