ncasuk / amf-check-writer

Library to write AMF compliance checks
BSD 3-Clause "New" or "Revised" License

amf-check-writer

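This repo contains scripts to download the AMF vocabulary/rule spreadsheets from Google Drive, generate controlled vocabularies (CVs) and YAML checks from them, and run those checks against datasets.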

The checks are generated in YAML format for use with the cc-yaml plugin for compliance-checker. The code for the checks themselves is implemented in compliance-check-lib.

Installation

Dependencies for Compliance Checker and compliance-check-lib include some packages that must be compiled from source, which can be tricky to set up. The recommended approach is to use a CentOS 6 machine and do the following:

(alternatively use a JASMIN VM which will already have the JAP and those packages installed)

Then create a Python 2.7 virtual environment and install the required Python packages:

virtualenv -p python2.7 --system-site-packages venv
source venv/bin/activate

pip install git+https://github.com/ncasuk/amf-check-writer \
            git+https://github.com/cedadev/compliance-checker \
            git+https://github.com/cedadev/compliance-check-lib \
            git+https://github.com/cedadev/cc-yaml

Quickstart

The simplified workflow to create the checks and vocabs is:

  1. download
  2. make checks
  3. make CVs

Define a temporary output directory and create it to write the checks/vocabs to:

export DATA_DIR=$PWD/check-data
mkdir -p $DATA_DIR

Set the version of the checks/vocabs to use:

VERSION=v2.0

NOTE: Before downloading the spreadsheets the first time, see the 'authentication' section below.

Download the content of the Google spreadsheet vocabularies/rules into local files:

download-from-drive -v $VERSION --regenerate --secrets client-secret.json $DATA_DIR

Run a script to create the YAML representation of the checks:

create-yaml-checks -s $DATA_DIR -v $VERSION

Run a script to create the Controlled Vocabularies (in JSON and PYESSV formats):

create-cvs -s $DATA_DIR -v $VERSION

Run an example check (you may first need to download the training data):

# Set the pyessv archive directory to use:
export PYESSV_ARCHIVE_HOME=$DATA_DIR/$VERSION/pyessv-vocabs

# Run the checker on some test data
TEST_FILE=../NCAS-Data-Project-Training-Data/Data/ncas-anemometer-1_ral_29001225_mean-winds_v0.1.nc

amf-checker --yaml-dir $DATA_DIR/$VERSION/checks $TEST_FILE --version $VERSION
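The Quickstart steps above can also be driven from Python. The following is a minimal sketch, not part of this repo: the helper names `quickstart_commands` and `checker_command` are hypothetical, and the only command-line flags used are the ones shown in the commands above.

```python
def quickstart_commands(data_dir, version, secrets="client-secret.json"):
    """Return the three Quickstart commands as argument lists, in order."""
    return [
        # 1. Download the spreadsheets into local files
        ["download-from-drive", "-v", version, "--regenerate",
         "--secrets", secrets, data_dir],
        # 2. Create the YAML representation of the checks
        ["create-yaml-checks", "-s", data_dir, "-v", version],
        # 3. Create the controlled vocabularies
        ["create-cvs", "-s", data_dir, "-v", version],
    ]


def checker_command(data_dir, version, test_file):
    """Return the amf-checker command for one test file."""
    yaml_dir = "{}/{}/checks".format(data_dir, version)
    return ["amf-checker", "--yaml-dir", yaml_dir, test_file,
            "--version", version]


# Print the commands rather than running them
for cmd in quickstart_commands("check-data", "v2.0"):
    print(" ".join(cmd))
```

To actually run the steps, each list could be passed to `subprocess.check_call`, which stops at the first step that fails. PYESSV_ARCHIVE_HOME still needs to be set before running the checker command.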

Scripts

download-from-drive

Usage: download-from-drive [--secrets <secrets JSON>] <output dir>.

This script recursively finds all spreadsheets under a folder in Google Drive and saves each worksheet as a .tsv file (the root folder ID is hardcoded in amf_check_writer/download-from-drive.py).

The directory structure of the Drive folder is preserved, and a directory for each spreadsheet is created. The individual sheets are saved as <sheet name>.tsv inside the spreadsheet directory.

For example, after running download-from-drive /tmp/mysheets with a test folder:

$ tree /tmp/mysheets
/tmp/mysheets
├── first-spreadsheet.xlsx
│   ├── Sheet1.tsv
│   └── Sheet2.tsv
└── sub-folder
    ├── second-spreadsheet.xlsx
    │   └── Sheet1.tsv
    └── sub-sub-dir
        └── other-spreadsheet.xlsx
            └── my-sheet.tsv

5 directories, 4 files
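To inspect a downloaded tree like the one above, a short Python sketch such as the following could list the worksheets found for each spreadsheet. It only walks the local output directory and does not contact Google Drive; the function name `find_worksheets` is hypothetical and not part of this repo.

```python
import os


def find_worksheets(root):
    """Map each spreadsheet directory (relative to root) to its .tsv files."""
    sheets = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        tsvs = sorted(f for f in filenames if f.endswith(".tsv"))
        if tsvs:
            # Directories that hold .tsv files correspond to spreadsheets
            sheets[os.path.relpath(dirpath, root)] = tsvs
    return sheets
```

For the example tree, the result would have one key per spreadsheet directory, such as `first-spreadsheet.xlsx`, each mapped to the list of worksheet files it contains.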

Authentication

Downloading spreadsheets from Google Drive requires the script to authenticate as your Google account. This is done using a JSON file obtained from the Google API dashboard.

Alternatively, follow the quickstart guide on the Google Sheets site to enable the Sheets API and create credentials (this also allows you to create a new project):

https://developers.google.com/sheets/api/quickstart/python

After this, visit the API dashboard to enable the Drive API, as detailed above. You do not need to create another credentials JSON file.

create-cvs

Usage: create-cvs [--pyessv-dir <pyessv root>] <spreadsheets dir> <output dir>.

This script reads .tsv files downloaded with download-from-drive, and generates controlled vocabularies in JSON format from various worksheets. Each file is saved in <output dir> as AMF_<name>.json.

CVs are created for:

The format of the CVs is specific to each type.

Each CV is also saved with pyessv and written to pyessv's archive directory. The directory can be overridden with the --pyessv-dir option. If you use a non-standard pyessv archive directory, you must set the PYESSV_ARCHIVE_HOME environment variable accordingly when running compliance-checker or amf-checker.
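When the checker is launched from a Python script, the variable can be set for the child process only. This is a minimal sketch under that assumption; `checker_env` is a hypothetical helper, not something this repo provides.

```python
import os


def checker_env(pyessv_dir, base_env=None):
    """Return a copy of the environment with PYESSV_ARCHIVE_HOME set."""
    env = dict(os.environ if base_env is None else base_env)
    # Point pyessv at the non-standard archive directory
    env["PYESSV_ARCHIVE_HOME"] = os.path.abspath(pyessv_dir)
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.call` or `subprocess.check_call` when running amf-checker or compliance-checker, leaving the parent process environment untouched.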

create-yaml-checks

Usage: create-yaml-checks <spreadsheets dir> <output dir>.

This script reads .tsv files and produces YAML checks to be used with cc-yaml and compliance-check-lib.

Similar to create-cvs, checks are saved in <output dir> as AMF_<name>.yml. Checks are created for:

For each data product/deployment mode combination, a check AMF_product_<name>_<mode>.yml is created that includes the global checks and the relevant variable/dimension checks for the product and mode. For example:

AMF_product_soil_land.yml:

suite_name: product_soil_land_checks
checks:
# Global checks
- {__INCLUDE__: AMF_file_info.yml}
- {__INCLUDE__: AMF_file_structure.yml}
- {__INCLUDE__: AMF_global_attrs.yml}
# Common checks for 'land' deployment mode
- {__INCLUDE__: AMF_product_common_dimension_land.yml}
- {__INCLUDE__: AMF_product_common_variable_land.yml}
# Product specific
- {__INCLUDE__: AMF_product_soil_dimension.yml}
- {__INCLUDE__: AMF_product_soil_variable.yml}
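The naming pattern in the example above can be illustrated with a short Python sketch that rebuilds the same text for a given product and mode. This is only an illustration of the file layout; `product_suite_yaml` is a hypothetical name and this is not necessarily how create-yaml-checks produces the file.

```python
def product_suite_yaml(product, mode):
    """Build the text of AMF_product_<product>_<mode>.yml."""
    groups = [
        ("Global checks", [
            "AMF_file_info.yml",
            "AMF_file_structure.yml",
            "AMF_global_attrs.yml",
        ]),
        ("Common checks for '{}' deployment mode".format(mode), [
            "AMF_product_common_dimension_{}.yml".format(mode),
            "AMF_product_common_variable_{}.yml".format(mode),
        ]),
        ("Product specific", [
            "AMF_product_{}_dimension.yml".format(product),
            "AMF_product_{}_variable.yml".format(product),
        ]),
    ]
    lines = ["suite_name: product_{}_{}_checks".format(product, mode),
             "checks:"]
    for comment, files in groups:
        lines.append("# " + comment)
        # Doubled braces produce literal { and } in the output
        lines.extend("- {{__INCLUDE__: {}}}".format(f) for f in files)
    return "\n".join(lines) + "\n"
```

Calling `product_suite_yaml("soil", "land")` reproduces the AMF_product_soil_land.yml text shown above.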

amf-checker

Usage: amf-checker [--yaml-dir <yaml dir>] [-o <output dir>] [-f <output format>] <dataset>...

Wrapper script around compliance-checker to automatically find and run the relevant YAML checks for AMF datasets. See --help output for detailed help on the meaning of the available options.

<dataset> can be either the path to a NetCDF file or a directory, in which case all files in the directory are checked. Multiple files/directories can be given, so shell globs can be used: e.g.

amf-checker /path/to/data/*.nc
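The file-or-directory behaviour described above can be sketched in Python as follows. This is an assumption-laden illustration rather than the script's actual code: `expand_datasets` is a hypothetical name, and the sketch does not descend into sub-directories, since the text does not say whether amf-checker does.

```python
import os


def expand_datasets(paths):
    """Expand files and directories into a flat list of files to check."""
    files = []
    for path in paths:
        if os.path.isdir(path):
            # Non-recursive: only files directly inside the directory
            for name in sorted(os.listdir(path)):
                full = os.path.join(path, name)
                if os.path.isfile(full):
                    files.append(full)
        else:
            files.append(path)
    return files
```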

Testing

Run the tests using:

pytest amf_check_writer/tests.py