This repo contains scripts to:

- Download spreadsheets containing specifications for AMF data products from a folder in Google Drive
- Generate check suites for the IOOS compliance checker based on the contents of the spreadsheets
- Generate controlled vocabulary files from the spreadsheets
The checks are generated in YAML format for use with the cc-yaml plugin for compliance-checker. The code for the checks themselves is implemented in compliance-check-lib.
Dependencies for Compliance Checker and compliance-check-lib include some packages that must be compiled from source, which can be tricky to set up. The recommended approach is to use a CentOS 6 machine and do the following:
Install the JASMIN Analysis Platform
Install the following packages: yum install python27-netCDF4 python27-iris python27-cf python27-virtualenv python27-cf_units
(alternatively use a JASMIN VM which will already have the JAP and those packages installed)
Then create a Python 2.7 virtual environment and install the required python packages:
virtualenv -p python2.7 --system-site-packages venv
source venv/bin/activate
pip install git+https://github.com/ncasuk/amf-check-writer \
git+https://github.com/cedadev/compliance-checker \
git+https://github.com/cedadev/compliance-check-lib \
git+https://github.com/cedadev/cc-yaml
The simplified workflow to create the checks and vocabs is:
Define a temporary output directory and create it to write the checks/vocabs to:
export DATA_DIR=$PWD/check-data
mkdir -p $DATA_DIR
Set the version of the checks/vocabs to use:
VERSION=v2.0
NOTE: Before downloading the spreadsheets the first time, see the 'authentication' section below.
Download the content of the Google spreadsheet vocabularies/rules into local files:
download-from-drive -v $VERSION --regenerate --secrets client-secret.json $DATA_DIR
Run a script to create the YAML representation of the checks:
create-yaml-checks -s $DATA_DIR -v $VERSION
Run a script to create the Controlled Vocabularies (in JSON and PYESSV formats):
create-cvs -s $DATA_DIR -v $VERSION
Run an example check (you may first need to download the training data):
# Set the pyessv archive directory to use:
export PYESSV_ARCHIVE_HOME=$DATA_DIR/$VERSION/pyessv-vocabs
# Run the checker on some test data
TEST_FILE=../NCAS-Data-Project-Training-Data/Data/ncas-anemometer-1_ral_29001225_mean-winds_v0.1.nc
amf-checker --yaml-dir $DATA_DIR/$VERSION/checks $TEST_FILE --version $VERSION
Usage: download-from-drive [--secrets <secrets JSON>] <output dir>

This script recursively finds all spreadsheets under a folder in Google Drive
and saves each worksheet as a .tsv file (the root folder ID is hardcoded in
amf_check_writer/download-from-drive.py).
The directory structure of the Drive folder is preserved, and a directory is
created for each spreadsheet. The individual sheets are saved as
<sheet name>.tsv inside the spreadsheet directory.
For example, after running download-from-drive /tmp/mysheets with a test folder:
$ tree /tmp/mysheets
/tmp/mysheets
├── first-spreadsheet.xlsx
│ ├── Sheet1.tsv
│ └── Sheet2.tsv
└── sub-folder
├── second-spreadsheet.xlsx
│ └── Sheet1.tsv
└── sub-sub-dir
└── other-spreadsheet.xlsx
└── my-sheet.tsv
5 directories, 4 files
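The path layout above can be summarised by a small path-mapping routine. The following is an illustrative sketch only (the helper name and its arguments are invented, not the actual code in amf_check_writer):

```python
import os

def sheet_save_path(output_dir, folder_path, spreadsheet_name, sheet_name):
    """Illustrative sketch: build the local path for one worksheet,
    preserving the Drive folder hierarchy. A directory is created per
    spreadsheet, and each worksheet becomes <sheet name>.tsv inside it."""
    sheet_dir = os.path.join(output_dir, *folder_path, spreadsheet_name)
    os.makedirs(sheet_dir, exist_ok=True)
    return os.path.join(sheet_dir, sheet_name + ".tsv")

# Example mirroring the tree output above:
p = sheet_save_path("/tmp/mysheets", ["sub-folder"],
                    "second-spreadsheet.xlsx", "Sheet1")
# p == "/tmp/mysheets/sub-folder/second-spreadsheet.xlsx/Sheet1.tsv"
```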
Downloading spreadsheets from Google Drive requires the script to authenticate as your Google account. This is done using a JSON file obtained from the Google API dashboard. To create this file:
1. Select a project from the dropdown in the header bar, or create a new project (blue button named 'Create project')
2. Click the 'Enable APIs and Services' button in the header bar
3. Search for 'Google Drive API'. Click the result and press 'Enable'. Return to the dashboard and do the same for 'Google Sheets API'
4. Return to the dashboard and click 'Credentials' in the sidebar on the left (key icon)
5. Click 'Create credentials' and select 'OAuth client ID'. Select 'Desktop app' for the application type and follow the prompts. Dismiss the popup that appears.
6. You should see the newly created credentials in the table. On the right hand side of the table there is a download icon ('Download JSON'). Click it and save the JSON file.
Run download-from-drive and use the --secrets option to point to the JSON
file just downloaded. Credentials are cached in ~/.credentials after
initial authentication, so --secrets is only required the first time.
You will be given a URL to visit in a web browser and prompted for a verification code. This lets you sign in to a Google account and give permission for the app to access your data on Google Drive/Sheets.
Alternatively, follow the quickstart guide on the Google Sheets site to enable the Sheets API and create credentials (this also allows you to create a new project):
https://developers.google.com/sheets/api/quickstart/python
After this, visit the API dashboard to enable the Drive API, as detailed above. You do not need to create another credentials JSON file.
Usage: create-cvs [--pyessv-dir <pyessv root>] <spreadsheets dir> <output dir>

This script reads .tsv files downloaded with download-from-drive, and
generates controlled vocabularies in JSON format from various worksheets. Each
file is saved in <output dir> as AMF_<name>.json.
CVs are created for:

- AMF_scientist.json
- AMF_product_common_{variable,dimension}_{air,land,sea}.json

The format of the CVs is specific to each type.
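As a rough illustration of how a worksheet's TSV rows can become JSON records, the following sketch parses TSV text into a list of dicts. The column names and output shape here are invented for the example, not the actual AMF CV formats:

```python
import csv
import io
import json

def tsv_to_records(tsv_text):
    """Parse a worksheet saved as TSV into a list of row dicts.
    The column names below are hypothetical."""
    return list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

tsv = "name\temail\nA. Scientist\ta.scientist@example.com\n"
records = tsv_to_records(tsv)
print(json.dumps(records, indent=2))
```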
Each CV is also saved with pyessv and written to pyessv's archive directory.
The directory can be overridden with the --pyessv-dir option. Beware that if
you use a non-standard pyessv archive directory, you must set the
PYESSV_ARCHIVE_HOME environment variable accordingly when running
compliance-checker or amf-checker.
Usage: create-yaml-checks <spreadsheets dir> <output dir>

This script reads .tsv files and produces YAML checks to be used with cc-yaml
and compliance-check-lib. Similarly to create-cvs, checks are saved in
<output dir> as AMF_<name>.yml.
Checks are created as follows: for each data product/deployment mode
combination, a check AMF_product_<name>_<mode>.yml is created that includes
the global checks and the relevant variable/dimension checks for the product
and mode, e.g. AMF_product_soil_land.yml:
suite_name: product_soil_land_checks
checks:
# Global checks
- {__INCLUDE__: AMF_file_info.yml}
- {__INCLUDE__: AMF_file_structure.yml}
- {__INCLUDE__: AMF_global_attrs.yml}
# Common checks for 'land' deployment mode
- {__INCLUDE__: AMF_product_common_dimension_land.yml}
- {__INCLUDE__: AMF_product_common_variable_land.yml}
# Product specific
- {__INCLUDE__: AMF_product_soil_dimension.yml}
- {__INCLUDE__: AMF_product_soil_variable.yml}
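The __INCLUDE__ directives are resolved by cc-yaml when the suite is loaded. The following is a minimal sketch of the idea only (not cc-yaml's actual implementation; the check IDs are invented), expanding each include into the checks of the referenced suite, recursively:

```python
def resolve_includes(suite, load_suite):
    """Sketch: flatten a suite's check list by splicing in the checks of
    each {__INCLUDE__: <file>} entry, recursively. `load_suite` maps a
    filename to the parsed suite dict."""
    checks = []
    for entry in suite.get("checks", []):
        if "__INCLUDE__" in entry:
            included = load_suite(entry["__INCLUDE__"])
            checks.extend(resolve_includes(included, load_suite))
        else:
            checks.append(entry)
    return checks

# Hypothetical in-memory suites standing in for the YAML files:
suites = {
    "AMF_file_info.yml": {"checks": [{"check_id": "filename_format"}]},
    "AMF_product_soil_land.yml": {"checks": [
        {"__INCLUDE__": "AMF_file_info.yml"},
        {"check_id": "soil_variable_check"},
    ]},
}
flat = resolve_includes(suites["AMF_product_soil_land.yml"], suites.get)
# flat == [{"check_id": "filename_format"}, {"check_id": "soil_variable_check"}]
```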
Usage: amf-checker [--yaml-dir <yaml dir>] [-o <output dir>] [-f <output format>] <dataset>...
Wrapper script around compliance-checker to automatically find and run the
relevant YAML checks for AMF datasets. See the --help output for detailed
help on the meaning of the available options.
<dataset> can be either the path to a NetCDF file or a directory, in which
case all files in the directory are checked. Multiple files/directories can be
given, so shell globs can be used, e.g.:
amf-checker /path/to/data/*.nc
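Note that the shell expands a glob like /path/to/data/*.nc before amf-checker runs, so each matching file arrives as a separate <dataset> argument. The argument handling described above could be sketched like this (illustrative only, not amf-checker's actual code):

```python
import os

def collect_datasets(paths):
    """Sketch of <dataset> handling: a directory contributes every file
    inside it; any other path is taken as a single NetCDF file."""
    files = []
    for path in paths:
        if os.path.isdir(path):
            files.extend(sorted(
                os.path.join(path, name)
                for name in os.listdir(path)
                if os.path.isfile(os.path.join(path, name))
            ))
        else:
            files.append(path)
    return files
```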
There are tests, which can be run using:
pytest amf_check_writer/tests.py