pacificclimate / data-prep-actions

Data Preparation actions - record of ephemera used to prepare data for PCIC data portals and other tools
0 stars 0 forks source link

Data Preparation Actions

Purpose

This repository brings together the shell scripts, configuration files, log file outputs, and other ephemera of data preparation actions. Keeping these in a repo has the following purposes and benefits:

  1. Record-keeping
  2. Easy re-use of scripts, etc., written for different but similar cases
  3. Single source for deployment and execution of these scripts

Security note

IMPORTANT

Dangerous combination:

Therefore:

Organization

Repo contents are organized by actions. The term "action" is loosely defined, but it encompasses the processing of a specific group of files with a specific goal in mind.

Example action: To form climatological means of CMIP5 output files and index them in the Climate Explorer modelmeta database (ce_meta). This description would form part (or all) of the DESCRIPTION.md file for that action. This action would be composed of the following sub-actions:

Actions are organized under the actions subdirectory as follows. Only subdirectories for sub-actions actually done need to be included.

    actions/
        <action-name>/
            DESCRIPTION.md
            convert-metadata/
                # updates.yaml file(s) for update_metadata script
                # shell script(s) for invoking updata_metadata
                # log files generated
            climo-means/
                # shell script(s) for adding files to jobqueue
                # log files generated
            index/
                # shell script(s) for invoking index_netcdf
                # log files generated
            modelmeta-fixup/
                # SQL query used to correct the filepath (or other fixup, heaven forbid)
                # log files generated

Further subdirectories can be added as needed (and ideally documented here) for additional sub-actions.

Recommended process

  1. Branch per action: When starting an action, create a new branch from master. Use the branch to manage the action. Merge the branch to master once the action is complete and all desired files have been added to the repo. A PR may be a convenient way to accomplish this.
  2. Issue per action: It may also be useful to use issues in this repo to manage the work to be done, i.e., the actions to be taken. This would couple neatly with PRs.

Caution re. large files

Log files can be bulky, so be careful about adding large log files to this repo.

GitHub recommends that repositories be kept under 1 GB, and enforces a strict limit of 100 MB per file. For large files, either store a pointer to our local storage, or use GitHub Large File Storage.

Also note that if large files are stored in this repo, then they are replicated wherever this repo is cloned. That may not be desirable.