Structure conventions:
Data files go in a /data directory (none of these files are ignored, so beware of dumping large files):
data/raw: original, untouched source files (e.g. Excel files we got from a researcher). Seldom used by us, as most of our data are in SQL databases.
data/interim: files created manually or by scripts to aid the process (e.g. a generated list of species, coordinates, an interim dump). Don't go overboard with this one: only dump files that might be useful to others.
data/processed: final data, to be published or referenced (e.g. DwC output). Use only when processing is done by a script. For SQL views, it's not useful to dump files here.
Notebooks and SQL files that generate the processed data go in /src
Vocabularies and other files needed by processing go in: /setting
Whip specifications go in: /specification
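As a rough illustration of the layout described above, the following sketch creates the subdirectories for one dataset. The function name `scaffold_dataset`, the `SUBDIRECTORIES` list, and the example shortname `my-dataset` are hypothetical and not part of this PR; the directory names themselves are taken from the conventions.

```python
import tempfile
from pathlib import Path

# Subdirectories named in the structure conventions.
SUBDIRECTORIES = [
    "data/raw",        # original, untouched source files
    "data/interim",    # helper files created manually or by scripts
    "data/processed",  # final data generated by script
    "src",             # notebooks and SQL files
    "setting",         # vocabularies and other files needed by processing
    "specification",   # whip specifications
]


def scaffold_dataset(root, shortname):
    """Create the conventional layout for one dataset and return its path.

    `root` and `shortname` are assumptions for this sketch: the parent
    directory of all datasets and the lowercase-with-dashes dataset name.
    """
    dataset_dir = Path(root) / shortname
    for sub in SUBDIRECTORIES:
        # exist_ok makes the call safe to repeat on an existing dataset.
        (dataset_dir / sub).mkdir(parents=True, exist_ok=True)
    return dataset_dir


if __name__ == "__main__":
    # Demonstrate in a throwaway directory rather than in the real repo.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_dataset(tmp, "my-dataset")
        for path in sorted(p for p in created.rglob("*") if p.is_dir()):
            print(path.relative_to(created))
```

Note that git does not track empty directories, so a real setup would likely also need placeholder files; that is left out here.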
Naming conventions:
Dataset directory = dataset shortname = lowercase-with-dashes (as it is used in the dataset URL): so, no changes here
All other files are lowercase_with_underscores (cf. our conventions for R and Python code). Capital letters are allowed where they make sense (acronyms, .Rmd).
DwC mapping scripts, specifications and outputs are named dwc_typeofcoreorextension, e.g. dwc_event.R, dwc_occurrence.yaml or dwc_measurementOrFact.csv. If a single script generates all the DwC mapping, call it dwc_mapping.R.
Notebooks no longer keep the name of the person who created them. Rather, name them in the order they should be executed: 1_gbif_match.ipynb, 2_verify_synonyms.ipynb.
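The naming rules above could be checked automatically. The sketch below is one possible reading of them, not an agreed-upon tool: the function names and regular expressions are assumptions, and because "where they make sense" cannot be tested mechanically, capital letters are simply allowed anywhere in file names (as in dwc_measurementOrFact.csv).

```python
import re

# Dataset directories: lowercase words separated by dashes.
DATASET_DIR = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")

# Other files: words separated by underscores, followed by an extension.
# Assumption: capitals are accepted everywhere, since "where they make
# sense" (acronyms, .Rmd) is a human judgment.
FILE_NAME = re.compile(r"[A-Za-z0-9]+(_[A-Za-z0-9]+)*\.[A-Za-z0-9]+")

# Notebooks: an execution-order number, then an underscore-separated name.
NOTEBOOK = re.compile(r"[0-9]+_[A-Za-z0-9]+(_[A-Za-z0-9]+)*\.ipynb")


def is_dataset_dir_name(name):
    """True if `name` looks like a dataset shortname, e.g. my-dataset."""
    return DATASET_DIR.fullmatch(name) is not None


def is_file_name(name):
    """True if `name` uses underscores as separators, e.g. dwc_event.R."""
    return FILE_NAME.fullmatch(name) is not None


def is_ordered_notebook_name(name):
    """True if `name` starts with an order number, e.g. 1_gbif_match.ipynb."""
    return NOTEBOOK.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ["dwc_event.R", "dwc-event.R", "1_gbif_match.ipynb"]:
        print(candidate, is_file_name(candidate))
```

A check like this could run over each dataset directory before a PR is accepted, but whether that is worth the overhead is left open.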
/cc @LienReyserhove @DimEvil @stijnvanhoey If you have no comments, I'll add these conventions to the README of our repo and accept this PR.
This pull request implements the Cookiecutter Data Science structure for all dataset directories.