subugoe / leine

Data Pipelines for @subugoe/wag
https://subugoe.github.io/leine
MIT License

leine

Main Codecov test coverage CRAN status Lifecycle: experimental

The Leine (German: [ˈlaɪnə]; Old Saxon Lagina) is a river in Thuringia and Lower Saxony, Germany. (Source: Wikipedia)

WAG runs several big data pipelines used in various data products.

These pipelines, though largely not themselves run in R, are here organised into an R package.

Design

WAG is a relatively small team of data analysts, serving academic and librarian stakeholders with various data products.

The data engineering of our pipelines has to work within these constraints.

Priorities

From this follows:

  1. Cheap compute is good, but convenience is better. Our workloads are comparatively small; labor costs are a much bigger cost driver.
  2. Special-purpose tools are good, but standardising on fewer tools is better. Given our small team and its occasional turnover, we can only support very few tools.
  3. Working prototypes are good, but reproducibility is better. For our academic (as well as librarian) stakeholders, reproducibility trumps all else.
  4. Interactive, one-off results are good, but automation, testing and documentation are better. Given churn (and vacation, context-switching, etc.), we must avoid low bus factors. Data pipelines, especially, must be designed so that they can be run and maintained without the original developer.

ELT

Our data pipelines follow an extract-load-transform (ELT) paradigm. They are centered on a "data river" (or data lake) hosted on the Google Cloud Platform (GCP).

  1. Data river
    1. Data is extracted from sources in its rawest form into GCP Cloud Storage (for long-term, versioned coldline storage).
    2. Data is then loaded into GCP BigQuery. If the source data is schemaless or does not comply with its schema, it is loaded without a schema, with each entire unparsed entry stored as a single cell.
  2. Data warehouse
    1. Data is then transformed into a canonical form in GCP BigQuery with a well-defined schema.
  3. Data mart
    1. Data is then further transformed according to the shared needs of WAG's data products in GCP BigQuery.
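
As a rough illustration of the three stages above, the following Python sketch mimics them in memory. It does not call Cloud Storage or BigQuery. The sample records, the field names (`doi`, `year`, `publisher`) and the function names are assumptions invented for this example and are not part of leine.

```python
import json
from collections import Counter

# Hypothetical raw export: one JSON document per line, roughly as it
# might be stored after extraction. Field names are made up.
RAW_LINES = [
    '{"doi": "10.1000/A1", "year": "2020", "publisher": "Alpha Press"}',
    '{"doi": "10.1000/b2", "year": 2021, "publisher": "Beta Verlag", "extra": {"oa": true}}',
    '{"doi": "10.1000/c3", "publisher": "Alpha Press"}',
]


def load_raw(lines):
    """Data river: load without a schema, one unparsed entry per cell."""
    return [{"raw": line} for line in lines]


def to_warehouse(raw_rows):
    """Data warehouse: parse each cell into one canonical, typed schema."""
    rows = []
    for row in raw_rows:
        entry = json.loads(row["raw"])
        year = entry.get("year")
        rows.append({
            "doi": entry["doi"].lower(),                    # normalise case
            "year": int(year) if year is not None else None,  # unify types
            "publisher": entry.get("publisher"),
        })
    return rows


def to_mart(warehouse_rows):
    """Data mart: aggregate for a (hypothetical) shared product need."""
    return dict(Counter(r["publisher"] for r in warehouse_rows))


river = load_raw(RAW_LINES)
warehouse = to_warehouse(river)
mart = to_mart(warehouse)
print(mart)
```

The point of the sketch is the ordering: parsing and type coercion happen only after loading, so inconsistent source records (a string year, a missing year, an unexpected `extra` field) never block the load step and can be re-transformed later from the untouched raw cells.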