m-lab / etl-gardener

Apache License 2.0

gardener

Gardener provides services for maintaining and reprocessing M-Lab data.

Overview

The v2 data pipeline depends on the gardener for daily and historical processing.

Daily processing starts around 10:30 UTC each day, which allows time for nodes to upload data and for the daily transfer jobs to copy that data to the public archives. Historical processing currently runs continuously: as soon as one pass completes, the gardener starts again from its start date.

In both modes, the gardener issues Jobs (dates) to parsers that request them. A parser enumerates all files for that date, parses each one, and reports status updates for the Job date to the gardener until all files are complete.
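The job hand-off described above might look roughly like the following loop. Everything here is hypothetical: the `Job` type, the in-memory `fakeGardener`, the status strings, and the `listFiles` and `parse` callbacks stand in for the real Jobs API and parser, whose exact shapes are not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// Job identifies one date of data to process (hypothetical type).
type Job struct {
	Date string // YYYY-MM-DD
}

var errNoJobs = errors.New("no jobs available")

// fakeGardener is an in-memory stand-in for the gardener: it hands out
// pending jobs and records the status updates it receives.
type fakeGardener struct {
	pending []Job
	updates []string
}

// Next returns the next pending job, or errNoJobs when none remain.
func (g *fakeGardener) Next() (Job, error) {
	if len(g.pending) == 0 {
		return Job{}, errNoJobs
	}
	j := g.pending[0]
	g.pending = g.pending[1:]
	return j, nil
}

// Update records a status report for a job.
func (g *fakeGardener) Update(j Job, status string) {
	g.updates = append(g.updates, j.Date+": "+status)
}

// runParser requests jobs until none remain. For each job it enumerates
// the files for that date, parses each, and reports progress.
func runParser(g *fakeGardener, listFiles func(date string) []string, parse func(file string) error) error {
	for {
		job, err := g.Next()
		if errors.Is(err, errNoJobs) {
			return nil
		}
		if err != nil {
			return err
		}
		files := listFiles(job.Date)
		for i, f := range files {
			if err := parse(f); err != nil {
				g.Update(job, fmt.Sprintf("error parsing %s: %v", f, err))
				return err
			}
			g.Update(job, fmt.Sprintf("parsed %d/%d files", i+1, len(files)))
		}
		g.Update(job, "complete")
	}
}

func main() {
	g := &fakeGardener{pending: []Job{{Date: "2024-03-01"}}}
	listFiles := func(date string) []string {
		return []string{date + "/a.tgz", date + "/b.tgz"}
	}
	parse := func(file string) error { return nil }

	if err := runParser(g, listFiles, parse); err != nil {
		fmt.Println("parser stopped:", err)
	}
	for _, u := range g.updates {
		fmt.Println(u)
	}
}
```

In the real system the parser talks to the gardener over the Jobs API rather than calling methods on a local struct; the sketch only shows the order of operations (request, enumerate, parse, report).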

Jobs API

Parsers request new date jobs from the gardener via the Jobs API. The API supports four operations:

These resources are served at the address given by the -gardener_addr flag.

Status Page

Gardener maintains a status page on a separate status port that summarizes recent jobs, current state, and any errors. Jobs transition through the following stages:

The status page is served on the port given by the -status_port flag.

Local Development with Parser

Both the gardener and the parsers support a local development mode. To run both, follow these steps.

Create a test configuration, e.g. test.yml, with a subset of the production configuration that includes only the datatype you are working with.

Run the gardener v2 ("manager" mode) with local writer support:

go get ./cmd/gardener
~/bin/gardener \
    -project=mlab-sandbox \
    -status_port=:8082 \
    -gardener_addr=localhost:8081 \
    -prometheusx.listen-address=:9991 \
    -config_path=config/test.yml \
    -saver.backend=local \
    -saver.dir=singleton

Run the parser to target the local gardener:

go get ./cmd/etl_worker
gcloud auth application-default login
~/bin/etl_worker \
  -gardener_addr=localhost:8081 \
  -output_dir=./output \
  -output=local

If the start_date in the input test.yml for your datatype includes archive files, then the parser should begin parsing archives immediately and writing them to the ./output directory.

Unit Testing

Some of the gardener packages depend on complex, third-party services. To accommodate these, the gardener unit tests are split into three categories: