This repository contains schemas for Mozilla's data ingestion pipeline and data lake outputs.
The JSON schemas are used to validate incoming submissions at ingestion time. They are also used as the source of truth for defining metadata about how each document type should be handled in the pipeline (see the metaschema).
The jsonschema (Python) and everit-org/json-schema (Java) libraries (using draft 4) are used for JSON Schema validation in this repository's tests. This has implications for what kinds of string patterns are supported; see the Conformance section in the linked document for further details.
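As a minimal sketch of draft 4 validation via the Python `jsonschema` library (the schema and documents below are invented for illustration and are not real pipeline schemas):

```python
# Illustrative only: a made-up draft 4 schema and documents, validated the
# way this repository's Python tests do (via the jsonschema library).
import jsonschema

schema = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "type": "object",
    "properties": {
        "clientId": {
            "type": "string",
            # String patterns must stay within what both the Python and
            # Java validators support (see the Conformance notes above).
            "pattern": "^[0-9a-fA-F-]{36}$",
        }
    },
    "required": ["clientId"],
}

validator = jsonschema.Draft4Validator(schema)

valid_doc = {"clientId": "5b9c2f8e-0c1a-4b2d-9e3f-7a6b5c4d3e2f"}
errors = list(validator.iter_errors(valid_doc))
print(len(errors))  # 0: the document validates

invalid_doc = {"clientId": "not-a-uuid"}
bad_errors = list(validator.iter_errors(invalid_doc))
print(len(bad_errors))  # 1: pattern mismatch
```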
Note that as of 2019, the data pipeline uses the
everit-org/json-schema library
for validation in production (see
#302).
To learn more about writing JSON Schemas, Understanding JSON Schema is a great resource.
When adding a new schema, add it to the `templates` directory first. Make use of common schema components from the `templates/include` directory where possible, including things like the telemetry `environment`, `clientId`, `application` block, or UUID patterns. The filename should be `templates/<namespace>/<doctype>/<doctype>.<version>.schema.json`.
Check the built schemas (in the `schemas` directory) in to the git repo as well. See the rationale for this in the "Notes" section below.

Add example pings to the `validation` directory.

Open a PR against the `main` branch. See also the notes on contributions.

Prerequisites:

- CMake (3.0+)
- jq (1.5+)
- python (3.6+)
- java 11, maven
On macOS, these prerequisites can be installed using Homebrew:
brew install cmake
brew install jq
brew install python
brew install --cask docker
git clone https://github.com/mozilla-services/mozilla-pipeline-schemas.git
cd mozilla-pipeline-schemas
mkdir release
cd release
cmake .. # this is the build process (the schemas are built with cmake templates)
You can generally skip this step if you're just making a small change to an existing schema: tests are automatically run via continuous integration.
The tests expect example pings to be in the validation/<namespace>/
subdirectory, with files named
in the form <ping type>.<version>.<test name>.pass.json
for documents expected to be valid, or
<ping type>.<version>.<test name>.fail.json
for documents expected to fail validation.
The test name should match the pattern `[0-9a-zA-Z_]+`.
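The naming convention above can be expressed as a regular expression. This is an illustrative check, not part of the repository's tooling:

```python
# Illustrative check that a validation document's filename follows the
# documented convention:
#   <ping type>.<version>.<test name>.(pass|fail).json
# where the test name matches [0-9a-zA-Z_]+.
import re

PATTERN = re.compile(
    r"^(?P<doctype>[0-9a-zA-Z-]+)\."   # ping type, e.g. "main"
    r"(?P<version>\d+)\."              # schema version, e.g. "4"
    r"(?P<test>[0-9a-zA-Z_]+)\."       # test name
    r"(?P<expect>pass|fail)\.json$"    # expected validation outcome
)

print(bool(PATTERN.match("main.4.sample_ping.pass.json")))  # True
print(bool(PATTERN.match("main.4.bad name!.fail.json")))    # False: space and "!"
```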
To run the tests, make use of the wrapper scripts:
./scripts/mps-build
./scripts/mps-test
Follow the CMake Build Instructions above to update the schemas
directory.
To run the unit-tests, run the following commands:
# optional: activate a virtual environment with python3.6+
python3 -m venv venv
source venv/bin/activate
# install python dependencies, if they haven't already
pip install -r requirements-dev.txt
pip install .
# run the tests, with 8 parallel processes
pytest -n 8
# run tests for a specific namespace and doctype
pytest -k telemetry/main.4
# run java tests only (if Java is configured)
pytest -k java
To generate a diff of BigQuery schemas, use the mps
command-line tool.
# optionally, enter the mozilla-pipeline-schemas environment
# for jsonschema-transpiler and python3 dependencies
./scripts/mps-shell
# generate an integration folder; --base-ref and --head-ref default to
# main and HEAD respectively
mps bigquery diff --base-ref main --head-ref HEAD
This generates an integration
folder:
integration
├── bq_schema_f59ca95-d502688.diff
├── d502688
│ ├── activity-stream.events.1.bq
│ ├── activity-stream.impression-stats.1.bq
...
│ └── webpagetest.webpagetest-run.1.bq
└── f59ca95
├── activity-stream.events.1.bq
├── activity-stream.impression-stats.1.bq
...
└── webpagetest.webpagetest-run.1.bq
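Conceptually, the `.diff` file at the top of the integration folder is a textual diff between the two revisions' schema listings. A rough stand-in using only the standard library (the filenames and column lines below are invented for illustration):

```python
# Illustrative sketch of producing a diff between two revisions' BigQuery
# schema listings, in the spirit of bq_schema_<head>-<base>.diff.
# The revision names and columns here are made up.
import difflib

base = [
    "document_id STRING REQUIRED",
    "payload RECORD NULLABLE",
]
head = [
    "document_id STRING REQUIRED",
    "normalized_channel STRING NULLABLE",  # hypothetical new column
    "payload RECORD NULLABLE",
]

diff = list(difflib.unified_diff(
    base, head,
    fromfile="d502688/example.1.bq",
    tofile="f59ca95/example.1.bq",
    lineterm="",
))
print("\n".join(diff))
```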
Pushes to the main repo will trigger integration tests in CircleCI that directly
compare the revision to the main
branch. These tests do not run for forked PRs
in order to protect data and credentials, but reviewers can trigger tests to run
by pushing the PR's revisions to a branch of the main repo. We provide a script for this:
# Before running, double check that the PR doesn't make any changes to
# .circleci/config.yml that could spill sensitive environment variables
# or data contents to the public CircleCI logs.
./.github/push-to-trigger-integration <username>:<branchname>
For details on how to compare two arbitrary revisions, refer to the integration
job in .circleci/config.yml
. For more documentation, see mozilla-services/edge-validator.
The repository has an `mps` command-line tool for checking the output of schema transformations used for BigQuery. Enter the shell using `scripts/mps-shell`.
To transpile a schema for BigQuery:
schema=schemas/glean/glean/glean.1.schema.json
mps bigquery transpile $schema
It may be useful to look at a compact version of the output:
schema=schemas/glean/glean/glean.1.schema.json
mps bigquery transpile $schema | mps bigquery columns /dev/stdin
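The compact view amounts to flattening nested record fields into dotted column paths. A hypothetical helper (not the real `mps bigquery columns` implementation) sketching the idea:

```python
# Hypothetical helper that flattens a BigQuery JSON schema into one
# "path type mode" line per leaf column. The sample schema is invented.
import json

def columns(fields, prefix=""):
    """Recursively collect dotted column paths from a BigQuery schema."""
    lines = []
    for f in fields:
        path = f"{prefix}{f['name']}"
        if f.get("type") == "RECORD":
            lines.extend(columns(f.get("fields", []), path + "."))
        else:
            lines.append(f"{path} {f['type']} {f.get('mode', 'NULLABLE')}")
    return lines

bq_schema = json.loads("""
[
  {"name": "document_id", "type": "STRING", "mode": "REQUIRED"},
  {"name": "metrics", "type": "RECORD", "mode": "NULLABLE", "fields": [
    {"name": "counter", "type": "INT64", "mode": "NULLABLE"}
  ]}
]
""")

for line in columns(bq_schema):
    print(line)
# document_id STRING REQUIRED
# metrics.counter INT64 NULLABLE
```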
The output of the ingestion sink can be viewed for validation documents.
validation=validation/glean/glean.1.baseline.pass.json
mps bigquery transform $validation | jq
Any value that is not captured in the schema is put into `additional_properties`.
validation=validation/glean/glean.1.baseline.pass.json
mps bigquery transform $validation | jq '.additional_properties'
"{\"$schema\":\"moz://mozilla.org/schemas/glean/ping/1\"}"
There is a daily series of tasks run by Airflow (see the
probe_scraper
DAG)
that uses the main
branch of this repository as input and ends up pushing
final JSONSchema and BigQuery schema files to the generated-schemas
branch.
As of January 2020, deploying schema changes still requires manual intervention
by a member of the Data Ops team, but you can generally expect schemas to be
deployed to production BigQuery tables several times a week.
Changes to Glean schemas should be accompanied by an update to `include/glean/CHANGELOG.md`.

If a PR touches files referenced in `CODEOWNERS`, a CODEOWNER (usually SRE) will automatically be assigned to review the PR. Please follow additional change control procedures for PRs referencing these schemas. The CODEOWNER will be responsible for merging the PR once it has been approved.

Please title your PR in the form `Bug XXX - Description of change`; that way the Bugzilla PR Linker will automatically add an attachment with your PR to Bugzilla for future reference.

All schemas are generated from the `templates` directory and written into the `schemas` directory (i.e., the generated artifacts are checked back into the repository) and validated against the draft 4 schema, a copy of which resides in the `tests` directory. The reason for this is twofold:
We have a number of scripts to keep the schemas in sync with various in-tree
definitions. See the contents of the scripts
subdirectory.