A runtime system for NMDC data management and orchestration.
http://nmdcstatus.polyneme.xyz/
- issues: tracks issues related to NMDC, which may necessitate work across multiple repos.
- nmdc-schema: houses the LinkML schema specification, as well as generated artifacts (e.g. JSON Schema).
- nmdc-server: houses code specific to the data portal -- its database, back-end API, and front-end application.
- workflow_documentation: references workflow code spread across several repositories that take source data and produce computed data.
- This repo (nmdc-runtime): houses the runtime system itself, described below.
The NMDC metadata as of 2021-10 is available here:
https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys086d541
The link returns a GA4GH DRS API bundle object record, with the NMDC metadata collections (study_set, biosample_set, etc.) as contents, each a DRS API blob object.
For example, the blob for the study_set collection export, named "study_set.jsonl.gz", is listed with DRS API ID "sys0xsry70". Thus, it is retrievable via
https://drs.microbiomedata.org/ga4gh/drs/v1/objects/sys0xsry70
The returned blob object record lists https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/study_set.jsonl.gz as the url for an access method.
The 2021-10 exports are currently all accessible at https://nmdc-runtime.files.polyneme.xyz/nmdcdb-mongoexport/2021-10-14/${COLLECTION_NAME}.jsonl.gz, but the DRS API indirection allows these links to change in the future, for mirroring via other URLs, etc. So, the DRS API links should be the links you share.
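For illustration, here is a minimal sketch of following that indirection programmatically. It assumes the Python requests library; field access follows the GA4GH DRS v1 response schema (contents, access_methods, access_url):

```python
import requests

DRS_BASE = "https://drs.microbiomedata.org/ga4gh/drs/v1/objects"

# Fetch the 2021-10 metadata bundle; its `contents` list the per-collection
# blob objects (study_set, biosample_set, etc.) with their DRS IDs.
bundle = requests.get(f"{DRS_BASE}/sys086d541").json()
for item in bundle.get("contents", []):
    print(item.get("name"), item.get("id"))

# Fetch the blob record for the study_set export and read its access URL.
blob = requests.get(f"{DRS_BASE}/sys0xsry70").json()
url = blob["access_methods"][0]["access_url"]["url"]
print(url)  # currently .../nmdcdb-mongoexport/2021-10-14/study_set.jsonl.gz
```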
The runtime features:
- Dagster orchestration:
    - Orchestration code is organized as a Dagster workspace. This code is loaded from one or more Dagster repositories. Each Dagster repository may be run with a different Python virtual environment if need be, and may be loaded from a local Python file or pip installed from an external source. In our case, each Dagster repository is simply loaded from a Python file local to the nmdc-runtime GitHub repository, and all code is run in the same Python environment.
    - A Dagster repository defines solids and pipelines, and optionally schedules and sensors (a minimal sketch follows this list):
        - solids represent individual units of computation
        - pipelines are built up from solids
        - schedules trigger recurring pipeline runs based on time
        - sensors trigger pipeline runs based on external state
    - A pipeline can declare dependencies on any runtime resources or additional configuration. There are MongoDB resources defined, as well as preset configuration definitions for both "dev" and "prod" modes. The presets tell Dagster to look to a set of known environment variables to load resource configurations, depending on the mode.
- A MongoDB database supporting write-once, high-throughput internal data storage by the nmdc-runtime FastAPI instance.
- A FastAPI service to interface with the orchestrator and database, as a hub for data management and workflow automation.
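To make the solid and pipeline vocabulary concrete, here is a minimal sketch using the legacy Dagster API (solids and pipelines) described above. The solid and pipeline names are hypothetical, not actual nmdc-runtime code:

```python
from dagster import pipeline, solid


@solid
def fetch_collection_names(context):
    # Hypothetical solid: a real one might read collection names from a
    # MongoDB resource declared on the pipeline.
    return ["study_set", "biosample_set"]


@solid
def report_names(context, names):
    context.log.info(f"Got {len(names)} collection name(s): {names}")


@pipeline
def example_pipeline():
    # Pipelines are built up from solids; the output of one solid
    # feeds the input of the next.
    report_names(fetch_collection_names())
```

In the actual repository, pipelines additionally declare resources (e.g. MongoDB) and use the "dev"/"prod" presets to pull resource configuration from environment variables.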
Ensure Docker (and Docker Compose) are installed; and the Docker engine is running.
docker --version
docker compose version
docker info
Ensure the permissions of ./mongoKeyFile are such that only the file's owner can read or write the file.
chmod 600 ./mongoKeyFile
Ensure you have a .env file for the Docker services to source from. You may copy .env.example to .env (which is gitignore'd) to get started.
cp .env.example .env
Create environment variables in your shell session, based upon the contents of the .env file.
set -a # automatically export all variables
source .env
set +a
If you are connecting to resources that require an SSH tunnel—for example, a MongoDB server that is only accessible on the NERSC network—set up the SSH tunnel.
The following command could be useful to you, either directly or as a template (see Makefile).
make nersc-mongo-tunnels
Finally, spin up the Docker Compose stack.
make up-dev
Docker Compose is used to start local MongoDB and PostgreSQL (used by Dagster) instances, as well as a Dagster web server (dagit) and daemon (dagster-daemon).
The Dagit web server is viewable at http://127.0.0.1:3000/.
The FastAPI service is viewable at http://127.0.0.1:8000/ -- e.g., rendered documentation at http://127.0.0.1:8000/redoc/.
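As a quick smoke test, you can confirm the API is up and list its routes. This is a sketch assuming the Python requests library; FastAPI serves its OpenAPI document at /openapi.json by default:

```python
import requests

# FastAPI exposes the OpenAPI schema for the running service by default.
resp = requests.get("http://127.0.0.1:8000/openapi.json")
resp.raise_for_status()
schema = resp.json()
print(schema["info"]["title"])
for path in sorted(schema["paths"]):
    print(path)
```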
Tests can be found in tests and are run with the following commands:
On an M1 Mac? You may need to export DOCKER_DEFAULT_PLATFORM=linux/amd64.
make up-test
make test
As you create Dagster solids and pipelines, add tests in tests/ to check that your code behaves as desired and does not break over time.
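For example, a test for a single solid might look like the following sketch, which assumes pytest and the legacy Dagster execute_solid helper; the solid here is hypothetical:

```python
from dagster import execute_solid, solid


@solid
def add_one(context, number):
    # Hypothetical solid used only to illustrate testing.
    return number + 1


def test_add_one():
    # execute_solid runs a single solid in isolation with the given inputs.
    result = execute_solid(add_one, input_values={"number": 1})
    assert result.success
    assert result.output_value() == 2
```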
This repository contains a GitHub Actions workflow that publishes a Python package to PyPI.
You can also manually publish the Python package to PyPI by issuing the following commands in the root directory of the repository:
rm -rf dist
python -m build
twine upload dist/*
Here are links related to this repository: