monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Dipper

Dipper is a Python package to generate RDF triples from common scientific resources. Dipper includes subpackages and modules to create graphical models of these data.

Installing Dipper:

Dipper requires Python 3.6 or higher.

Getting started:

Building locally

To build locally, clone this repo and install the requirements using pip.

Note that Dipper imports source modules dynamically at runtime. As a result, it is possible to build a core set of requirements and add source-specific dependencies as needed. Presently this is only implemented with pip requirements files. For example, to build dependencies for MGI:

    pip3 install -r requirements.txt
    pip3 install -r requirements/mgi.txt

To install dependencies for all sources:

    pip3 install -r requirements.txt
    pip3 install -r requirements/all-sources.txt
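
Because sources are imported dynamically, a source-specific dependency is only needed when that source is actually loaded. The pattern can be sketched with Python's importlib as follows. The helper name load_source is hypothetical and is not Dipper's actual loader, and a standard-library module stands in for a source module so the snippet runs without Dipper installed.

```python
import importlib


def load_source(module_path, class_name):
    """Import a module by dotted path at runtime and return one of its attributes.

    Hypothetical helper for illustration; not part of Dipper's API.
    """
    try:
        module = importlib.import_module(module_path)
    except ImportError as exc:
        # A missing source-specific dependency would surface here, at load
        # time, rather than when the core package is imported.
        raise ImportError(
            f"Could not import {module_path!r}; "
            "are the requirements for this source installed?"
        ) from exc
    return getattr(module, class_name)


# Stand-in example: load a class from the standard library by name.
decoder_cls = load_source("json", "JSONDecoder")
print(decoder_cls.__name__)
```

With this approach, installing only requirements.txt is enough until a source whose module needs extra packages is requested.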

If you encounter any errors installing these packages using Homebrew, it could be due to a current known issue in upgrading to pip3. In this case, first force-reinstall pip2 (pip2 install --upgrade --force-reinstall pip) and then install the package using pip3 (e.g., pip3 install psycopg2).

Documentation:

The full documentation, including API docs, can be found on Read the Docs (https://dipper.readthedocs.io/en/latest/).

Sources:

Identifiers

Our identifier documentation is referenced in our recent paper on identifiers (doi:10.1371/journal.pbio.2001414, https://doi.org/10.1371/journal.pbio.2001414).

For instance, Monarch has type-agnostic in-house redirection rules like https://monarchinitiative.org/<curie> where the curie is in prefixed notation like OMIM:154700.
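
As a minimal sketch of the redirection rule above, the following builds a Monarch URL from a CURIE. The function name monarch_url and the validation pattern are assumptions made for this example; they are not taken from Dipper.

```python
import re

MONARCH_BASE = "https://monarchinitiative.org/"

# Assumed CURIE shape for this sketch: PREFIX:LOCAL_ID with no whitespace.
CURIE_PATTERN = re.compile(r"^[A-Za-z][\w.-]*:\S+$")


def monarch_url(curie):
    """Return the Monarch redirection URL for a CURIE such as OMIM:154700."""
    if not CURIE_PATTERN.match(curie):
        raise ValueError(f"Not a prefixed identifier: {curie!r}")
    return MONARCH_BASE + curie


print(monarch_url("OMIM:154700"))
# prints https://monarchinitiative.org/OMIM:154700
```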

The kinds of external identifiers that we reference are listed here https://github.com/monarch-initiative/dipper/blob/master/dipper/curie_map.yaml
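
A prefix map of this kind is typically used to expand a CURIE into a full IRI. The sketch below shows the idea with an inline dictionary instead of the YAML file. The function expand_curie is hypothetical, and the namespace shown for OMIM is an assumption for illustration that may differ from the entry in curie_map.yaml.

```python
# Illustrative prefix map only; the real entries live in curie_map.yaml
# and may use different namespaces.
PREFIX_MAP = {
    "OMIM": "https://omim.org/entry/",  # assumed value
    "EX": "https://example.org/id/",    # placeholder prefix
}


def expand_curie(curie, prefix_map):
    """Expand PREFIX:LOCAL_ID to a full IRI using the given prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefix_map:
        raise KeyError(f"Unknown or missing prefix in {curie!r}")
    return prefix_map[prefix] + local_id


print(expand_curie("OMIM:154700", PREFIX_MAP))
```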

For more information on our identifiers, see here.

About the Dipper project

The Dipper data pipeline was born out of the need for a uniform representation of human and model organism genotype-to-phenotype data, and an Extract-Transform-Load (ETL) pipeline to process it all. It became too cumbersome to first get all of these data into a relational schema; so, we decided to go straight from each source into triples that are semantically captured, using standard modeling patterns. We are currently working on tooling around defining, documenting, and constraining our schema as biolink models.

Citing Dipper

A manuscript on the Dipper pipeline is in preparation. In the meantime, if you use any of our code or derived data, please cite this repository and doi: 10.1093/nar/gkw1128.