Brain Data Standards Ontologies

A repository for building ontologies for the Brain Data Standards Project.

Status: Draft

Cite:

Overview:

The main purpose of this repo is to automate data driven cell-type ontology development for the Brain Data Standards initiative. The main inputs are:

Dendrograms in JSON format, provided by the Allen Institute, encoding a data-driven classification of brain cell types. These files also include a nomenclature standard (and mapping system) developed by the Allen Institute: https://arxiv.org/abs/2006.05406. See dendrogram spec for details.
CSV files identifying and summarising dendrograms, including species and anatomical region
CSV mapping files that combine dendrogram nodes into groupings that do not correspond to any single dendrogram node, but do correspond to known cell types.
Marker files (robot templates) that map marker combinations with high predictive capacity for dendrogram nodes (generated by NS-Forest) onto those nodes
Automatically seeded, manually curated robot templates mapping nodes to classes in CL and to various properties (e.g. soma location)

Figure 1: Build overview

The build system is an extended version of the Ontology Development Kit (ODK), an automated ontology build system using ROBOT and Makefiles. As well as managing the build from input files, it automatically generates modules from referenced ontologies and integrates these into the build.

Schema

Schema doc

Building

You will need Docker installed. Running a build will pull the required containers with all required dependencies.

To build

cd src/ontology
sh ./run.sh make prepare_release

This dynamically updates imports and builds reasoned release files. The slowest part of the build is mirroring (downloading and reserialising) external ontologies. If you've run a build recently, mirrored versions will already be stored in src/ontology/mirror. To run a build without mirroring:

cd src/ontology
sh ./run.sh make prepare_release MIR=false
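
The choice between the two invocations above can be sketched as a small helper. This is only an illustration: `build_args` is a hypothetical function, not part of the repo, and it assumes that a non-empty mirror directory means the mirrored ontologies are recent enough to reuse.

```shell
# Hypothetical helper (not part of the repo): print the make arguments to use,
# skipping mirroring when the given mirror directory already has content.
build_args() {
  mirror_dir=$1
  if [ -d "$mirror_dir" ] && [ -n "$(ls -A "$mirror_dir" 2>/dev/null)" ]; then
    # Mirrors are present: reuse them (assumes they are recent enough).
    echo "prepare_release MIR=false"
  else
    # No mirrors yet: run the full build, including mirroring.
    echo "prepare_release"
  fi
}
```

From src/ontology, this could be used as `sh ./run.sh make $(build_args mirror)`. It does not check how old the mirrored files are, so a full build is still needed whenever the external ontologies should be refreshed.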

To extend the set of ontologies that are imported from, edit bdscratch-odk.yaml to add the required ontology to import_group.products, then run:

sh ./run.sh make update_repo

Then update the import statements in src/ontology/bdscratch-edit.owl.

Extensions to the standard ODK MakeFile build

Extensions to the build are specified (as per ODK standard) in bdscratch.Makefile.

Building robot templates from Dendrograms

Dendrograms live in /src/dendrograms/. They are named according to their Allen Dendrogram ID, e.g. CCN201908210.json

We expect dendrograms to remain stable for relatively long periods of time and at least some generated Robot templates are intended to be manually edited to map to CL classes / property driven classification. For these reasons, we store generated templates on the repo and build them as needed using a separate MakeFile - src/dendrograms/Makefile.

To build (be careful you don't wipe out curation!):

cd src/dendrograms
# Build all
sh ./run.sh make
# Build specific template
sh ./run.sh make <template_filename>
# Build a specific set of templates
sh ./run.sh make JOBS=<dendrogram_id>

Templates are built from dendrograms using Python scripts in src/scripts.

Extended information about groupings of taxonomy nodes that are candidates for curation is stored in additional TSV files (accession.tsv). Support for incorporating this information into templates is TBA.

Robot templates

Robot templates live in /src/templates/.

Filename	Example	Description
{accession}.tsv	CCN201810310.tsv	Template for generating the taxonomy as OWL individuals.
{accession}_class.tsv	CCN201810310_class.tsv	Template for generating classes corresponding to the OWL individuals in the taxonomy. Includes slots for curating cell type and properties.
{accession}_markers.tsv	CCN201810310_markers.tsv	Template for adding markers. Referenced markers must be present in the gene reference files.
{ensembl_gene_file}.tsv	ensmusg.tsv	Robot template listing all genes (all possible markers) for analyses/dendrograms of a specific species.

The {ensembl_gene_file} name is the standard Ensembl ID prefix in lowercase, e.g. ensmusg.tsv (Ensembl mouse gene) contains genes with IDs of the form ENSMUSG{numeric_accession}.
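
As a minimal sketch of the naming rule above, the following derives the gene file name from an Ensembl gene ID. `gene_file_for` is a hypothetical helper, not part of the repo, and it assumes the ID consists of a letter prefix followed only by digits.

```shell
# Hypothetical helper (not part of the repo): map an Ensembl gene ID to the
# name of the gene reference template expected to list it.
gene_file_for() {
  # Strip the trailing numeric accession, leaving the prefix (e.g. ENSMUSG).
  prefix=$(printf '%s' "$1" | sed 's/[0-9]*$//')
  # Lowercase the prefix and append the .tsv extension.
  printf '%s.tsv\n' "$(printf '%s' "$prefix" | tr '[:upper:]' '[:lower:]')"
}

gene_file_for ENSMUSG00000000001   # prints ensmusg.tsv
```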

Markers

Markers are referenced by Ensembl ID using an identifiers.org URL scheme.

Ensembl gene file templates are used to generate mirror files, which act as source files for import generation, so that only referenced markers end up in the release files.
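
For illustration, a marker reference might be built from a gene ID as below. The exact URL pattern is an assumption (`http://identifiers.org/ensembl/` followed by the gene ID); check an existing markers template for the form actually used. `marker_iri` is a hypothetical helper, not part of the repo.

```shell
# Assumption: marker IRIs take the form http://identifiers.org/ensembl/<gene ID>.
# marker_iri is a hypothetical helper, not part of the repo.
marker_iri() {
  printf 'http://identifiers.org/ensembl/%s\n' "$1"
}

marker_iri ENSMUSG00000000001
```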

Reference Gene Files

GTF files used as a reference for BDSO can be found in this Google Drive folder.
