Question: emerging source wants to know how to best model data in RDF

matentzn commented 4 years ago

I am working with people who are currently figuring out their data models. For example, I currently have a case where a group wants to model gene expression with ECO codes, sample types, etc. Is there anything (tutorials, docs) we can recommend on how to model this data, so that the future design of a Dipper ingest is as straightforward as possible?

justaddcoffee commented 4 years ago

Others may have better/more specific advice, but a few bits of general advice:

As long as they document their modeling well, we should be able to ingest it without too much trouble
Our current Bgee ingest models gene expression, so it might be worth checking out how this ingest works and patterning their model similarly: https://github.com/monarch-initiative/ingest-artifacts/blob/master/sources/BGee/Bgee_20170112.jpg
It might be useful to point them to the Dipper docs, especially this bit to get an idea of how we model things
One more practical item: the single most common point of failure for Dipper ingests seems to be unstable URLs of artifacts we ingest, so if they can put their data at some URL that is as permanent as possible, that would be helpful

matentzn commented 4 years ago

@justaddcoffee This is exactly what I needed.. Thanks a ton!

TomConlin commented 4 years ago

Hi Nico, I mostly try to keep existing models coherent as code and data change, but I will attempt to state the obvious without wading into the deep end of ontologyland modeling.

General modeling

Model the experiment/data and let the biology emerge (or not) on its own.
Consider how you would justify making a distinction to a lumper.
Consider how you would justify a generality to a splitter.
Avoid replicating ontologies in your data outputs.
Use ontology labels exclusively in code and documentation. (Since we have to read and reason about code, keep it literate)

Dipper specific

Dipper's job is to regularize a Source's data into RDF statements in such a way that they may be combined with statements from other Sources in meaningful ways. To do this, the Source's data is typically reduced and decorated with ontological terms.

This leads to some rules of thumb.

Many Sources have data that is more specific than will be useful in Dipper. Strive to omit what you can and provide links back to the Source for the details.

The Source's data is often the subject of an association. That association is always an ontological term (the RDF predicate). Often the RDF object will be an ontological term, but other times it is just a terminal fact.
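A minimal sketch of that subject/predicate/object shape, written as plain N-Triples strings rather than Dipper's own graph code. Every IRI under example.org and the helper names are hypothetical; only the rdfs:label IRI is a real one.

```python
def iri(value):
    """Wrap an IRI in angle brackets, as N-Triples requires."""
    return "<" + value + ">"


def literal(value):
    """Quote a plain string literal, escaping backslashes, quotes and newlines."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def triple(subject, predicate, obj):
    """Format one N-Triples statement."""
    return f"{subject} {predicate} {obj} ."


# Hypothetical identifiers, for illustration only.
gene = iri("https://example.org/source/gene/G0001")          # the Source's data
expressed_in = iri("https://example.org/ontology/EX_0001")   # ontological predicate
tissue = iri("https://example.org/ontology/ANAT_0001")       # ontological object
rdfs_label = iri("http://www.w3.org/2000/01/rdf-schema#label")

statements = [
    # Object is an ontological term.
    triple(gene, expressed_in, tissue),
    # Object is just a terminal fact (a literal).
    triple(gene, rdfs_label, literal('gene "one"')),
]

for statement in statements:
    print(statement)
```

In both statements the Source's record is the subject; only the kind of object changes.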

If a Source has non-ontological data that parallels ontological data, write a separate local translation table from the Source's concepts to the ontological concept's label to be used in the code. See: https://github.com/monarch-initiative/dipper/blob/master/translationtable/ All but "GLOBAL_TERMS" are local translation tables, where the key is a term from the Source and the value is a label from an ontology. [1]
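A rough sketch of the two-step lookup this implies, assuming the global table maps an ontology label to a term identifier. The table contents, the `EX:` identifiers and the `resolve` helper are all made up for illustration and are not Dipper's actual API.

```python
# Hypothetical local table: key is the Source's own term,
# value is a label from an ontology.
LOCAL_TT = {
    "present": "expressed in",
    "high quality": "high confidence evidence",
}

# Hypothetical global table: ontology label -> term identifier (assumed shape).
GLOBAL_TT = {
    "expressed in": "EX:0000001",
    "high confidence evidence": "EX:0000002",
}


def resolve(source_term, local_tt, global_tt):
    """Map a Source term to (ontology label, identifier).

    A term missing from the local table is treated as already being a label.
    """
    label = local_tt.get(source_term, source_term)
    if label not in global_tt:
        raise KeyError(
            f"no global term for label {label!r} (from {source_term!r})"
        )
    return label, global_tt[label]


print(resolve("present", LOCAL_TT, GLOBAL_TT))
```

The code only ever mentions the readable label; the identifier is looked up in one place.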

If a Source uses an Ontological term but we want to use another ... tough luck. We will need a bridging ontology, or to convince the Source to adopt our preferred terms. In particular, it is never okay to change what a Source has said by swapping in our "equivalent" ontology term (later processes may "unify" them, but not here).

Use the Source's preferred URLs (even if they make you cringe). Dipper's output is still published as a public document and not a streamlined, internal, ontology-optimized format with only regularized IRIs for efficient processing. At issue is: not all Sources' URL identifiers are valid/hygienic RDF IRIs. If you are a Source, please have pity on us poor robots and don't use clever URLs.
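One way a Source could check its own URLs before publishing, sketched under the assumption that "hygienic" means "can be written inside `<...>` in N-Triples without escaping". The character set follows the N-Triples IRIREF grammar; the function name and example URLs are hypothetical. It only reports problems and never rewrites the URL.

```python
import re

# Characters the N-Triples IRIREF grammar does not allow unescaped:
# U+0000 through U+0020 (controls and space), and < > " { } | ^ ` \
_FORBIDDEN = re.compile(r'[\x00-\x20<>"{}|^`\\]')


def unhygienic_chars(url):
    """Return the sorted, de-duplicated characters that make url awkward as an IRI."""
    return sorted(set(_FORBIDDEN.findall(url)))


# Hypothetical URLs, for illustration only.
print(unhygienic_chars("https://example.org/gene/G0001"))
print(unhygienic_chars("https://example.org/q?id={A|B} x"))
```

An empty list means the URL can be used as-is; anything else is worth raising with the Source rather than silently "fixing".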

When choosing Ontological terms to associate with the data, first try to use terms that may already be in use (in other dipper ingests). see:
https://github.com/monarch-initiative/dipper/blob/master/translationtable/GLOBAL_TERMS.yaml

Terms found in OLS/Ontobee are strongly preferred. Terms resolving to coherent, non-circular definitions are preferred. Terms with succinct labels are preferred. Terms that resolve to a JSON blob describing Amazon S3 bucket allocations are not.

Tools to help with Modeling are a hard subject with no clear best answer.

I try to use only tools that read/write text files that live in version control, e.g. https://github.com/TomConlin/dipper/blob/bgee_redo/resources/bgee/README.md

Footnote [1]: It is inevitable that we will need to resort to namespacing labels in our translation tables, but somehow, so far, we have not had a label collision we could not deal with. So the initial "temporary" mapping without namespaces, set up years ago, still manages to get by.
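For what a label collision looks like in practice, here is a small sketch that flags any label claimed by more than one identifier. The pairs and the `label_collisions` helper are invented for illustration and do not reflect how Dipper actually checks this.

```python
from collections import defaultdict


def label_collisions(pairs):
    """Given (label, identifier) pairs, return labels bound to more than one identifier."""
    seen = defaultdict(set)
    for label, identifier in pairs:
        seen[label].add(identifier)
    return {
        label: sorted(identifiers)
        for label, identifiers in seen.items()
        if len(identifiers) > 1
    }


# Hypothetical pairs: "part of" is claimed by two different terms.
pairs = [
    ("part of", "EX:0000010"),
    ("part of", "EX:0000099"),
    ("gene", "EX:0000020"),
    ("gene", "EX:0000020"),
]

print(label_collisions(pairs))
```

Without namespaces on the labels, each collision found this way has to be resolved by hand.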

matentzn commented 4 years ago

Hey @TomConlin, Happy NY, and thanks for the input! Back from my leave now, will get back to this in early February! Thanks a ton.

monarch-initiative / dipper, issue #869