monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License
57 stars 26 forks source link

BGEE: Write a dipper importer for gene expression data #147

Closed cmungall closed 7 years ago

cmungall commented 9 years ago

bgeedb have done all the work here, no need to go to individual MODs, http://bgee.org/

@fbastian -- this would be to create turtle files from the bgee downloads (with full attribution of course). These could then be used to make monarch anatomy pages, analogous to http://beta.monarchinitiative.org/phenotype/HP:0000315 -- not sure what yr current plans are for a UI here, maybe we could join forces here?

fbastian commented 9 years ago

We don't currently have plans for an "anatomy" page. But of course we would be glad to join our efforts, and that could be the opportunity for us to develop such a page. Contact me whenever you want.

mellybelly commented 9 years ago

I would be so incredibly delighted to have you help us with anatomy pages! with ontology navigation, pictures, homologies, ooohhh yah.

cmungall commented 8 years ago

bgee provides expression values and expression calls: http://bgee.org/?page=download

We would use the already pre-processed calls

http://bgee.org/?page=download&action=expr_calls#id1

options are:

I suggest leaving stages til a second iteration. Of the first two, we would seek guidance. As a first pass we would not use this algorithmically in owlsim (that will take further consideration, and we'd probably want to include other sources). This would simply be a list of tissues on a gene page, and a way of providing a good uberon page that shows relevant genes.

Either way we'd not do our own processing. We'd take the bgeedb tsvs and make a simple OBAN type model of each row, using RO:0002206 ! expressed in

mellybelly commented 8 years ago

and include the negation in first iteration too? I think that is important

kshefchek commented 8 years ago

Do we want to filter out poor call quality? What are the RO classes for over/under expression, present/absent.

mellybelly commented 8 years ago

I would think so, at least in first pass. alternatively could put a UI flag that highlights that fact.

mbrush commented 8 years ago

For the basic expression data, seems like each oban:association would represent a ternary relationship between a gene, anatomy, and stage. The association would link the gene (subject) and anatomy (object) using RO:0002206 ! expressed in (predicate) - with the stage as a qualifier hanging from the association (analogous to G2P associations where stage is a qualifier on the association). Looks like there is some simple evidence metadata as well, (methodologies, p-values). Any objections to bringing this in on first pass?

A couple questions:

  1. Should stage be an identity criteria for an association (i.e. do we create separate associations when gene is expressed in anatomy at different stages, or create one association that is qualified with all relevant stages).
  2. Regarding negation (gene not expressed in anatomy) - what is the preferred pattern to capture this? Do we need an _expression_absentin relation? Or should we capture the negation in another way?
cmungall commented 8 years ago

Keep it simple - one association per row

negation: absent-in is problematic, as the corresponding OWL all-some expression is potentially confusing, as is the behavior in closures. See @balhoff's paper on this.

consider absence of expression in limbs. Formally g1 subClassOf not (expressed-in some limb)

The standard pattern is to populate the closure up. This would mean a query for 'appendage' would show the limb-absent gene. And a query for 'forelimb' would not.

To fix this, the closures would have to go down. But this is a major space explosion.

And, it's debatable whether it gives meaningful results. Someone querying for 'appendage' may be interested in patterns for all appendages in general. In which case the supposedly logically incorrect result is the desired one.

But we are getting ahead of ourselves. We may decide that "Over-/Under-expression across anatomy" tsv is more meaningful. In this case the simplest thing is to store direction as a qualifier, and use standard upwards closure.

fbastian commented 8 years ago

Hi, I address some of the comments here. Great to see things getting started!

Type of calls

  • Presence/Absence of expression
  • Over-/Under-expression across anatomy

Of the first two, we would seek guidance

My advice would definitely be to start with "presence" expression calls. The reasons are:

1) "Presence" calls are what we currently use in TopAnat (http://bgee.org/?page=top_anat), and I think it clearly demonstrates how powerful this information is, and how much biological information is included in these "present" calls. We don't even use "absent" calls in TopAnat (no distinction between NA and "negative association" in enrichment analyses)

2) we have built a new sorting algorithm, capable of ranking conditions where a gene is expressed, according to their relevance in term of gene function (it uses some nonparametric stats based on expression levels, and some magic to normalize and weight the ranks coming from different data types and experiments). Try the examples on our gene search page, and try also your own favorite gene: http://bgee.org/?page=gene Since we're now capable of making sense of the overwhelming amount of "present" calls produced, I would definitely display such a sorted list of terms in Monarch. The gene-condition ranks are not yet available from our download files, but it is very easy to provide. (of note, we're going to release an improved ranking very soon)

3) currently we have very few "over-/under-expression" calls. Because we specifically needs experiments studying a minimum number of conditions with a minimum number of replicates. This will improve over time of course. But I have to say, I spent lots of time playing with the "present/absent" calls, much less with the "over-/under-expression" calls, and I'm quite confident about the reliability of the "present/absent" approach.

Negative calls

and include the negation in first iteration too?

IMO, the "absent" calls are only useful when you need to distinguish between "NA" and "negative association". Do you have such a need in your analyses? (e.g., in TopAnat, we don't)

Call quality

Do we want to filter out poor call quality?

We're currently quite not happy with these call qualities, we want to improve that for next release. Our new ranking algorithm showed us that we have top conditions, very relevant, coming from poor quality call.

So it is definitely not a priority at the time being.

Call propagation

negation: absent-in is problematic, as the corresponding OWL all-some expression is potentially confusing, as is the behavior in closures.

The standard pattern is to populate the closure up. This would mean a query for 'appendage' would show the limb-absent gene. And a query for 'forelimb' would not.

Exactly, we propagate "present" calls to parent conditions, and propagate "absent" calls to child conditions (a "condition" being an anatomical structure at a developmental stage; we kind of generate a graph of "conditions", not only of anatomy). Of note, we filter out "absent" calls that are contradicted by a "present" call in a child condition.

To fix this, the closures would have to go down. But this is a major space explosion.

It is. To avoid this problem, in Bgee we restrict the propagation to conditions that were actually observed in any unpropagated annotations, to make sure we don't use all possible child conditions.

What I was planning to do is to use start/end stages from Uberon to propagate to any valid child conditions (at the moment it is difficult to determine what combinations of anatomical term and development stage are valid, e.g., "brain-carnegie stage 1" is invalid). Otherwise the propagation generates too many invalid combinations (e.g., absence of expression in nervous system at an early developmental stage, propagated to brain, while the brain is not supposed to exist yet at that developmental stage).

And, it's debatable whether it gives meaningful results.

I have to say that I'm not sure either.

Don't propagate "over-under-expression" calls

Also, please note that we do not propagate "over-under-expression" calls. We think it wouldn't make any sense. If a gene is over-expressed in a specific brain region, as compared to other brain regions, then it doesn't make any sense to say that it is over-expressed in the whole brain.

mbrush commented 8 years ago

Decisions on 2016-06-08 call that for our first pass, we will bring in just the over/under expression data for human, mouse, and fish. Absence/presence data was too abundant - would double the number of triples in our data.

Files: Over-/Under-expression across anatomy Over-/Under-expression across life stages

Approach: One association per row in the data (i.e. no aggregating stages if same gene-anatomy association is observed at >1 stage) Only pull high quality calls

Models:

1. Expression Across Anatomy:

An exemplar record, asserting that the TSPAN6 gene is over-expressed in the uterine cervix relative to other tissues, as measured during the post-juvenile adult stage:

Gene ID Gene name Anatomy ID Anatomy name Stage ID Stage name Diff. expression
ENSG00000000003 TSPAN6 UBERON:0000002 uterine cervix UBERON:0000113 post-juvenile adult stage over-expression

Triples for this record (association uses 'over/under-expressed in' relations):

:association1   OBAN:has_subject      ENSEMBL:ENSG00000000003 ! TSPAN6
:association1   OBAN:has_object       UBERON:0000002 ! uterine cervix
:association1   OBAN:has_predicate    RO:overexpressed_in 
:association1   GENO:has_qualifier    UBERON:0000113 ! post-juvenile adult stage

ENSEMBL:ENSG00000000003   RO:overexpressed_in   UBERON:0000002

2. Expression Across Stages

An exemplar record, asserting that Gnai3 is over-expressed during the 2 week stage relative to other stages, in the prefrontal cortex:

Gene ID Gene name Anatomy ID Anatomy name Stage ID Stage name Diff. expression
ENSMUSG00000000001 Gnai3 UBERON:0000451 prefrontal cortex MmusDv:0000046 2 weeks over-expression

Triples for this record (association uses 'over/under-expressed during' relations):

:association1  OBAN:has_subject     ENSEMBL:ENSMUSG00000000001 ! Gnai3 
:association1  OBAN:has_object      MmusDv:0000046 ! 2 weeks
:association1  OBAN:has_predicate   RO:overexpressed_during 
:association1  GENO:has_qualifier   UBERON:0000451 ! prefrontal cortex

ENSEMBL:ENSG00000000003    RO:overexpressed_during   UBERON:0000002

Note that the model requires FOUR NEW RELATIONS in RO: The proposal uses different relationships to describe differential expression across anatomy vs differential expression across stages. In the former case, the association describes a gene expressed in anatomical entity, as qualified by the stage during which this was observed. In the latter case, the association describes a gene expressed during a stage, as qualified by the anatomical entity within which this expression was observed. So modeling will require four new relations :

Stage and Anatomy Qualifiers: I proposed the simplest possible model for qualifiers in this pass - simply hanging the stage or anatomy term directly from the association. More complex/nuanced models of qualifying stage or anatomy may be implemented to address use cases from other data sources (e.g. ZFIN where stage qualifiers are a span between a start and end stage).

Thoughts from @cmungall @mellybelly, others before we request new RO terms and pull the data according to these models?

fbastian commented 8 years ago

Absence/presence data was too abundant - would double the number of triples in our data.

Did you use the simple file (with unpropagated data) or the complete file (with propagated data)? I suspect that if you extract only tissues and not life stages, from the simple file, the number of triples is reasonable.

The proposed relations seem reasonable to me. But I don't find it very clear that anatomy = comparing different organs at a same developmental stage, and development = comparing a same organ at different developmental stages. Only a matter of definition and of qualifier used, I guess.

mbrush commented 8 years ago

@cmungall can answer the first question, as he performed the initial size estimates.

Regarding

I don't find it very clear that anatomy = comparing different organs at a same developmental stage, and development = comparing a same organ at different developmental stages.

This reflects our understanding of the data contained in the two diff expression files, and what inc/dec expression is relative to in each. Specifically, that the expression across anatomy file contains assertions that gene X shows inc/dec expression in anatomy Y, relative to other anatomies at the same developmental stage. And that the expression across stages file contains assertions that gene X shows inc/dec expression during stage Y, relative to other stages for the same anatomy.

Does this make sense? The modeling aims to reflect this understanding of the data - so please correct us if we misinterpret.

fbastian commented 8 years ago

Does this make sense?

It does! I was just saying that it is hard to understand looking at the triples. But there's nothing wrong.

cmungall commented 8 years ago

Based on @fbastian's advice, and a brief explore of the the nice gene queries (hadn't seen this before), I'm really excited about getting started using only presence + ranking. These seem to be giving biologically meaningful results.

I know we haven't discussed the modeling of ranked associations in monarch yet, so this may seem like scope creep, but I think this is a good place to start. It should be fairly straightforward - just an integer or float property, and we add as a sort field in the golr yaml.

Number of associations: if we count only unique gene-tissue present pairs there are 3.5m. This should be a drop in the ocean for us, but due to certain things gumming up the works we should pause here and perhaps punt til the next release. We have been discussing optimizations that should help here. One workaround may be if we can meaningfully apply a cutoff for the ranked lists.

fbastian commented 8 years ago

For tissue specific genes, I never look behind the top 10 terms, as for a google search :p So, keeping only something like the 20 top tissues should make sense. For ubiquitously expressed genes, it's different, the ranking doesn't make much sens I think. Maybe you want to represent the expression of such genes differently?

FYI, we have a very good score for estimating tissue specificity. This is something I hope to be able to release for mid-July. So you should be able to distinguish tissue-specific from ubiquitously expressed genes.

I can provide TSV files with the gene-tissue-stage-score information. I need to generate them this week for another application anyway.

cmungall commented 8 years ago

On Wed, Jun 15, 2016 at 3:36 AM, fbastian notifications@github.com wrote:

For tissue specific genes, I never look behind the top 10 terms, as for a google search :p So, keeping only something like the 20 top tissues should make sense. For ubiquitously expressed genes, it's different, the ranking doesn't make much sens I think. Maybe you want to represent the expression of such genes differently?

In RO we have 'expressed ubiquitously in'

So we would use this in conjunction with 'whole organism'

FYI, we have a very good score for estimating tissue specificity. This is something I hope to be able to release for mid-July. So you should be able to distinguish tissue-specific from ubiquitously expressed genes.

Great!

I can provide TSV files with the gene-tissue-stage-score information. I need to generate them this week for another application anyway.

That would be perfect.

fbastian commented 8 years ago

Here are the files: ftp://ftp.bgee.org/bgee/current/download/ranks/, see README. Don't hesitate to ask if anything is unclear, of if you need additional data.

We have actually released an updated version of our ranking, giving even better results! And also updated our gene page to display the rank scores, it's very informative for detecting shifts in gene expression (and thus in gene function). Some examples:

mbrush commented 8 years ago

@cmungall @mellybelly @jmcmurry The DIPpers discussed a simple first pass ingestion of this source, using the new files @fbastian provided above that contain expression rankings. The data is described in the README file here - ftp://ftp.bgee.org/bgee/current/download/ranks/README.md. Can we get feedback on the following proposal:

For a first pass we would pull only 'presence' data, and ignore stage for now. So we would target the files from the 'anat_entity' grouping, which collapse stage qualifiers and provide one row per unique gene-anatomy pair. Within this grouping, we will use the 'all_data' files, that include associations/ranks based on all technologies (affy, rna seq, est, in situ).

A few questions:

  1. What species should we target initially (human, mouse, fish, . . . )?
  2. Previous comments in this ticket discussed handling ubiquitously expressed genes differently. Thoughts here? Frederic hoped he would have scores to identify these by mid-july, but not sure if these are ready. If they are, we need to discuss how modeling and display of these would differ.
  3. Do we want to define a cutoff for gene-anatomy associations we pull? e.g. based on the rank score itself, or pulling only the top x number of associated tissues for each gene?

As for the modeling, we propose creating OBAN associations where the subject = gene, the predicate = RO:expressed_in, and the object = anatomy. The only other piece of data is the rank, which could be captured as a quantifier hanging from the oban:association - but not sure what relation or pattern to use here? We could define a datatype property with a float range to directly link the association to its rank score. Or it might be better to define an object property to link the association to an owl individual representing the score, so that we can define the type of quantifier it represents. And then use a generic has_value datatype property to link the score individual to its numerical/float value.

mbrush commented 7 years ago

Documenting specifics of first pass ingest here, to be executed in January 2017

Targeted Data:

Data Downloads:

Files to ingest in first pass:

mbrush commented 7 years ago

Questions (@cmungall, @mellybelly) :

1. Confirm decisions to target human, mouse, fish in first pass, and only pull top 20 calls per gene

2. How handle ubiquitously expressed genes? Frederic hoped he would have scores to identify these last summer, but not sure if these are ready - can you comment @fbastian? If we can identify these, we may want consider how to limit the number of returned associations here (as rankings are really as relevant for ubiquitous genes). And we can use the RO:0002291 ! ubiquitouly_expressed_in as our property in the association.

3. Modeling Ranks: Bgee provides numerical 'ranks' for their gene-anatomy associations, that seem to be based on factors such as tissue specificity, expression level relative to other tissues, number of 'evidence lines' supporting expression in the tissue, and location of annotated functions/biological processes for a given gene. See blog post here: https://bgeedb.wordpress.com/2016/03/11/new-gene-page-new-bgee-interface/.
Simplest modeling approach is to define a datatype property (e.g. has_quantifier, or more specifically has_rank, or even more specifically has_bgee_rank) that captures the rank from the association as a literal.

  :bgee_association1    has_quantifier    "245"

But this tells us nothing of what this rank means. We could provide some context/help here through various approaches (e.g. add an informative dc:description to the association, use the 'bgee_rank' specific datatype property, create a bgee-rank data type and instantiate in the data) - these are documented in ticket #406.

cmungall commented 7 years ago
  1. Yes
  2. Let's do this on a second round
  3. I answered on #406 but don't we want the rank score here and not the rank?
mbrush commented 7 years ago

don't we want the rank score here and not the rank?

Yes we want to capture the score that Bgee calls 'rank' - not the rank ordering of the association. Sorry this want clear in my description - the term 'rank' for this calculated quantitative measure is a bit misleading. See my further comments in #406.

kshefchek commented 7 years ago

Added with https://github.com/monarch-initiative/dipper/pull/407