monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Document design pattern for ingesting asserted inferences #324

Open cmungall opened 8 years ago

cmungall commented 8 years ago

Our general approach is to model with high granularity/precision, and to infer more simplistic assertions.

The canonical example is ingesting:

  1. gene <-> genotype
  2. genotype <-> phenotype

And then inferring

  3. gene <-> phenotype

The inference is typically implemented in Cypher, and the inferred edge is materialized in golr (a separate issue is whether we should have more inference in our graph db, or whether we should cache inferences there, or have some other way to reuse the logic of the cypher queries).
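In Python terms (purely illustrative; the real inference is implemented in Cypher, and the identifiers below are made up), the propagation amounts to a join over the two ingested edge sets:

```python
# Hypothetical sketch of the gene -> phenotype inference: compose the two
# asserted edge sets on their shared genotype. Identifiers are illustrative,
# not dipper's actual data model.

def infer_gene_phenotype(gene_genotype, genotype_phenotype):
    """Materialize gene->phenotype edges by joining on the genotype."""
    inferred = set()
    for gene, genotype in gene_genotype:
        for gt, phenotype in genotype_phenotype:
            if gt == genotype:
                inferred.add((gene, phenotype))
    return inferred

gene_genotype = {("ZFIN:gene1", "ZFIN:geno1")}
genotype_phenotype = {("ZFIN:geno1", "ZP:0000001")}
inferred = infer_gene_phenotype(gene_genotype, genotype_phenotype)
# inferred == {('ZFIN:gene1', 'ZP:0000001')}
```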

In some cases the inference rules may be complex, or there may be different rules for different relationship types (in the above <-> stands for a generic has-phenotype, but we can imagine more specific relations in GENO).

There are cases where the following two things hold: (i) our inference rules are not complete or correct, or our ingested normalized graph is not correct in some way; (ii) the source provides their own "inferences", often in the form of a different TSV (e.g. 'clean fish').

Here I suggest the following strategy:

  1. If an asserted inference is provided, ingest it in addition to our precise modeling
  2. use this as a gold standard to test our own inference rules (or alternately, to test where the source-asserted inferences lack sufficient precision or discriminatory ability; e.g. the double mutant use case)
  3. Override our own inferences where source inferences are provided
  4. Once we are confident of our own inferences, jettison the source-asserted inferences
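Step 2 could be sketched as a set comparison, with hypothetical edge tuples standing in for the real associations:

```python
# Hypothetical sketch of step 2: score our inferred edges against the
# source-asserted "clean" set treated as a gold standard. Edge tuples
# are illustrative placeholders.

def compare_to_gold(ours, gold):
    """Return (agreed, spurious, missed) relative to the source set."""
    agreed = ours & gold
    spurious = ours - gold   # propagations the source would not assert
    missed = gold - ours     # source assertions our rules fail to derive
    return agreed, spurious, missed

ours = {("g1", "p1"), ("g2", "p2")}
gold = {("g1", "p1"), ("g3", "p3")}
agreed, spurious, missed = compare_to_gold(ours, gold)
# agreed == {('g1', 'p1')}, spurious == {('g2', 'p2')}, missed == {('g3', 'p3')}
```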

The challenging part is 3. If we can get source-asserted inferences for all sources [on a per datatype basis] then we can do this. However, 3 gets awkward if we want rules like "for ZFIN, use source-asserted; for FOO, use our inferences". It should not be so hard to implement this as clauses in the Cypher, but this feels a little unsatisfactory.
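One way to picture that per-source dispatch (a hypothetical Python sketch; the real logic would live in Cypher clauses):

```python
# Hypothetical sketch of step 3: prefer source-asserted inferences where a
# source provides them, and fall back to our own otherwise. The set of
# sources and the edge tuples are illustrative assumptions.

ASSERTED_SOURCES = {"ZFIN", "MGI"}  # sources providing 'clean' files

def select_inferences(source, asserted, inferred):
    """Use the source's own clean associations when available."""
    if source in ASSERTED_SOURCES and asserted:
        return asserted
    return inferred

# e.g. ZFIN edges come from the clean file, FOO edges from our own rules
zfin_edges = select_inferences("ZFIN", {("g1", "p1")}, {("g1", "p2")})
foo_edges = select_inferences("FOO", set(), {("g2", "p2")})
```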

Comment @mbrush ?

mbrush commented 8 years ago

I think we will have to deal with an 'unsatisfactory' solution in the near term, esp. with a looming July deadline to clean up as much of the spurious inferences as possible. After exploring the ZFIN and MGI 'clean' gene-phenotype associations, I think these do a much better job than we could do in the near term by writing cypher filters to prevent incorrect propagations. So I concur with Chris that likely the best approach for now is to use the source 'clean' inferences where possible to overwrite our own, and do the best we can with cypher-based filters for the rest.

One issue here is that the clean files only cover gene-phenotype inferences, but in Monarch we also infer and display variant-phenotype associations. So if we use the source 'clean' files for genes, we still need to write our own cypher solution for propagation to variants that will not be totally aligned with the propagation to genes from sources. But this is fine for now, esp. as OWLSim/phenogrid operates at the gene-phenotype level, and it is more important that these are as correct and complete as possible.

Given the data we have to work with, I think the best cypher-filtering solution we could implement for July is to: (1) filter genotypes with more than one affected native gene, and of the genotypes that remain, (2) filter those with one affected gene and one or more transgenic insertions. Subsequent phenotype propagation will be performed only on the genotypes that pass these filters. This is a more conservative approach than is taken by MGI and ZFIN in generating their inferences, as they have additional rules to allow propagation if the transgenic insertion is phenotypically 'inert' (e.g. it expresses only a marker or reporter gene, or expresses only a Cre recombinase). Without such rules, genotypes that should be permitted will be caught by our crude filters, so many correct propagations will not be made in our data - but we don't have the flags in our ingested data to implement these rules right now, and it is better to be conservative in the inferences we make.
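A rough sketch of those two filters (hypothetical field names; not dipper's actual genotype model):

```python
# Hypothetical sketch of the two conservative filters described above:
# (1) drop genotypes with more than one affected native gene, then
# (2) drop those with one affected gene plus any transgenic insertion.
# Field names are illustrative assumptions.

def passes_filters(genotype):
    # filter 1: more than one affected native gene
    if len(genotype["native_genes"]) > 1:
        return False
    # filter 2: one affected gene and one or more transgenic insertions
    if len(genotype["native_genes"]) == 1 and genotype["transgenic_insertions"]:
        return False
    return True

genotypes = [
    {"id": "geno1", "native_genes": ["g1"], "transgenic_insertions": []},
    {"id": "geno2", "native_genes": ["g1", "g2"], "transgenic_insertions": []},
    {"id": "geno3", "native_genes": ["g1"], "transgenic_insertions": ["tg1"]},
]
propagatable = [g["id"] for g in genotypes if passes_filters(g)]
# propagatable == ['geno1']
```

Note that this version also excludes phenotypically 'inert' insertions (reporters, Cre drivers), which the source rules would allow through.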

In summary (essentially agreeing with Chris here), let's:

  1. Ingest and use the MGI and ZFIN 'clean' gene-phenotype associations (and any others we can find) to overwrite our own gene-phenotype inferences for these sources. As Chris mentioned here, we will need awkward rules in our cypher such as "for ZFIN, use source-asserted; for FOO, use our inferences".
  2. Write cypher filters as best we can for variant-phenotype inferences for ZFIN and MGI, and for gene-phenotype and variant-phenotype associations for all other sources. - and perform pheno propagation only on genotypes/features that pass these filters.
  3. Longer term, we can use the rules from source clean files to inform and test inference rules of our own that can ultimately obviate the need to use inferences from sources.

@kshefchek does the short term plan here sound reasonable and feasible, given what cypher and our processing capability can do?

cmungall commented 8 years ago

This is a good summary.

Short term plan: One possible variant of the above is to make source-specific clean/pre-asserted files ourselves for sources that don't provide them, and to do this at dipper time. This makes the cypher processing simpler (and we can work on the cypher queries without disrupting things). It's slightly unsatisfactory in that we end up burying the logic in the python rather than having it declarative. But it may be the simplest thing (or it may turn out to be harder, especially if the processing requires joining of >1 file). I'm not familiar enough with the code to make the call, but just wanted to put the option out there.
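As a sketch of that dipper-time option (the file layout and column names here are assumptions, not dipper's actual output format):

```python
# Hypothetical sketch: emit our own pre-asserted gene->phenotype file at
# dipper time for a source that lacks a 'clean' file, so downstream Cypher
# can treat every source uniformly. Column names are illustrative.

import csv

def write_clean_g2p(inferred_edges, path):
    """Write gene->phenotype pairs as a two-column TSV."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(["gene_id", "phenotype_id"])
        for gene, phenotype in sorted(inferred_edges):
            writer.writerow([gene, phenotype])

write_clean_g2p({("FOO:gene1", "HP:0000001")}, "foo_clean_g2p.tsv")
```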

cmungall commented 8 years ago

Not sure if this is the best ticket to add this, but an example of MGI derivation rules: http://f1000research.com/posters/5-742

cmungall commented 7 years ago

Where are we on this? This came up on the AGR call today. @pnrobinson @selewis

kshefchek commented 7 years ago

There's a lot in this ticket. To answer the original question:

  1. If an asserted inference is provided, ingest it in addition to our precise modeling

This is our current approach; it is implemented for ZFIN and MGI.

  2. use this as a gold standard to test our own inference rules (or alternately, to test where the source-asserted inferences lack sufficient precision or discriminatory ability; e.g. the double mutant use case)

This is a good idea and will be helpful when we adjust our inference approach. We already know we have issues with transgenic elements and double mutants (although fingers crossed this fixes the transgene issue).

  3. Override our own inferences where source inferences are provided

This is our approach for MGI and ZFIN.

  4. Once we are confident of our own inferences, jettison the source-asserted inferences

Depends on 2.

Short term plan: One possible variant of the above is to make source-specific clean/pre-asserted files ourselves for sources that don't provide them, and to do this at dipper time...it may turn out to be harder, especially if the processing requires joining of >1 file

Also not a bad idea; we would have to look at each source to assess the feasibility. I don't think we do much cross-source joining for G2P (other than gene->disease->phenotype).