cmungall opened this issue 8 years ago
I think we will have to deal with an 'unsatisfactory' solution in the near term, especially with a looming July deadline to clean up as many of the spurious inferences as possible. After exploring the ZFIN and MGI 'clean' gene-phenotype associations, I think these do a much better job than we could do in the near term by writing cypher filters to prevent incorrect propagations. So I concur with Chris that the best approach for now is likely to use the source 'clean' inferences where possible to overwrite our own, and to do the best we can with cypher-based filters for the rest.
One issue here is that the clean files only cover gene-phenotype inferences, but in Monarch we also infer and display variant-phenotype associations. So if we use the source 'clean' files for genes, we still need to write our own cypher solution for propagation to variants, which will not be totally aligned with the source-derived propagation to genes. But this is fine for now, especially as OWLSim/phenogrid operates at the gene-phenotype level, and it is more important that those associations are as correct and complete as possible.
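To make that concrete, here is a minimal sketch of what such a variant-level propagation could look like in Cypher; the labels and relationship types (Variant, HAS_VARIANT_PART, HAS_PHENOTYPE) are illustrative placeholders, not our actual GENO-based schema:

```cypher
// Hypothetical sketch only: labels and relationship types are illustrative,
// not our actual schema. Walks variant <- genotype -> phenotype and
// returns the inferred variant-phenotype pairs.
MATCH (v:Variant)<-[:HAS_VARIANT_PART]-(gt:Genotype)-[:HAS_PHENOTYPE]->(p:Phenotype)
RETURN DISTINCT v, p
```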
Given the data we have to work with, I think the best cypher-filtering solution we could implement for July is to: (1) filter out genotypes with more than one affected native gene, and of the genotypes that remain, (2) filter out those with one affected gene and one or more transgenic insertions. Subsequent phenotype propagation will be performed only on the genotypes that pass these filters. This is a more conservative approach than MGI and ZFIN take in generating their inferences, as they have additional rules that allow propagation when the transgenic insertion is phenotypically 'inert' (e.g. it expresses only a marker or reporter gene, or expresses only a Cre recombinase). Lacking the rules to let such genotypes through our crude filters, we will fail to make many correct propagations in our data - but we don't have the flags in our ingested data to implement these rules right now, and it is better to be conservative in the inferences we make.
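Roughly, filters (1) and (2) could combine into a single Cypher query along these lines; the labels and relationship types (HAS_AFFECTED_GENE, HAS_VARIANT_PART, TransgenicInsertion) are placeholders for whatever our actual schema uses:

```cypher
// Sketch of the two-stage filter; schema names are assumptions.
// (1) keep only genotypes affecting exactly one native gene
MATCH (gt:Genotype)-[:HAS_AFFECTED_GENE]->(gene:Gene)
WITH gt, count(DISTINCT gene) AS nGenes
WHERE nGenes = 1
// (2) of those, drop genotypes carrying any transgenic insertion
OPTIONAL MATCH (gt)-[:HAS_VARIANT_PART]->(tg:TransgenicInsertion)
WITH gt, count(tg) AS nTransgenes
WHERE nTransgenes = 0
RETURN gt  // genotypes that pass both filters; propagate phenotypes only from these
```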
In summary (essentially agreeing with Chris here), let's:
@kshefchek does the short term plan here sound reasonable and feasible, given what cypher and our processing capability can do?
This is a good summary.
Short term plan: One possible variant of the above is to make source-specific clean/pre-asserted files ourselves for sources that don't provide them, and to do this at dipper time. This makes the cypher processing simpler (and we can work on the cypher queries without disrupting things). It's slightly unsatisfactory in that we end up burying the logic in the python rather than having it declarative. But it may be the simplest thing (or it may turn out to be harder, especially if the processing requires joining of >1 file). I'm not familiar enough with the code to make the call, but just wanted to put the option out there.
Not sure if this is the best ticket to add this to, but here is an example of MGI derivation rules: http://f1000research.com/posters/5-742
Where are we on this? This came up on the AGR call today. @pnrobinson @selewis
There's a lot in this ticket. To answer the original question:
- If an asserted inference is provided, ingest it in addition to our precise modeling
This is our current approach; it is implemented for ZFIN and MGI.
- use this as a gold standard to test our own inference rules (or alternately, to test where the source-asserted inferences lack sufficient precision or discriminatory ability; e.g. the double mutant use case)
This is a good idea and will be helpful when we adjust our inference approach. We already know we have issues with transgenic elements and double mutants (although fingers crossed this fixes the transgene issue). A sketch of what such a comparison query could look like follows this list.
- Override our own inferences where source inferences are provided
This is our approach for MGI and ZFIN.
- Once we are confident of our own inferences, jettison the source-asserted inferences
Depends on the outcome of point 2 (the gold-standard testing) above.
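On the gold-standard point, the comparison itself could be a simple diff query. A hypothetical sketch, where the inferred and asserted edge properties are assumptions rather than our real schema:

```cypher
// Hypothetical diff: gene-phenotype edges we inferred that the source
// did not assert (candidates for spurious propagation). The 'inferred'
// and 'asserted' edge properties are assumptions, not our real schema.
MATCH (g:Gene)-[r:HAS_PHENOTYPE]->(p:Phenotype)
WHERE r.inferred = true
  AND NOT (g)-[:HAS_PHENOTYPE {asserted: true}]->(p)
RETURN g, p
```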
Short term plan: One possible variant of the above is to make source-specific clean/pre-asserted files ourselves for sources that don't provide them, and to do this at dipper time...it may turn out to be harder, especially if the processing requires joining of >1 file
Also not a bad idea; we would have to look at each source to assess feasibility. I don't think we do much cross-source joining for G2P (other than gene->disease->phenotype).
Our general approach is to model with high granularity/precision, and to infer simpler assertions.
The canonical example is ingesting:
gene <-> genotype
genotype <-> phenotype
And then inferring
gene <-> phenotype
The inference is typically implemented in Cypher, and the inferred edge is materialized in golr (a separate issue is whether we should have more inference in our graph db, or whether we should cache inferences there, or have some other way to reuse the logic of the cypher queries).
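As a simplified illustration (not our actual schema; HAS_AFFECTED_GENE and HAS_PHENOTYPE are placeholders for the specific GENO relations), the canonical inference could be expressed as:

```cypher
// Simplified sketch of the canonical inference; relation names are
// placeholders. Walks gene <- genotype -> phenotype and yields the
// inferred gene-phenotype edge (materialized into golr in practice).
MATCH (gene:Gene)<-[:HAS_AFFECTED_GENE]-(gt:Genotype)-[:HAS_PHENOTYPE]->(p:Phenotype)
RETURN DISTINCT gene, p AS inferredPhenotype
```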
In some cases the inference rules may be complex, or there may be different rules for different relationship types (in the above, <-> stands for a generic has-phenotype, but we can imagine more specific relations in GENO).

There are cases where the following two things hold: (i) our inference rules are not complete or correct, or our ingested normalized graph is not correct in some way; (ii) the source provides their own "inferences", often in the form of a different TSV (e.g. 'clean fish').
Here I suggest the following strategy:

1. If an asserted inference is provided, ingest it in addition to our precise modeling
2. Use this as a gold standard to test our own inference rules (or alternately, to test where the source-asserted inferences lack sufficient precision or discriminatory ability; e.g. the double mutant use case)
3. Override our own inferences where source inferences are provided
4. Once we are confident of our own inferences, jettison the source-asserted inferences
The challenging part is 3. If we can get source-asserted inferences for all sources [on a per datatype basis] then we can do this. However, 3 gets awkward if we want rules like "for ZFIN, use source-asserted; for FOO, use our inferences". It should not be so hard to implement this as clauses in the Cypher, but this feels a little unsatisfactory.
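For illustration only, the per-source clauses could look something like the sketch below; the isDefinedBy, asserted, and inferred edge properties are assumptions about how provenance is decorated, not our actual schema:

```cypher
// Hypothetical per-source selection: for sources with clean files (e.g.
// ZFIN, MGI) keep only source-asserted edges; for all other sources keep
// our own inferred edges. Property names are assumptions.
MATCH (g:Gene)-[r:HAS_PHENOTYPE]->(p:Phenotype)
WHERE (r.isDefinedBy IN ['ZFIN', 'MGI'] AND r.asserted = true)
   OR (NOT r.isDefinedBy IN ['ZFIN', 'MGI'] AND r.inferred = true)
RETURN g, r, p
```

The hardcoded source list is exactly the unsatisfactory bit: it has to be maintained by hand as sources gain or lose clean files.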
Comments, @mbrush?