Open cmungall opened 8 years ago
As a patch we could check if the gene and genotype are from the same taxon here: https://github.com/monarch-initiative/configs/blob/master/SciGraph/golr/queries/gene-phenotype.yaml#L18
MATCH (subject)-[:RO:0002162]->(taxon)<-[:RO:0002162]-(genotype)
EDIT: this doesn't work since some genotypes are not mapped to their taxon.
Is this issue primarily with GENO, or APP? Or both?
It's a modelling issue: we need to separate out transgenic elements in the model so they are not included in the results of the cypher query. The cypher hack above didn't work because some genotypes are not mapped to a taxon.
I think this happens for FlyBase specifically because it doesn’t yet conform to the GENO standard for modeling transgenes. Once it does, and once we update cypher queries to use GENO has_variant_part instead of the more generic BFO has_part, we will be able to more precisely filter out things to which we don’t want to propagate phenotypes (such as expressed transgenic features from genes of other species).
@kshefchek, from the flybase test files it looks like genotypes are mapped to their taxon - so your hack above should work as a fix for issues arising from flybase, right? For other sources, the taxon sometimes hangs from the background, or the allele. So we would have to use taxon linked to these entities in the check.
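The taxon check discussed above, with a fallback for sources where the taxon hangs from the background or the allele, could be sketched roughly as follows. This is only an illustration over an in-memory edge map, not the SciGraph query itself: the edge labels `has_background` and `has_allele`, the helper names, and all node IDs other than the two NCBITaxon IDs are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (subject, predicate) -> list of objects; a stand-in for the real graph.
Edges = Dict[Tuple[str, str], List[str]]

IN_TAXON = "RO:0002162"  # in taxon, as in the cypher pattern above
# Hypothetical edge labels for entities the taxon may hang from instead.
FALLBACK_EDGES = ("has_background", "has_allele")


def taxon_of(node: str, edges: Edges) -> Optional[str]:
    """Return the node's taxon, looking one hop through fallback edges."""
    direct = edges.get((node, IN_TAXON))
    if direct:
        return direct[0]
    for label in FALLBACK_EDGES:
        for part in edges.get((node, label), []):
            found = edges.get((part, IN_TAXON))
            if found:
                return found[0]
    return None


def same_taxon(gene: str, genotype: str, edges: Edges) -> Optional[bool]:
    """True/False when both taxa are known, None when either is unmapped."""
    gene_taxon = taxon_of(gene, edges)
    genotype_taxon = taxon_of(genotype, edges)
    if gene_taxon is None or genotype_taxon is None:
        return None
    return gene_taxon == genotype_taxon


# Hypothetical example data (7227 = D. melanogaster, 4932 = S. cerevisiae).
EXAMPLE_EDGES: Edges = {
    ("GENE:fly", IN_TAXON): ["NCBITaxon:7227"],
    ("GENE:yeast", IN_TAXON): ["NCBITaxon:4932"],
    ("GENOTYPE:direct", IN_TAXON): ["NCBITaxon:7227"],
    ("GENOTYPE:via_background", "has_background"): ["BACKGROUND:1"],
    ("BACKGROUND:1", IN_TAXON): ["NCBITaxon:7227"],
}
```

Returning `None` for unmapped genotypes keeps the "no taxon" case distinct from a real mismatch, so a caller can decide whether to keep or drop those rows rather than silently losing them.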
https://monarchinitiative.org/phenotype/HP%3A0000708
This makes it look as though there are only three species, when I suspect the real issue is that species are missing. Below is a list of the subject IDs corresponding to the missing taxa.
:genid-nodeid-:MPD-strain116-m-genotype :genid-nodeid-:MPD-strain1227-f-genotype :genid-nodeid-:MPD-strain1233-m-genotype :genid-nodeid-:MPD-strain14-m-genotype :genid-nodeid-:MPD-strain2-f-genotype :genid-nodeid-:MPD-strain2271-m-genotype :genid-nodeid-:MPD-strain233-m-genotype :genid-nodeid-:MPD-strain24-m-genotype :genid-nodeid-:MPD-strain2614-f-genotype :genid-nodeid-:MPD-strain2624-f-genotype :genid-nodeid-:MPD-strain2633-m-genotype :genid-nodeid-:MPD-strain2718-m-genotype :genid-nodeid-:MPD-strain2750-f-genotype :genid-nodeid-:MPD-strain2754-m-genotype :genid-nodeid-:MPD-strain2798-f-genotype :genid-nodeid-:MPD-strain2801-m-genotype :genid-nodeid-:MPD-strain2809-m-genotype :genid-nodeid-:MPD-strain2903-m-genotype :genid-nodeid-:MPD-strain2909-m-genotype :genid-nodeid-:MPD-strain2916-f-genotype :genid-nodeid-:MPD-strain2927-f-genotype :genid-nodeid-:MPD-strain2950-m-genotype :genid-nodeid-:MPD-strain2987-m-genotype :genid-nodeid-:MPD-strain3-m-genotype :genid-nodeid-:MPD-strain4-f-genotype :genid-nodeid-:MPD-strain40-f-genotype :genid-nodeid-:MPD-strain41-f-genotype :genid-nodeid-:MPD-strain424-f-genotype :genid-nodeid-:MPD-strain47-m-genotype :genid-nodeid-:MPD-strain50-m-genotype :genid-nodeid-:MPD-strain53-m-genotype :genid-nodeid-:MPD-strain569-f-genotype :genid-nodeid-:MPD-strain58-m-genotype :genid-nodeid-:MPD-strain6-m-genotype :genid-nodeid-:MPD-strain70-f-genotype :genid-nodeid-:MPD-strain735-m-genotype :genid-nodeid-:MPD-strain776-f-genotype :genid-nodeid-:MPD-strain8-m-genotype dbSNPIndividual:13588 MGI:2653759 MGI:2679652 MGI:3041455 MGI:3512564 MGI:3625034 MGI:3625957 MGI:3628769 MGI:3695037 MGI:3717731 MGI:3763742 MGI:3844860 MGI:4356104 MGI:5449937 MGI:5449941 MGI:5449978 MGI:5449989 MGI:5450223 MGI:5450475 MGI:5450489 MGI:5450510 MGI:5450514 MGI:5450542 MGI:5450546 MGI:5450548 MGI:5450562 MGI:5450571 MGI:5450589 MGI:5450619 MGI:5450665 MGI:5504657 MGI:5756799 MONARCH:00b0b49dc612442a830f12386ecb7326 MONARCH:0e238cbb19add5a084fceef350caadcb 
MONARCH:2f5105d89745e22a7d342111a9390837 MONARCH:36d3bcfd3c800f249fab2a306a9820e6 MONARCH:390f92af5b93983881088a3745981d0e MONARCH:41b9ec8104c5ca6480542c3b4a2bcec1 MONARCH:4298d00d727a3462bf7c550729554068 MONARCH:49b2cb961589c6b148a4804666ea7dcd MONARCH:4ca74870487bab0e2d2a47462dd4df89 MONARCH:4ed4388f9d59088539b4faca8812701b MONARCH:56723b55e2fd4ef66d0e177c5bf15b26 MONARCH:607d2b9739fc459f856ebc3d6c31e368 MONARCH:607d36b3918322209ab05e0c6bd03e57 MONARCH:60f776d58a837be977a6800984658b17 MONARCH:6187c002e45374bc1aab5f9926fb8f43 MONARCH:66c3cbb97063dbd89f8b53090fa4fb5d MONARCH:67d0f47c0686a19fa9de84795781f220 MONARCH:69f9d4849e153e0951d2237d495682d7 MONARCH:6c984c28305af23c3fe03a3797b27e9a MONARCH:70427458f7290a83cd77a28538ccd6fe MONARCH:73c12d767e0fb894975660801032362c MONARCH:756c6b082adf625ff854eca5d0488625 MONARCH:8ccf1bafd3e80b320757c02c8371ecdc MONARCH:948e3bb48a1f7ea49dde283292251eff MONARCH:950b86895c62f72a8897d0f144a79cc5 MONARCH:a2ee56c347bac5df8c55feb8ac8b3ac0 MONARCH:a88f9449c6ad9d9b859479496ebc98a0 MONARCH:b03e8c16e92fecbec091b49bb67e969e MONARCH:b2353d1f4475f755ec30431ee9565f0a MONARCH:b3bfa63d0abdad7066ffc30ba901c67b MONARCH:bab5a51885a87778209c28f490ba0af9 MONARCH:bafb62d194106c814104e8e971a4d242 MONARCH:c35e8261d2445ce086d8183996c014e1 MONARCH:c8853f740589e5e62f1f353946742647 MONARCH:cd795c42e89f3530486ec079391a34fb MONARCH:cda4bf6ed6dceb89a81aa91349bdf446 MONARCH:d567dffc36c9f3dcfe6cf213842a79cd MONARCH:d6bfd82ccd9c2c9280986cc6d2fcc6c4 MONARCH:d870005f1894ce75a6b77f048b1464a5 MONARCH:dc357533a30937a568f06488be35b486 MONARCH:dd9e3b01485a2f69e0352de648f8ac60 MONARCH:e2958473199d66a11533b1aef08594a3 MONARCH:e9bb2f5d81301b72d4474e081e2bcaad MONARCH:ed0ca9a55cc284dcc1755479f3895d09 MONARCH:f42181f09aee0e4b3daaedd0fdd1a610 MONARCH:f44aba2e970c9a3ddfa730cb6f18e2f3 MONARCH:fda5840cfee238bbb0491e76af92af1b MONARCH:fe240b652e5d292ae32cb304c60b0603
Related to the original issue from @cmungall, we should decide on rules for when to propagate phenotypes to genes represented in a transgene expressed in some model organism. A simple approach here is to follow @kshefchek's proposal and say that if the transgene derives from a different species than its host, we don't propagate the phenotype to this gene. But I think we would still want some way to record the fact that, for example, expression of a yeast gene in a fly system results in a behavioral phenotype in the fly. With the blunt approach above we lose this. So we should think about how to represent and display this for users.
For the yeast gene in this example, its link to a fly behavioral phenotype could be qualified by the fact that this association applies in the context of expression in a fly. One view of this is that the 'environment' linked to the gene-phenotype association is 'in a transgenic fly system'.
A related issue is that the example gene in the original ticket is Scer\GAL4, whose expression is used as a 'tool' to drive downstream expression of the true gene under investigation, and which should therefore be filtered from inheriting phenotypes on this fact alone. This capability will come as we define rules and modeling to make our phenotype propagation more reliable.
Another example to test when fixing this is that human alleles from OMIM are missing the propagation of taxon. I don't know whether we are systematically not representing alleles properly, not representing OMIM properly, or both.
still an issue
https://monarchinitiative.org/phenotype/FBbt:00000053PHENOTYPE#genes
this really complicates things like dynamic IC calculation. At the very least we need a way of filtering these in the solr schema
This leaks out into biolink calls, tripped up @realmarcin
just catching up here, interesting issue actually
would another route to make this more functional be to report back or use as part of the query the feature id namespace? that could perhaps be a separate API call, to return the possible namespaces that would come back for a query. though this doesn't solve taxon mappings per se...
my interim hack is to parse the namespace label from the feature id and then blunt filter post-API call
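A rough sketch of that kind of post-call filter, assuming the API results have already been reduced to a list of feature IDs (the function names here are made up for illustration and are not part of any Monarch or biolink client):

```python
from typing import Iterable, List, Set


def curie_prefix(identifier: str) -> str:
    """Return the namespace before the first ':' or '' if there is none."""
    prefix, sep, _ = identifier.partition(":")
    return prefix if sep else ""


def filter_by_namespace(ids: Iterable[str], allowed: Set[str]) -> List[str]:
    """Keep only IDs whose namespace prefix is in the allowed set."""
    return [i for i in ids if curie_prefix(i) in allowed]


# IDs of the kinds that appear in this thread.
ids = [
    "MGI:2653759",
    "FlyBase:FBgn0014445",
    "dbSNPIndividual:13588",
    ":genid-nodeid-:MPD-strain116-m-genotype",
]
mouse_only = filter_by_namespace(ids, {"MGI"})
```

Note that this only filters on the ID namespace. A gene such as FlyBase:FBgn0014445 keeps its FlyBase prefix even though it is a yeast gene, so a prefix filter alone would not catch it.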
@mbrush
@kshefchek, from the flybase test files it looks like genotypes are mapped to their taxon - so your hack above should work as a fix for issues arising from flybase, right? For other sources, the taxon sometimes hangs from the background, or the allele. So we would have to use taxon linked to these entities in the check.
Now that we get G2P associations directly from MGI and ZFIN, can we move forward with this adjustment? Or would this affect wormbase associations? I can check if you're unsure.
I ditched the taxon genotype check proposed above (https://github.com/monarch-initiative/monarch-app/issues/1310#issuecomment-234082310): it added too much time to the loader and would have required updates to at least IMPC. As an alternative, I removed the relation GENO:0000639 ! sequence_derives_from, as the code seems to indicate that we use this relation for transgenes, as opposed to GENO:0000418 ! has_affected_locus and all its subclasses.
Here are the counts on production and beta.
This has fixed many of the transgene issues; most importantly, we no longer have yeast transgene to fly phenotype associations. The downside is that we have lost our Drosophila G2P associations for species other than melanogaster. There are also some persistent transgene issues (a few viral) from FlyBase. Curious what everyone's thoughts are on moving forward with this.
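For anyone following along, the effect of dropping GENO:0000639 from the relations we traverse can be illustrated with a toy example. This is a simplified sketch, not the loader code: the triples and node IDs are hypothetical, and subclasses of GENO:0000418 are not modeled.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (variant or allele, relation, gene)

HAS_AFFECTED_LOCUS = "GENO:0000418"
SEQUENCE_DERIVES_FROM = "GENO:0000639"


def propagation_targets(
    triples: Iterable[Triple],
    excluded: FrozenSet[str] = frozenset({SEQUENCE_DERIVES_FROM}),
) -> Set[str]:
    """Genes reachable through any relation that is not excluded."""
    return {gene for _, relation, gene in triples if relation not in excluded}


# Hypothetical data: one endogenous allele, one transgene-derived link.
EXAMPLE_TRIPLES = [
    ("ALLELE:1", HAS_AFFECTED_LOCUS, "GENE:fly"),
    ("ALLELE:2", SEQUENCE_DERIVES_FROM, "GENE:yeast"),
]
```

With the default exclusion only `GENE:fly` receives phenotypes. Any association that is modeled solely through sequence_derives_from is dropped as well, which may be one reason legitimate associations disappear alongside the transgene ones.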
wanted to bump this ticket one more time before making the changes (although they are easy to undo). @cmungall any thoughts?
https://monarchinitiative.org/gene/FlyBase:FBgn0014445
FlyBase assigns this to be a Scer (yeast) gene, which is of course correct
But this means that we end up with 6 "yeast genes" with nervous system phenotypes:
https://monarchinitiative.org/phenotype/UBERON:0001020PHENOTYPE
(use filters to see)