monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Refactor to exclude assertions that are redundant with data from existing source. #289

Open mbrush opened 8 years ago

mbrush commented 8 years ago

From https://github.com/monarch-initiative/monarch-app/issues/1224: "take care in dipper to exclude inclusion of derived assertions that are redundant with a source we already bring in".

The issue reported there was with taxon labels, but the same problem will apply elsewhere (e.g. gene labels). Perhaps we can start by enumerating all possible manifestations of this issue, i.e. the places where we want a single source of truth (SSOT):

  1. Labels for entities for which we have an authoritative label source
    • genes (NCBIgene.ttl file)
    • taxa (NCBITaxon ontology)
  2. Related to this is the issue of pulling genomic coordinates for genes/variants from different sources. For example, zfin.ttl contains genomic coordinates for Danio rerio genes. It seems like gene coordinate information could be housed in a single file, rather than distributed across various source ttl files. And perhaps there is a single authoritative source for gene coordinate information that we could use for all taxa, rather than relying on things like MODs, which may contain conflicting or outdated information? @cmungall @mellybelly?

cmungall commented 8 years ago

Sounds good

Note that dipper deals with the pre-clique-merged view. So NCBIGene is canonical for anything with an NCBIGene ID, and ZFIN is canonical for ZFIN IDs, even when talking about the same gene. Further downstream we may merge these, and at that point we decide which ID is the representative one, but that is a different question from who is authoritative for the original ID. Just pointing that out so we don't trip ourselves up.
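
To make the rule above concrete, here is a minimal sketch of a per-source filter, assuming authority is decided purely by the ID prefix of the subject. Nothing here is dipper's actual API: the `AUTHORITATIVE_LABEL_SOURCE` table, the source names (`"zfin"`, `"ncbitaxon"`, `"ncbigene"`), the `drop_redundant_labels` helper, and the `ZFIN:EXAMPLE-GENE-1` identifier are all hypothetical, and triples are modelled as plain tuples of CURIE strings rather than a real graph.

```python
RDFS_LABEL = "rdfs:label"

# Hypothetical table: which ingest is treated as authoritative for labels
# of IDs with a given CURIE prefix (pre-clique-merge, so keyed on the ID only).
AUTHORITATIVE_LABEL_SOURCE = {
    "NCBIGene": "ncbigene",
    "NCBITaxon": "ncbitaxon",
    "ZFIN": "zfin",
}


def curie_prefix(curie):
    """Return the part of a CURIE before the first colon."""
    return curie.split(":", 1)[0]


def drop_redundant_labels(triples, source):
    """Drop label triples whose subject is owned by a different source.

    A label is kept if the current source is authoritative for the
    subject's prefix, or if no authoritative source is registered for it.
    Non-label triples are always kept.
    """
    kept = []
    for subj, pred, obj in triples:
        owner = AUTHORITATIVE_LABEL_SOURCE.get(curie_prefix(subj))
        if pred == RDFS_LABEL and owner is not None and owner != source:
            continue  # another ingest already supplies this label
        kept.append((subj, pred, obj))
    return kept


# Made-up triples as a ZFIN ingest might emit them.
zfin_triples = [
    ("ZFIN:EXAMPLE-GENE-1", RDFS_LABEL, "example gene"),
    ("NCBITaxon:7955", RDFS_LABEL, "Danio rerio"),
    ("ZFIN:EXAMPLE-GENE-1", "RO:0002162", "NCBITaxon:7955"),  # in taxon
]

filtered = drop_redundant_labels(zfin_triples, source="zfin")
```

With these inputs the ZFIN gene label and the in-taxon assertion survive, while the taxon label is dropped because the NCBITaxon ontology already provides it. The same prefix-keyed table could presumably be extended to other assertion types, such as genomic coordinates, once an authoritative source for them is agreed on.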