monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

@base_ landing page #380

Open TomConlin opened 8 years ago

TomConlin commented 8 years ago

In our weekly dipper calls I have been agitating for a landing page for the vast number of unresolvable IRIs we generate (where vast is defined as > 10,000,000 RDF statements). So far we haven't come up with a reason to keep things as they are.

The highest-value source of these unresolvable IRIs is the @base underscore nodes, which we seem to use for terms we wish had proper URIs.

In CURIE form they are expressed as :_foo, which expands (or should expand) to:
https://monarchinitiative.org/_foo
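To make the expansion concrete, here is a minimal sketch of CURIE expansion in Python. It is an illustration only, not dipper's actual code: the `PREFIXES` map and the `expand_curie` helper are hypothetical names, and the only entry taken from this issue is the empty (base) prefix.

```python
# Hypothetical prefix map; only the empty (base) prefix comes from this issue.
PREFIXES = {"": "https://monarchinitiative.org/"}


def expand_curie(curie, prefixes=PREFIXES):
    """Expand 'prefix:local' into a full IRI using the prefix map."""
    prefix, sep, local = curie.partition(":")
    if not sep:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix!r}")
    # The underscore is just part of the local name; nothing in RDF requires it.
    return prefixes[prefix] + local


print(expand_curie(":_foo"))  # https://monarchinitiative.org/_foo
```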

There is no technical RDF reason we place an underscore after the colon (and sometimes we don't).

When the monarch web server has no idea what "_foo" is (always?) it returns 404s.

Ideally these would go somewhere more useful, but if they cannot, I propose we have a destination page that acknowledges their existence.

There is a range of options between offering the same generic static page for everything
and dynamic pages which include associated label, type, parent/children etc.

Something along the lines of:

    https://monarchinitiative.org/iri/foo

                    foo

    As far as we know this term "foo" does not have its own
    persistent resolvable URL, so we are using this one for now.
    If you are aware of another resource page for "foo",
    please let us know at a@b.c

    This "foo" has the label "bar" and type of "fred".
    "foo" is linked to from:
        "something(s)"
    and links to:
        "otherthing(s)"

In this scenario we leave the @base prefix ":" alone as it has other legitimate uses.

Choose a specific @prefix, e.g. IRI: https://monarchinitiative.org/iri/,
which we use when we need a quick and dirty IRI for a wayward term.

Make a landing page to capture anything under /iri/ that tries to be as helpful as possible.

I have not checked, but I expect almost all of these @base_ terms will be leaf nodes in our graph where the only reason for them not to be literals is the desire to attach properties.

Being able to include the ingest source and date would be awesome.
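As a rough sketch of the dynamic end of that range, the function below builds the text of a catch-all page from whatever is known about a term. Everything here is an assumption made for illustration: the `NodeInfo` fields, the `render_landing_page` name, and the idea that label, type, neighbors, source and date arrive already looked up. Only the /iri/ prefix comes from the proposal above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

IRI_BASE = "https://monarchinitiative.org/iri/"  # prefix proposed in this issue


@dataclass
class NodeInfo:
    """Facts we might hold about a term. All field names are hypothetical."""
    local_id: str
    label: Optional[str] = None
    rdf_type: Optional[str] = None
    linked_from: List[str] = field(default_factory=list)
    links_to: List[str] = field(default_factory=list)
    source: Optional[str] = None       # ingest source, if recorded
    ingest_date: Optional[str] = None  # ingest date, if recorded


def render_landing_page(node: NodeInfo) -> str:
    """Return plain text for the catch-all page; sparse nodes still render."""
    out = [
        IRI_BASE + node.local_id,
        "",
        f'As far as we know the term "{node.local_id}" does not have its own',
        "persistent resolvable URL, so we are using this one for now.",
    ]
    if node.label:
        out.append(f'Label: "{node.label}"')
    if node.rdf_type:
        out.append(f'Type: "{node.rdf_type}"')
    if node.linked_from:
        out.append("Linked to from: " + ", ".join(node.linked_from))
    if node.links_to:
        out.append("Links to: " + ", ".join(node.links_to))
    if node.source or node.ingest_date:
        out.append(f"Ingested from {node.source or 'unknown source'}"
                   f" on {node.ingest_date or 'unknown date'}")
    return "\n".join(out)


print(render_landing_page(NodeInfo("foo", label="bar", rdf_type="fred")))
```

With nothing but an identifier the function degrades to the generic static text, so the same handler would cover both ends of the range described above.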

cmungall commented 8 years ago

Why can't we simply make these resolvable?

If dipper extracts it, it will be in scigraph-data. If it's in there, we can query and find its neighbors, categories, and labels, and display a generic node landing page (no need to worry too much about making this look great; as you say, the goal is to not 404).

TomConlin commented 8 years ago

That is pretty much exactly what I am requesting. I am not particularly attached to how it happens, the details above are just a suggestion to give something to chew on.

TomConlin commented 8 years ago

For background, here are the most frequently used base IRIs:

head base_iri.dist
  50395 <https://monarchinitiative.org/_fbcvtermkey92310>
  19104 <https://monarchinitiative.org/_cattle-linkagechrX-UN-UN-Region>
  19104 <https://monarchinitiative.org/_bosTau7chrX-UN-UN-Region>
  14750 <https://monarchinitiative.org/_fbcvtermkey60494>
   3098 <https://monarchinitiative.org/_cattle-linkagechr4-UN-UN-Region>
   2165 <https://monarchinitiative.org/_10090chr1-UN-UN-Region>
   2095 <https://monarchinitiative.org/_10090chr7-UN-UN-Region>
   2006 <https://monarchinitiative.org/_10090chr2-UN-UN-Region>
   1999 <https://monarchinitiative.org/_10090chr4-UN-UN-Region>
   1928 <https://monarchinitiative.org/_10090chr11-UN-UN-Region>

kshefchek commented 8 years ago

Per @mbrush these should be blank nodes, so it may be best to skolemize these based on @TomConlin's skolemization pattern instead of having them resolve.

cmungall commented 8 years ago

+1 to skolemization (that need not be in opposition to having them resolve. We can provide a page for any node in the graph)
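For readers unfamiliar with the idea, a minimal sketch of skolemization follows: a would-be blank node gets a stable, opaque IRI derived from its natural key, which a server could later choose to resolve. The specific pattern used in dipper is not described in this thread, so the hashing scheme, the `b` prefix, and the `skolemize` name are assumptions; the `/.well-known/genid/` path is the convention suggested by RDF 1.1 Concepts for Skolem IRIs.

```python
import hashlib

# Assumed base: the Monarch domain plus the RDF 1.1 suggested genid path.
GENID_BASE = "https://monarchinitiative.org/.well-known/genid/"


def skolemize(natural_key: str) -> str:
    """Map a natural key to a stable, opaque Skolem IRI (hypothetical scheme)."""
    digest = hashlib.sha1(natural_key.encode("utf-8")).hexdigest()
    # Leading letter keeps the local part a valid blank node label if it is
    # ever written back out as _:b...; truncation length is arbitrary.
    return GENID_BASE + "b" + digest[:20]


# The same key always yields the same IRI, so separate ingest runs agree.
print(skolemize("fbcvtermkey92310"))
```

Because the IRI is deterministic, a landing page keyed on it stays stable across rebuilds, which is what lets skolemization and resolvability coexist.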

In many cases we may want to simply look at our modeling and decide if we can't do without natural keys. For example, when modeling disease to gene, does our some-variant pattern buy us much that a direct link would not?

mbrush commented 8 years ago

FYI, there is a bit of history on the _fbcvtermkey IRIs in tickets #248 and #284.

mbrush commented 8 years ago

A few thoughts. First, I am not totally clear on the history here, but my understanding is that a decision was made way back when that if a source didn't have a resolvable identifier for some entity we want to represent as a node in our data, we would make it a bnode.

I believe the stated rationale for this bnode approach was partly political (that we don't want to mint IRIs in our namespace for things that don't "belong to us"), and partly pragmatic (that we didn't want to be responsible for managing such IRIs and their resolution on the semantic web). But I think this was a decision that was meant to be revisited, and it sounds like we are now thinking (and I agree) that we should come up with some skolemization approach for minting IRIs for some or all bnodes in our data, so that they can resolve with some degree of information for our users.

Second, with respect to the provenance of all the :_foo IRIs in the data, I think these are temporary IRIs that failed to be converted to bnodes in the production data (i.e. I don't think Nicole intended this to be a permanent skolemization scheme for such 'orphan' nodes).

Third, I agree with the point about "some variant of ____" bnodes in the data. I am not a fan of this pattern and don't think it buys anything right now. If I recall, one rationale for this pattern was that we hoped to ultimately be able to identify these anon variants from the ClinVar data, and replace the anon variant bnode with a ClinVar IRI. But this seems like wishful thinking. So I think we can directly link the gene to the disease using RO:0002200 ! has_phenotype for now. As Chris points out in the ticket here, this seems to be what sources like OMIM are directly asserting.

kshefchek commented 8 years ago

Part of the issue here was the differentiation of a direct vs. inferred association, which has been removed entirely from our UI for the time being.

For human data, I don't mind this pattern for disease/phenotype associations. It allows us to eventually go back and add important information already contained in OMIM, although sometimes in free text, such as zygosity, inheritance, and specific mutation. At this point, though, I believe we would need to create genotype bnodes instead of variants.

However, there are some cases where this pattern could be reconsidered. For example, WormBase annotates models to genes (e.g. http://www.wormbase.org/species/c_elegans/gene/WBGene00001049#07-9gc-3), and we create a blank intermediate variant node to use as the "worm model" for the disease. This is partially discussed in #330.

TomConlin commented 7 years ago

The FlyBase function _makeInternalIdentifier() was largely responsible. I have it emitting legit opaque blank node identifiers now, but we should be creating labels and types if possible.