monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Align GO JSON-LD context with dipper curie-map #582

Open cmungall opened 6 years ago

cmungall commented 6 years ago

TODO: report on clashes

cmungall commented 6 years ago
TomConlin commented 6 years ago

1) I applaud giving the primary database URLs their due.

2) dipper warns on non 1:1 maps https://github.com/monarch-initiative/dipper/blob/master/dipper/utils/CurieUtil.py#L21 GO might want to as well

3) mind httpS where possible
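To make the "non 1:1" point concrete, here is a minimal sketch of such a check. It is not dipper's actual CurieUtil code; the function name `find_non_bijective` and the `EX1`/`EX2` entries are hypothetical and exist only to trigger the warning.

```python
from collections import defaultdict


def find_non_bijective(curie_map):
    """Return {uri_base: [prefixes]} for every URI base claimed by more than one prefix."""
    by_uri = defaultdict(list)
    for prefix, uri_base in curie_map.items():
        by_uri[uri_base].append(prefix)
    return {uri: sorted(prefixes) for uri, prefixes in by_uri.items() if len(prefixes) > 1}


# Illustrative map: the REACT entry is quoted elsewhere in this thread,
# the EX1/EX2 entries are invented for the example.
example_map = {
    "REACT": "http://www.reactome.org/PathwayBrowser/#/",
    "EX1": "http://example.org/shared/",
    "EX2": "http://example.org/shared/",
}

for uri, prefixes in find_non_bijective(example_map).items():
    print(f"WARNING: {uri} is mapped from multiple prefixes: {prefixes}")
```

A Python dict already guarantees one URI per prefix, so the only direction that needs checking here is several prefixes sharing one URI base.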

cmungall commented 6 years ago

I applaud giving the primary database URLs their due

The problem here is that these are often ad hoc and subject to change. In GO we are going for identifiers.org, which is more stable and predictable.

dipper warns on non 1:1 maps https://github.com/monarch-initiative/dipper/blob/master/dipper/utils/CurieUtil.py#L21 GO might want to as well

We should look at merging dipper CurieUtil and https://github.com/prefixcommons/prefixcommons-py

mind httpS where possible

I have gotten assurances from identifiers.org that they will support http in perpetuity. The same is of course true for OBO. Stability is key here.

cmungall commented 6 years ago

Note the list above only includes cases where the prefix matches or the URL matches.

It isn't reporting the fact that GO has

- database: Reactome
  name: Reactome - a curated knowledgebase of biological pathways
  synonyms:
    - REACTOME
    - REAC
  rdf_uri_prefix: http://identifiers.org/reactome/
  generic_urls:
    - http://www.reactome.org/
  entity_types:
    - type_name: entity
      type_id: BET:0000000
      id_syntax: R-[A-Z]{3}-[0-9]+(-[0-9]+){0,1}(\.[0-9]+){0,1}
      url_syntax: http://www.reactome.org/content/detail/[example_id]
      example_id: Reactome:R-HSA-109582
      example_url: http://www.reactome.org/content/detail/R-HSA-109582

whereas dipper has

'REACT': 'http://www.reactome.org/PathwayBrowser/#/'

It looks like we have just recommended REACT to the translator folks, ah well. I'm not sure where this abbreviation came from.

But the URL is a good example of a bad semantic web PURL: http://www.reactome.org/PathwayBrowser/#/
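The Reactome case above can be sketched as a comparison between the two registries. This is a rough illustration, not the script that produced the earlier list; `compare_maps` is a hypothetical helper, and the GO entry is reduced to the three fields the comparison needs.

```python
def compare_maps(go_entries, dipper_map):
    """List dipper prefixes that disagree with, or are invisible to, the GO registry."""
    # Index GO by every name an entry is known under (database name plus synonyms).
    go_by_prefix = {}
    for entry in go_entries:
        for name in [entry["database"], *entry.get("synonyms", [])]:
            go_by_prefix[name.upper()] = entry["rdf_uri_prefix"]
    go_uris = set(go_by_prefix.values())

    findings = []
    for prefix, uri in dipper_map.items():
        go_uri = go_by_prefix.get(prefix.upper())
        if go_uri is not None and go_uri != uri:
            findings.append(("same prefix, different URI", prefix, go_uri, uri))
        elif go_uri is None and uri not in go_uris:
            findings.append(("no match on prefix or URI", prefix, None, uri))
    return findings


# Values copied from the two snippets quoted in this comment.
go_entries = [
    {
        "database": "Reactome",
        "synonyms": ["REACTOME", "REAC"],
        "rdf_uri_prefix": "http://identifiers.org/reactome/",
    }
]
dipper_map = {"REACT": "http://www.reactome.org/PathwayBrowser/#/"}

for finding in compare_maps(go_entries, dipper_map):
    print(finding)
```

With these inputs REACT lands in the "no match on prefix or URI" bucket, which is why a report keyed only on matching prefixes or matching URLs never surfaces it.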

jmcmurry commented 6 years ago

Please note that short-form CURIE resolution is now supported in identifiers.org, for example http://identifiers.org/MGI:3764834. My preference would be to use these simple URIs throughout our stack, except for OBO PURLs and other sources that have additional semantic sugar. I've made specific recommendations here.
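A minimal sketch of that preference, assuming a small hand-picked set of OBO prefixes (a real list would come from the OBO registry) and a hypothetical helper name `curie_to_uri`:

```python
# Assumption: a tiny illustrative subset of OBO prefixes, not an authoritative list.
OBO_PREFIXES = {"GO", "HP", "MP"}


def curie_to_uri(curie):
    """Expand a CURIE to an OBO PURL for OBO prefixes, else to a short-form identifiers.org URI."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"not a CURIE: {curie!r}")
    if prefix in OBO_PREFIXES:
        # OBO PURLs replace the colon with an underscore.
        return f"http://purl.obolibrary.org/obo/{prefix}_{local_id}"
    # Short form: the whole CURIE, colon included, follows the host.
    return f"http://identifiers.org/{curie}"


print(curie_to_uri("MGI:3764834"))  # http://identifiers.org/MGI:3764834
print(curie_to_uri("GO:0008150"))   # http://purl.obolibrary.org/obo/GO_0008150
```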

nathandunn commented 6 years ago

@jmcmurry (sorry to interject) I was talking with @TomConlin about this. I think that it's going to be problematic even if it goes to the canonical source. I think you're going to run into problems if you squat on the base-level CURIE. I would propose something like this (such that it's always scoped):

http://identifiers.org/monarch/MGI:3764834

This way AGR, MONARCH, MGI, etc. can each choose where their external links resolve, and it reduces any possibility of data collision along the way. Doing it this way, you don't really have to consult anyone outside Monarch, whereas doing it at the root level will require a higher level of coordination for establishing and changing them.

nathandunn commented 6 years ago

But I really do like the identifiers.org approach overall. It's a nice approach to the ever moving / dying web. I'm not sure if there is a better solution, I would just scope any CURIE in a way that you can own it long-term.

nathandunn commented 6 years ago

Just to clarify my point. It might be fine to use the short-form if, for example, MGI is committed to supporting it internally, as they do the rest of their IDs, but even then, I think you are better off coming up with a scoping model. The reasons are:

1 - prevent potential collisions (can you register an entire CURIE?)

2 - allow an organization that doesn't own the IDs to quickly update changed external IDs (for example, if a downstream organization is using your IDs in a load, so they won't pick up your changed links)

3 - allows for individual organizations to change where a pointed ID goes to, as there are several entities that house the same IDs. e.g., external links on http://identifiers.org/monarch/MGI:107476 points to http://www.informatics.jax.org/marker/MGI:107476 , but http://identifiers.org/myorg/MGI:107476 points to https://www.alliancegenome.org/gene/MGI:107476

4 - at a minimum I don't think we'll be able to grab CURIEs for organizations we don't actively own (I imagine orgs would be furious if an organization other than their own controlled their CURIE). It wouldn't be a bad thing to encourage the MODs (for example) to register these with identifiers.org as @jmcmurry suggested, though.

This sort of resolves to a poor man's DNS in some ways, but I think it simplifies things quite a bit.
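The scoping idea in point 3 could be sketched roughly as a lookup keyed on scope and prefix. This is only an illustration of the proposal, not something identifiers.org is known to offer; `resolve_scoped` and the table are hypothetical, and the two targets are the URLs given in point 3.

```python
BASE = "http://identifiers.org/"

# (scope, prefix) -> URL template. "myorg" is the placeholder scope from point 3.
SCOPED_TEMPLATES = {
    ("monarch", "MGI"): "http://www.informatics.jax.org/marker/{curie}",
    ("myorg", "MGI"): "https://www.alliancegenome.org/gene/{curie}",
}


def resolve_scoped(uri):
    """Map a scoped URI such as BASE + 'monarch/MGI:107476' to the target its scope chose."""
    if not uri.startswith(BASE):
        raise ValueError(f"unexpected base: {uri!r}")
    scope, sep, curie = uri[len(BASE):].partition("/")
    if not sep or ":" not in curie:
        raise ValueError(f"not a scoped CURIE URI: {uri!r}")
    prefix = curie.partition(":")[0]
    try:
        template = SCOPED_TEMPLATES[(scope, prefix)]
    except KeyError:
        raise KeyError(f"no target registered for scope {scope!r}, prefix {prefix!r}")
    return template.format(curie=curie)


print(resolve_scoped("http://identifiers.org/monarch/MGI:107476"))
print(resolve_scoped("http://identifiers.org/myorg/MGI:107476"))
```

Each organization only edits its own rows in the table, which is the "poor man's DNS" property: the same CURIE can point to different places without any coordination at the root.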

@cmungall / @jmcmurry / @TomConlin I would be happy to chat about this. A lot of orgs are going to face this. I think that identifiers.org is the right way to do this for many reasons, but I think there needs to be a bit of nuance on the implementation.

cmungall commented 6 years ago

@nathandunn I think you're starting from some different assumptions. The primary use case here is joining triples, not resolution. Hence the URIs must be identical, and organization-specific URIs run contrary to this.

The choice is between the standard identifiers.org URLs and the newer ones that embed the CURIE directly in the URL. The latter is preferable for many reasons, but there are concerns over the effect of colons in various semweb specs.
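A small sketch of why the join use case needs identical URIs. The standard form is built from the `rdf_uri_prefix` quoted earlier in the thread; the CURIE-embedding form is an assumption made by analogy with the MGI short-form example, and the predicates are placeholders under example.org.

```python
def shared_subjects(triples_a, triples_b):
    """Return the subject URIs that occur in both triple sets (exact string match)."""
    return {s for s, _, _ in triples_a} & {s for s, _, _ in triples_b}


STANDARD = "http://identifiers.org/reactome/R-HSA-109582"
EMBEDDED = "http://identifiers.org/Reactome:R-HSA-109582"  # assumed form, by analogy

source_a = [(STANDARD, "http://example.org/label", "a pathway")]
source_b = [(EMBEDDED, "http://example.org/hasParticipant", "http://example.org/gene1")]
source_c = [(STANDARD, "http://example.org/hasParticipant", "http://example.org/gene1")]

# Same entity, two URI styles: nothing joins.
print(shared_subjects(source_a, source_b))
# Same URI style on both sides: the records line up.
print(shared_subjects(source_a, source_c))
```

Both URIs may well resolve to the same page, but a triple store compares strings, so every source has to settle on one form.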

nathandunn commented 6 years ago

@cmungall Thanks for the clarification, and sorry for any confusion. Yes, the CURIE is a no-brainer.

jmcmurry commented 6 years ago

No prob Nathan, agreed we would never ever squat on a curie for our own 3rd party purposes. It would break trust of both users and providers. The new identifiers.org syntax is such that a provider can be specified OR omitted as the user desires; however, where the user omits provider they're redirected to whichever of the trusted authoritative original sources and their close collaborators have the best "up time" record that month. There are some issues related to that, but it is what it is.
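The "provider specified OR omitted" behaviour might be pictured like this. It is a toy model, not the identifiers.org implementation: the provider codes `mgi` and `agr`, the uptime figures, and the `resolve` function are all invented for illustration, and the two target URLs are the ones mentioned earlier in the thread.

```python
# Hypothetical provider table; uptime values are made-up numbers.
PROVIDERS = {
    "MGI": {
        "mgi": {"template": "http://www.informatics.jax.org/marker/{curie}", "uptime": 0.99},
        "agr": {"template": "https://www.alliancegenome.org/gene/{curie}", "uptime": 0.97},
    }
}


def resolve(curie, provider=None):
    """Resolve a CURIE, falling back to the provider with the best uptime when none is named."""
    prefix = curie.partition(":")[0]
    candidates = PROVIDERS[prefix]
    if provider is None:
        # User omitted the provider: pick the best recent uptime.
        provider = max(candidates, key=lambda name: candidates[name]["uptime"])
    return candidates[provider]["template"].format(curie=curie)


print(resolve("MGI:107476"))                  # provider omitted
print(resolve("MGI:107476", provider="agr"))  # provider chosen by the user
```

The identifier itself stays the same in both calls; only the landing page changes, so the choice of provider does not affect joins on the CURIE.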