monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

HGNC update base URL in curie map #845

Closed TomConlin closed 4 years ago

TomConlin commented 4 years ago

per request

Date: Tue, Oct 22, 2019 at 12:17 PM
Subject: links from HGNC and issues with links to HGNC
To: <info@monarchinitiative.org>
Cc: hgnc@genenames.org <hgnc@genenames.org>

Also dropping an unused test file that was hanging around, and making several tests less wrong without straying into right.

kshefchek commented 4 years ago

we'll also need dipper and monarch ontology synchrony (via sed), see https://github.com/monarch-initiative/dipper/blob/master/Jenkinsfile#L187

TomConlin commented 4 years ago

The hack could be dropped and reimplemented via a legitimate internal_curie_map.yaml. There is also a question of numbers. Seems questionable to take the time to update millions of strings to match what, dozens? hundreds? in the ontologies.

kshefchek commented 4 years ago

The hack could be dropped and reimplemented via a legitimate internal_curie_map.yaml

sounds interesting, can you elaborate?

Seems questionable to take the time to update millions of strings

It's an unfortunate hack, but it only takes a minute or so; the whole ontology file is ~700 MB.

TomConlin commented 4 years ago

If the URLs are not going to be exposed (by us), they can be anything we want. There are very good reasons to want domesticated identifiers, and this PR is a perfect example. HGNC wants a particular new URL exposed (because they have switched to Drupal, but I am not judging). This new URL has an exclamation point embedded in it, so when we mention it on a shell command line (such as, say, this Jenkins sed call) the shell interprets it as initiating a history event (which it is not). SciGraph may misinterpret '!' as a recursive query indicator, etc., so it needs escaping, and if one needs to be checked they all do. In a data processing environment, not needing to always check everything is faster, so having an identifiers.org or other PURL is a definite win.
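To illustrate the '!' problem, here is a minimal sketch of building a sed call with the URL safely quoted. The URLs, the file name, and the `sed_command` helper are all hypothetical, not taken from the Jenkinsfile. Bash does not perform history expansion inside single quotes, which is what `shlex.quote` emits.

```python
import shlex

# Hypothetical URLs for illustration only; not the real curie map entries.
OLD_BASE = "https://example.org/old/hgnc/"
NEW_BASE = "https://example.org/report/#!/hgnc_id/"


def sed_command(old: str, new: str, path: str) -> str:
    """Build a sed substitution command that can be pasted into a shell.

    '|' is used as the sed delimiter so the slashes in URLs need no escaping.
    Caveat: regex metacharacters in `old` (such as '.') are NOT escaped here,
    so this is only a sketch, not a general-purpose tool.
    """
    script = f"s|{old}|{new}|g"
    # shlex.quote wraps anything containing unsafe characters ('!', '#', '|')
    # in single quotes, so the shell passes it to sed untouched.
    return " ".join(["sed", "-i", shlex.quote(script), shlex.quote(path)])


if __name__ == "__main__":
    # "monarch.owl" is a placeholder file name.
    print(sed_command(OLD_BASE, NEW_BASE, "monarch.owl"))
```

This only addresses the shell. Anything downstream that gives '!' its own meaning would still need its own escaping, which is the argument for tame internal identifiers.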

That means having two curie maps: our existing dipper curie_map, with the wild native URLs the public needs to see to have confidence in our data, and an internal_curie_map with tame, unsurprising URIs containing no (well, fewer) gotcha characters that are already spoken for in downstream processing tools.

Switching between the two is still work, but in an ideal world swapping out the @prefix section at the top of a turtle file is at least less hacky.
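A rough sketch of what swapping the prefix section might look like, assuming each declaration sits on its own line in the form `@prefix NAME: <IRI> .`. The `swap_prefixes` function, the prefix map, and the URLs are made up for illustration; none of this is existing dipper code.

```python
import re

# Matches a single-line Turtle prefix declaration, e.g.
#   @prefix HGNC: <https://example.org/hgnc/> .
# Assumption: declarations are not split across lines.
PREFIX_LINE = re.compile(r"^@prefix\s+([A-Za-z][\w.-]*)?:\s+<([^>]*)>\s*\.\s*$")


def swap_prefixes(turtle_text: str, replacement_map: dict) -> str:
    """Return turtle_text with the namespace IRI replaced for every prefix
    found in replacement_map; all other lines are passed through unchanged."""
    out = []
    for line in turtle_text.splitlines():
        match = PREFIX_LINE.match(line)
        if match:
            prefix = match.group(1) or ""  # "" is the default (empty) prefix
            if prefix in replacement_map:
                line = f"@prefix {prefix}: <{replacement_map[prefix]}> ."
        out.append(line)
    result = "\n".join(out)
    # splitlines() drops a trailing newline, so restore it if there was one.
    if turtle_text.endswith("\n"):
        result += "\n"
    return result


if __name__ == "__main__":
    # Hypothetical "wild" public map entry and "tame" internal replacement.
    wild = (
        "@prefix HGNC: <https://example.org/report/#!/hgnc_id/> .\n"
        "@prefix OMIM: <https://example.org/omim/> .\n"
        "\n"
        "HGNC:5 a OMIM:thing .\n"
    )
    print(swap_prefixes(wild, {"HGNC": "https://example.org/tame/hgnc/"}))
```

Only the header lines change, so the cost scales with the number of prefixes rather than the millions of identifier strings in the body. That only holds if the body uses curies rather than fully expanded IRIs.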

TomConlin commented 4 years ago

Also refactored the hack to only take ~20 seconds.