monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

IMPC #933

Closed TomConlin closed 4 years ago

TomConlin commented 4 years ago

tl;dr: the evidence/provenance URIs for IMPC now track what IMPC says they should be.

This deals with the disconnect between the "stable" id supplied in their download files and the integer "key" they use in the URI of the page for the concept that stable id identifies. That is, they have quasi-stable URI identifiers (they don't 404, they redirect).

Since these keys are not included in the downloads, they were screen-scraped off the IMPC website (just like back in the '90s). However, the process produced duplicates and misses. In the past, when there were just a few, I had been known to just web search to get the new pages needed (not proud), noting that you had to use Google because their internal search did not know about their stable ids...
With a major release there were enough that I got tired and fixed it.

They do have the keys associated with stable ids in the database dump. Unfortunately it is a 1.2G download that expands to a 20G instance, from which I extract 1/1000th of 1% to use as a mapping, which is wasteful. Maybe I can get some Solr query to cut that down a bit.
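A minimal sketch of reducing an export to just the stable-id-to-key mapping while flagging duplicates. It assumes, hypothetically, that the relevant table has been exported as tab-separated `stable_id<TAB>key` rows; the actual dump layout is not described here, and `load_key_map` and the sample ids are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def load_key_map(lines):
    """Build {stable_id: key} from tab-separated 'stable_id<TAB>key' rows.

    Also returns the stable ids seen with more than one key, so conflicts
    are reported rather than silently overwritten.
    """
    seen = defaultdict(set)
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 2 or not row[1].isdigit():
            continue  # skip the header and malformed rows
        seen[row[0]].add(int(row[1]))
    key_map = {sid: next(iter(keys)) for sid, keys in seen.items() if len(keys) == 1}
    conflicts = {sid: sorted(keys) for sid, keys in seen.items() if len(keys) > 1}
    return key_map, conflicts


# Made-up rows for illustration only; these are not real IMPC identifiers.
sample = io.StringIO(
    "stable_id\tkey\n"
    "EXAMPLE_PIPE_001\t7\n"
    "EXAMPLE_PROC_001\t42\n"
    "EXAMPLE_PROC_001\t43\n"
)
key_map, conflicts = load_key_map(sample)
```

Reading row by row keeps memory use proportional to the mapping rather than to the export, which matters when only a tiny fraction of the data is needed.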

They also have new URI patterns all around, which seem more straightforward to me (see curie_map).

This means we can get rid of all the 5k or so hard-coded URIs in the local translation table and never have to add any more.
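As a rough sketch of the idea, a URI can be computed from the keys on demand instead of being looked up in a hand-maintained table. The template, the `build_uri` helper, and the ids below are all hypothetical placeholders; the real patterns are the ones in curie_map.

```python
def build_uri(template, first_id, second_id, key_map):
    """Fill a two-key URI template from stable ids.

    Returns None when either stable id has no known key, so the caller
    can log the miss instead of emitting a broken URI.
    """
    try:
        return template.format(k1=key_map[first_id], k2=key_map[second_id])
    except KeyError:
        return None


# Hypothetical template and mapping, for illustration only.
TEMPLATE = "https://example.org/view/{k1}/{k2}"
keys = {"EXAMPLE_PIPE_001": 7, "EXAMPLE_PROC_001": 42}

uri = build_uri(TEMPLATE, "EXAMPLE_PIPE_001", "EXAMPLE_PROC_001", keys)
missing = build_uri(TEMPLATE, "EXAMPLE_PIPE_001", "EXAMPLE_PROC_999", keys)
```

When a key changes upstream, only the mapping needs refreshing; nothing has to be edited by hand.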

With three types of stable ids (pipeline, procedure, and parameter; there are others) and the URI taking 2 keys, that is:

there end up being on the order of n squared possible URIs, where n is currently over 5k. That is far too many to bother with manually, and since they do change the keys associated with a stable id (see the unit test), that 25 million or so (5k squared) is just the start.