monarch-initiative / dipper

Data Ingestion Pipeline for Monarch
https://dipper.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

odd identifier to protein showing up for NCBI gene #541

Open selewis opened 7 years ago

selewis commented 7 years ago

For example: human NCBI:5435. Searching for the encoding gene via the ENCODES relation returns several ENSEMBL proteins, UniProtKB:P61218, itself, and (oddly) ":.well-known/genid/NCBIGene5435product"

Clearly not a legit ID

TomConlin commented 7 years ago

Hi Suzi, unfortunately it is not so clear, but it is an opportunity anyway. This is the official syntax of a Skolemised blank node.

in the abstract: https://en.wikipedia.org/wiki/Skolem_normal_form

concrete implementation:
https://www.w3.org/TR/rdf11-concepts/

I am not attempting to address whether you should be coming across it, nor whether it is in itself correct, but it is the result of our decision to materialize all blank nodes.

as of a week or so ago there is a page stub for these

https://monarchinitiative.org/.well-known/genid/NCBIGene5435product

There are plans to flesh it out with label, type, and links to ancestors & descendants. There also needs to be a catchall page like this one for bnode IDs which have sunsetted.

The "identifier" portion "NCBIGene5435product" is a pattern that predates me (and came out of your office space). Current best practice would have it as an opaque digest, and the bnode would include that information as its label.

searching GH for skolem* will find more details
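
To make the two identifier styles concrete, here is a minimal sketch of minting a Skolem IRI under the `.well-known/genid/` path, either with the readable label kept as-is or with an opaque digest in its place. The function name `skolemize`, the choice of SHA-1, and the 20-character truncation are assumptions for illustration only, not what dipper actually does.

```python
import hashlib

# Base IRI as seen in this thread; the genid path comes from RDF 1.1 Concepts.
BASE = "https://monarchinitiative.org/"
GENID_PATH = ".well-known/genid/"


def skolemize(bnode_label: str, opaque: bool = True) -> str:
    """Return a Skolem IRI for a blank node label (hypothetical helper)."""
    if opaque:
        # Assumption: a truncated SHA-1 hex digest stands in for "opaque digest".
        local = hashlib.sha1(bnode_label.encode("utf-8")).hexdigest()[:20]
    else:
        # The older pattern: the readable label is used directly.
        local = bnode_label
    return BASE + GENID_PATH + local


readable = skolemize("NCBIGene5435product", opaque=False)
digest = skolemize("NCBIGene5435product")
print(readable)
print(digest)
```

With the opaque form, the readable string "NCBIGene5435product" would be attached to the node as a label rather than embedded in the IRI.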

selewis commented 7 years ago

I think I can filter out "blank" nodes with one of the parameters, but I'm relieved to know this was intentional, not inadvertent.

selewis commented 7 years ago

Actually the param for including blanks was/is set to False, so it doesn't seem to be working...

selewis commented 7 years ago

Perhaps the initial ':' character makes a difference?

TomConlin commented 7 years ago

In RDF Turtle, ":" (the empty CURIE prefix) is shorthand for the default base IRI, which here is https://monarchinitiative.org/, so
:.well-known/genid/NCBIGene5435product
and
https://monarchinitiative.org/.well-known/genid/NCBIGene5435product
are equivalent.
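
The equivalence can be sketched as a plain prefix expansion. This assumes the empty prefix is bound to https://monarchinitiative.org/, as the comment above implies; in Turtle that binding comes from an `@prefix :` declaration in the file. The `PREFIXES` map and `expand_curie` helper are hypothetical names for illustration.

```python
# Assumed prefix map: only the empty prefix from this thread is declared.
PREFIXES = {"": "https://monarchinitiative.org/"}


def expand_curie(curie: str, prefixes: dict = PREFIXES) -> str:
    """Expand prefix:local into a full IRI; return the input unchanged
    when the prefix is not declared (hypothetical helper)."""
    prefix, sep, local = curie.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return curie


expanded = expand_curie(":.well-known/genid/NCBIGene5435product")
print(expanded)
```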

I do not know where you have a parameter to filter blank nodes, but not all tools/systems are aware of skolemized blank nodes; they may only look for the "_:" CURIE prefix.
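
As a rough sketch of what a skolemization-aware filter could look like on the client side: treat an identifier as a blank node if it uses the "_:" prefix or contains the `.well-known/genid/` path. The helper name `is_blank_node` and the sample result list are assumptions for illustration; this is not how the GOLr parameter is implemented.

```python
GENID_MARKER = ".well-known/genid/"


def is_blank_node(node_id: str) -> bool:
    """True for classic "_:" blank nodes and for skolemized ones
    (hypothetical helper)."""
    if node_id.startswith("_:"):
        return True
    return GENID_MARKER in node_id


# Sample identifiers; the first two appear in this thread, "_:b0" is made up.
returned = [
    "UniProtKB:P61218",
    ":.well-known/genid/NCBIGene5435product",
    "_:b0",
]
kept = [n for n in returned if not is_blank_node(n)]
print(kept)
```

A check that only looks at the "_:" prefix would let the skolemized identifier through, which matches the behaviour reported above.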

selewis commented 7 years ago

Okay, let's close this out. The parameter for blankNodes is passed into GOLr service. Would have to dig into that further. In the meantime, just ignoring it.

selewis commented 7 years ago

Chris would like to leave this open so we don't forget about it.