openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Missing data from Target Information #253

Open danidi opened 9 years ago

danidi commented 9 years ago

Migrating it here from an email conversation (drugbank target mappings). Depending on the URI which is used as query, different information from the target information call is missing. Using uniprot, the ConceptWiki information is missing: https://beta.openphacts.org/1.5/target?uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FP11362&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1

Using the corresponding ConceptWiki URI, the drugbank information is missing: https://beta.openphacts.org/1.5/target?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2Findex%2F6b60572a-1ea7-4c31-8408-b59537dd4b84&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1

Here is another example: https://beta.openphacts.org/1.5/target?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2Fb79f8003-ce3c-4056-9169-7bc93ff7ed60&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1

(The corresponding uniprot URI retrieves the Conceptwiki URI here: https://beta.openphacts.org/1.5/target?uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FQ13233&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1)
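To reproduce the comparison, the query URLs above can be generated rather than pasted by hand. This is a minimal sketch using only the Python standard library. The helper name `target_url` is made up for illustration, and the base URL and parameter names are simply copied from the links above.

```python
from urllib.parse import urlencode

BASE = "https://beta.openphacts.org/1.5/target"  # copied from the links above


def target_url(uri, app_id, app_key, base=BASE):
    """Build a Target Information request URL (hypothetical helper)."""
    # urlencode percent-encodes ':' and '/' inside the uri parameter value
    query = urlencode({"uri": uri, "app_id": app_id, "app_key": app_key})
    return f"{base}?{query}"


if __name__ == "__main__":
    for uri in (
        "http://purl.uniprot.org/uniprot/P11362",
        "http://www.conceptwiki.org/concept/index/6b60572a-1ea7-4c31-8408-b59537dd4b84",
    ):
        # Placeholder credentials: substitute your own app_id and app_key
        print(target_url(uri, "YOUR_APP_ID", "YOUR_APP_KEY"))
```

Fetching each generated URL and comparing the responses shows which blocks depend on the starting identifier.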

stain commented 9 years ago

Could you clarify what you mean by "different information is missing"? Depending on which URI you start with, you are running a different query about a slightly different concept. The information returned is therefore structured according to the identifier you asked about: e.g. uniprot information like <organism> will be on the top-level element if you ask about the uniprot ID, but inside the exactMatch of the mapped uniprot identifier if you ask about the ConceptWiki ID.

Or are there other differences due to different identity mappings?

You will see that the first link does not include identifiers like mesh:C496348 and ncim:C1527757, both of which are protein justifications. The mapped drugbank identifiers are also different. Starting with uniprot, we find a mapping in the new drugbank target v4 mapping:

Drugbank Target v4

BE0002131
    http://bio2rdf.org/drugbank:BE0002131

And thus the uniprot lookup contains:

<item href="http://bio2rdf.org/drugbank:BE0002131">
  <targetForDrug>
    <item href="http://bio2rdf.org/drugbank:DB02058">
      <inDataset href="http://www.openphacts.org/bio2rdf/drugbank"/>
      <genericName xml:lang="en">SU4984</genericName>
      <drug_type xml:lang="en">experimental [drugbank_resource:Experimental]</drug_type>
    </item>
    <!-- .. -->
  </targetForDrug>
</item>

But in the mapping from conceptwiki we only find the v3 drugbank target:

http://identifiers.org/drugbank.target/3854

Other transitives via uniprot:P11362 are followed in the identity mappings of conceptwiki:6b60572a-1ea7-4c31-8408-b59537dd4b84, so it seems that the new uniprot/drugbank linkset is not considered for transitives, even with the All lens.
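One way to make "which block is missing" concrete is to collect every `href` in each response and take the set difference. The sketch below assumes the XML shape shown above. The two sample documents are hand-made stand-ins built from identifiers mentioned in this thread, not real API output, and `hrefs` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def hrefs(xml_text):
    """Return the set of href attribute values found anywhere in the document."""
    root = ET.fromstring(xml_text)
    return {el.get("href") for el in root.iter() if el.get("href")}


# Hand-made stand-ins for the two responses (assumed shape, not real output)
from_uniprot = """<result>
  <exactMatch>
    <item href="http://bio2rdf.org/drugbank:BE0002131"/>
    <item href="http://www.conceptwiki.org/concept/index/6b60572a-1ea7-4c31-8408-b59537dd4b84"/>
  </exactMatch>
</result>"""

from_conceptwiki = """<result>
  <exactMatch>
    <item href="http://purl.uniprot.org/uniprot/P11362"/>
    <item href="http://identifiers.org/drugbank.target/3854"/>
  </exactMatch>
</result>"""

# Identifiers reachable from uniprot but absent when starting from ConceptWiki
only_via_uniprot = hrefs(from_uniprot) - hrefs(from_conceptwiki)

if __name__ == "__main__":
    for href in sorted(only_via_uniprot):
        print(href)
```

Because the starting identifier itself never appears in its own exactMatch list, the difference always contains some expected entries. The interesting ones are dataset identifiers such as the drugbank target.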

danidi commented 9 years ago

Yes, I'm aware of the different structuring with the exactMatch block. But in some cases, a whole block is missing. My first example seems to work properly now. Maybe I overlooked the ConceptWiki block each time I looked at it before. So the only remaining issue is the missing Drugbank block when you start with a URI other than uniprot. Can the one-to-many mappings in db-uniprot-ls.ttl cause problems with the transitives? Or is it possible to add this linkset to the transitives as well?

stain commented 9 years ago

I think this could be related to #251 which shows that the http://bio2rdf.org/drugbank:DB* pattern is missing in 1.5 IMS.

stain commented 9 years ago

The fixes in #251 are now live on ops2, but this problem remains, so it seems to be unrelated and probably has something to do with transitives. I'll leave this open and investigate further tomorrow.

stain commented 9 years ago

I believe this is because http://ops2.few.vu.nl:8081/QueryExpander/dataSource/drugbankTarget and http://ops2.few.vu.nl:8081/QueryExpander/dataSource/drugbankv4.target are not listed as Allowed Middle Sources in the Default lens.

Is your suggestion to add both to the allowed middle sources? I think that might not be what you want..

danidi commented 9 years ago

I don't know (I haven't heard of Allowed Middle Sources so far...). What would be the consequences? Which datasources are currently allowed middle sources? http://ops2.few.vu.nl:8081/QueryExpander/dataSource/drugbankTarget looks strange as it has both molecule and target URIs included. Also, they have the old drugbank version, not sure if they are still valid.

stain commented 9 years ago

Allowed Middle Sources are the sources that can be used as intermediate steps in a transitive mapping. For instance, following the equality links (made-up example):

 conceptwiki --> drugbank --> uniprot --> ensembl

would require both uniprot and drugbank as Middle Sources.
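The effect of the allowed list can be illustrated with a toy traversal. This is only a sketch of the idea and not the BridgeDb/IMS implementation: `reachable` and the `links` table are invented for illustration, and the links mirror the made-up chain above.

```python
from collections import deque


def reachable(links, start, allowed_middle):
    """Return the sources reachable from start when only the start node and
    sources in allowed_middle may be expanded as intermediate hops."""
    found = set()
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # A node that is not an allowed middle source can be a result,
        # but its outgoing links are not followed any further.
        if node != start and node not in allowed_middle:
            continue
        for nxt in links.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                found.add(nxt)
                queue.append(nxt)
    return found


# Made-up equality links, matching the example chain
links = {
    "conceptwiki": {"drugbank"},
    "drugbank": {"uniprot"},
    "uniprot": {"ensembl"},
}

if __name__ == "__main__":
    print(sorted(reachable(links, "conceptwiki", {"drugbank", "uniprot"})))
    print(sorted(reachable(links, "conceptwiki", {"drugbank"})))
```

With both middle sources allowed the traversal reaches ensembl. With only drugbank allowed it stops at uniprot.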

The far right column of Default on http://ops2.few.vu.nl:8081/QueryExpander/Lens shows the sources that are currently allowed.

I'm not sure why drugbankTarget includes URI patterns for both molecules and targets. I'll raise that as a new bug; perhaps we should not include it as a middle source, to be safe. The only linkset included here contains only links to targets:

http://openphacts.cs.man.ac.uk/ims/dev/version1.5.0-SNAPSHOT/ConceptWiki/www4_wiwiss_fu_berlin_de_drugbank_resource_targets-protein.ttl

This molecule pattern is not included for the v4 targets:

http://ops2.few.vu.nl:8081/QueryExpander/dataSource/drugbankv4.target

Whether we need both v3 and v4 drugbank targets, I don't know. That needs to be checked against the cache and queries.

Checking further, I can't see any outgoing links from Drugbankv4.target except back to Uniprot (which would not be followed), so presumably adding it as a middle source would not change the output either, and something else is wrong. I will experiment with some alternative middle sources in my local install to check.

stain commented 9 years ago

To summarize:

The link chain we want followed is ConceptWiki to Uniprot to Drugbank Target v4, using the two mapping sets listed below.

For some reason the transitive link from uniprot to drugbankv4.target is not followed, but it IS shown when looking up the uniprot URI directly. The lenses and justifications should permit this.

@danidi pointed out that there could be an issue within the Transitive-on-the-Fly that needs to be told separately about the new drugbankv4.target linkset, so I'll investigate this using a debugger.

Related mapping sets:

through

Source
    ConceptWiki
Target
    Uniprot-TrEMBL
Predicate
    exactMatch
Justification
    ConceptWikiProtein
Mapping Source
    www_uniprot_org_uniprot-protein.ttl

and

Source
    Uniprot-TrEMBL
Target
    Drugbank Target v4
Predicate
    exactMatch
Justification
    SIO_001171
Mapping Source
    db-uniprot-ls.ttl

stain commented 9 years ago

It is caused by incompatible justifications.

https://github.com/bridgedb/BridgeDb/blob/OpenPHACTS/develop/org.bridgedb.uri.sql/src/org/bridgedb/sql/justification/OpsJustificationMaker.java#L192

shows the justifications that are currently allowed from ConceptWikiProtein, which includes only SIO_010043 (protein) and SIO_000985 (protein coding gene), but not SIO_001171 (database cross-reference) which is what is used in the linkset from Uniprot to Drugbank.

Hence Uniprot to Drugbank is not currently combinable with ConceptWiki to Uniprot. Do you think it should be? If I add SIO_001171, it would also allow lots of other linksets to be followed transitively through ConceptWiki, e.g. Ensembl (which a comment in the code says is explicitly not allowed).

One workaround would be a hand-edited VoID file with a justification other than SIO_001171 (which we could then add to the OpsJustificationMaker if needed). What is truly the link from uniprot to drugbank? Can it be something more specific than "cross reference"?
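The incompatibility can be modelled with a small lookup table. This is a toy sketch and not the logic of OpsJustificationMaker: only the ConceptWikiProtein row is taken from the description above, and the names `COMBINABLE`, `can_chain` and `follow` are invented for illustration.

```python
# Toy table: which justifications may follow a given justification in a chain.
# Only this row comes from the thread; the real rules are more involved.
COMBINABLE = {
    "ConceptWikiProtein": {"SIO_010043", "SIO_000985"},
}


def can_chain(first, second):
    """True if a linkset justified by `second` may follow one justified by `first`."""
    return second in COMBINABLE.get(first, set())


def follow(chain):
    """chain is a list of (source, target, justification) hops.
    Return the targets reached before the first incompatible hop."""
    reached = []
    previous = None
    for _source, target, justification in chain:
        if previous is not None and not can_chain(previous, justification):
            break
        reached.append(target)
        previous = justification
    return reached


current = [
    ("ConceptWiki", "Uniprot-TrEMBL", "ConceptWikiProtein"),
    ("Uniprot-TrEMBL", "Drugbank Target v4", "SIO_001171"),
]
# Hypothetical variant: the same linkset loaded with a protein justification
relabelled = [current[0], ("Uniprot-TrEMBL", "Drugbank Target v4", "SIO_010043")]

if __name__ == "__main__":
    print(follow(current))
    print(follow(relabelled))
```

In this model the chain stops at Uniprot-TrEMBL with the current justification and continues to Drugbank Target v4 once the second linkset carries a justification that ConceptWikiProtein accepts.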

danidi commented 9 years ago

Wow, congratulations on figuring that one out! Could SIO_010043 (protein) be added to the Uniprot/Drugbank linkset, to add only this one for now? I think the protein justification is used for mappings between different protein identifiers (although the definition of a protein is basically something else). I'm a bit hesitant to include all database cross-reference datasets, if we don't know which other datasets this would include. Maybe something to keep in mind for the reload of the IMS in the future?

stain commented 9 years ago

I think the easiest workaround is to just use SIO_010043 here, which is a simple change to the data loading and requires no code changes. I shall have a go.

stain commented 9 years ago

The workaround of loading with SIO_010043 works well.

See http://ops2.few.vu.nl:8081/QueryExpander/mapUri?Uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2Findex%2F6b60572a-1ea7-4c31-8408-b59537dd4b84&lensUri=http%3A%2F%2Fopenphacts.org%2Fspecs%2F%2FLens%2FDefault&Pattern+Filter=&overridePredicateURI=&format=text%2Fhtml

which now includes http://bio2rdf.org/drugbank:BE0002131

and thus drugbank info is included in ops2:

http://ops2.few.vu.nl/1.5/target?uri=http%3A%2F%2Fwww.conceptwiki.org%2Fconcept%2Fb79f8003-ce3c-4056-9169-7bc93ff7ed60&app_id=f91c5b2b&app_key=18a5d823d0e4933ac5fe22a3d52974c1

So I hand it over to Yrjänä @ghard to deploy at OpenLink.

stain commented 9 years ago

Sysadmin info for @antonisloizou - updated Docker containers on ops2.few.vu.nl are ims-20150513 linked to mysql-for-ims-20150513, ports remain the same (AJP 8009, HTTP 8081) e.g. http://ops2.few.vu.nl:8081/QueryExpander/

The "semi-empty" chembl20 instance at http://ops2.few.vu.nl:8082/QueryExpander has not been updated as it has not got the drugbank linkset yet.

Documentation on how to reproduce: https://github.com/openphacts/queryExpander/tree/master/docker#custom-data-loading

nicklynch commented 9 years ago

@stain is this still relevant to IMS 2.0 or should we close?