openphacts / GLOBAL

Global project issues [private for now. owner lee harland]

Remove protein groups in Drugbank-Uniprot linkset #284

Open stain opened 8 years ago

stain commented 8 years ago

@ianwdunlop and @danidi raise:

it seems the one to many mappings in the drugbank target linkset do actually have an impact: http://support.openphacts.org/discussions/topics/4000322556/page/last#post_4000454933 Maybe we should look at a possibility to exclude those from the default lens in the reload?

API call: https://beta.openphacts.org/1.5/target?_format=json&app_key=ad4fca9111f258325e3ca50e7217dcbc&app_id=a409dcc9&uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FQ9Y5Y9

IMS call:

http://openphacts.cs.man.ac.uk:9095/QueryExpander/mapUri?Uri=http%3A%2F%2Fpurl.uniprot.org%2Funiprot%2FQ9Y5Y9&lensUri=http%3A%2F%2Fopenphacts.org%2Fspecs%2F%2FLens%2FDefault&Pattern+Filter=&overridePredicateURI=&format=text%2Fhtml

says that http://purl.uniprot.org/uniprot/Q9Y5Y9 (Sodium channel protein type 10 subunit alpha) matches both DrugBank targets BE0004901 and BE0000177.

RDF data for BE0004901 contains multiple values for theoretical-pi, molecular weight, locus, etc - see this example sparql query

In the db-uniprot-ls RDF linkset we find this as a one-to-many mapping:

ns1:BE0004901 skos:exactMatch ns2:Q99250 ,
    ns2:Q8IWT1 ,
    ns2:Q07699 ,
    ns2:Q01118 ,
    ns2:Q9UQD0 ,
    ns2:Q9UI33 ,
    ns2:Q9NY46 ,
    ns2:Q9NY72 ,
    ns2:O60939 ,
    ns2:P35499 ,
    ns2:Q9Y5Y9 ,
    ns2:P35498 ,
    ns2:Q15858 ,
    ns2:Q14524 .

# ..

ns1:BE0000177 skos:exactMatch ns2:Q9Y5Y9 .

@antonisloizou made the linkset - while @stain modified the void in #253 to use justification protein rather than cross reference.

The linkset was made by SPARQL query over our Uniprot and Bio2RDF Drugbank data - see the void.

I can't find an easy way to modify the query to filter out the protein groups, as bio2rdf provides no such typing (which should probably be raised with bio2rdf.org), except perhaps by skipping targets that have more than one molecular weight?
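To make the mapping problem concrete, here is a minimal Python sketch that groups exactMatch pairs by subject and by object. It uses only a toy subset of the pairs quoted above; the `LINKS` list and the `fan_out` helper are illustrative names, not part of the linkset tooling.

```python
from collections import defaultdict

# Toy subset of the exactMatch pairs quoted in this issue:
# (DrugBank target, UniProt accession).
LINKS = [
    ("BE0004901", "Q99250"),
    ("BE0004901", "Q9Y5Y9"),
    ("BE0004901", "P35498"),
    ("BE0000177", "Q9Y5Y9"),
]


def fan_out(links):
    """Return {subject: sorted objects} for subjects linked to more than one object."""
    by_subject = defaultdict(set)
    for subject, obj in links:
        by_subject[subject].add(obj)
    return {s: sorted(objs) for s, objs in by_subject.items() if len(objs) > 1}


# DrugBank targets that expand to several UniProt accessions.
one_to_many = fan_out(LINKS)
# Swap each pair to find UniProt accessions matched by several DrugBank targets.
many_to_one = fan_out([(obj, subject) for subject, obj in LINKS])

print(one_to_many)
print(many_to_one)
```

On this toy data the group entry BE0004901 shows up as one-to-many, and Q9Y5Y9 shows up as matched by both the group and the single-protein entry, which is the situation the IMS call reports.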

Chris-Evelo commented 8 years ago

I think that db-uniprot-ls is doing things that a linkset should not do. It resolves a descriptor for a group of proteins to the actual proteins belonging to that group (which it should not do) and resolves a specific drugbank target to the corresponding UniProt ID (which it should do). It would be nice if we could split that linkset in two, where the group descriptions would not become available in the IMS but in the cache.

danidi commented 8 years ago

One problem here (as far as I see) is that we could lose the connection between a uniprot target and the drug via drugbank, if it is there only connected to the protein group e.g. a channel with many subtypes. But as we are splitting the Chembl target linkset up, to include only single target interactions to the default lens, I think it would be consistent to remove the one to many mappings from drugbank as well (at least for default). I don't know how responsive the bio2rdf team is, but I guess the most correct solution would be to add the "kind" attribute from drugbank to the rdf, and then retrieve mappings for "protein" only. Until then, skipping the ones with more than one molecular weight could work, or a manual edit of the linkset.
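The risk described here can be checked before dropping the group mappings. The following is a rough Python sketch, assuming we already know which DrugBank targets are groups; `only_via_groups`, `LINKS` and `GROUPS` are hypothetical names, and the data is a toy subset of the pairs quoted earlier in this issue, so the result says nothing about the real linkset.

```python
def only_via_groups(links, group_targets):
    """Return UniProt accessions that are linked solely through group targets."""
    direct = {uniprot for target, uniprot in links if target not in group_targets}
    via_group = {uniprot for target, uniprot in links if target in group_targets}
    return sorted(via_group - direct)


# Toy subset of (DrugBank target, UniProt accession) pairs from this issue.
LINKS = [
    ("BE0004901", "Q99250"),
    ("BE0004901", "Q9Y5Y9"),
    ("BE0004901", "P35498"),
    ("BE0000177", "Q9Y5Y9"),
]
# Assumption for illustration: BE0004901 is the only group in the toy data.
GROUPS = {"BE0004901"}

lost = only_via_groups(LINKS, GROUPS)
print(lost)
```

In the toy data Q9Y5Y9 keeps its link through BE0000177, while the other two accessions would lose their DrugBank connection. In the full linkset those proteins may well have their own single-protein entries.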

AlasdairGray commented 8 years ago

We could include both drug bank link sets but with different justifications; they are doing different jobs.

stain commented 8 years ago

Using http://heater.cs.man.ac.uk:3003/sparql (1.5 data) - this query works:

PREFIX db: <http://bio2rdf.org/drugbank_vocabulary:>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX uniprot: <http://purl.uniprot.org/core/>

CONSTRUCT {
  ?target skos:exactMatch ?uniprot .
}
WHERE {
  GRAPH <http://www.openphacts.org/bio2rdf/drugbank> {
    [] db:target ?target .
    ?target db:x-uniprot ?uniprot .

    # Count the molecular weights recorded for each target
    {
      SELECT ?target (COUNT(?molweight) AS ?molweightCount)
      WHERE {
        ?target db:molecular-weight ?molweight .
      }
      GROUP BY ?target
    }
  }
  GRAPH <http://purl.uniprot.org> {
    ?uniprot a uniprot:Protein .
  }
  # Keep only targets with exactly one molecular weight (single proteins)
  FILTER ( ?molweightCount = 1 )
}

Gives 3572 matches, all 1-to-1.

To get the excluded groups instead, change the filter to:

  FILTER ( ?molweightCount > 1 )

which gives 201 results, all 1-to-many.

That still excludes the 3 links that don't have any molecular weight.

Should they be added to the first linkset?
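The three cases (no molecular weight, exactly one, more than one) can be sketched outside SPARQL as follows. This is only an illustration in Python: the function name, the `BE_NO_WEIGHT` identifier and the weight values are invented placeholders, not data from DrugBank.

```python
from collections import Counter


def split_by_molweight(targets, molweight_rows):
    """Bucket targets by how many molecular-weight values they carry."""
    counts = Counter(target for target, _weight in molweight_rows)
    buckets = {"single": [], "group": [], "missing": []}
    for target in targets:
        n = counts[target]  # Counter returns 0 for targets with no rows
        if n == 0:
            buckets["missing"].append(target)
        elif n == 1:
            buckets["single"].append(target)
        else:
            buckets["group"].append(target)
    return buckets


# Hypothetical rows: (target, molecular weight). Values are placeholders.
TARGETS = ["BE0000177", "BE0004901", "BE_NO_WEIGHT"]
MOLWEIGHT_ROWS = [
    ("BE0000177", 1.0),
    ("BE0004901", 2.0),
    ("BE0004901", 3.0),
]

buckets = split_by_molweight(TARGETS, MOLWEIGHT_ROWS)
print(buckets)
```

Targets in the "missing" bucket never appear in the grouped count at all, which is why a plain `?molweightCount = 1` filter drops them along with the groups.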

stain commented 8 years ago

Yes, I think those without a molecular weight should be included, so that you still get DrugBank info for them.

So I'll refine the query.

stain commented 8 years ago

Refined to include targets with no molecular weight:

PREFIX db: <http://bio2rdf.org/drugbank_vocabulary:>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX uniprot: <http://purl.uniprot.org/core/>

CONSTRUCT {
  ?target skos:exactMatch ?uniprot .
}
WHERE {
  GRAPH <http://www.openphacts.org/bio2rdf/drugbank> {
    [] db:target ?target .
    ?target db:x-uniprot ?uniprot .

    # The count is optional, so targets without any molecular weight are kept
    OPTIONAL {
      {
        SELECT ?target (COUNT(?molweight) AS ?molweightCount)
        WHERE {
          ?target db:molecular-weight ?molweight .
        }
        GROUP BY ?target
      }
    }
  }
  GRAPH <http://purl.uniprot.org> {
    ?uniprot a uniprot:Protein .
  }
  # Keep targets with no molecular weight or exactly one
  FILTER ( !( bound(?molweightCount) ) || ?molweightCount = 1 )
}

I'll update the void

stain commented 8 years ago

Linksets now at http://data.openphacts.org/dev/ims/linksets/drugbank/ -- partially updated void (need to add the second linkset) -- which justification should I use for the second linkset, @AlasdairGray ? Something to do with protein groups? Is exactMatch still valid there? (I think they are protein groups in Uniprot as well, but worth checking)

danidi commented 8 years ago

http://data.openphacts.org/dev/ims/linksets/chembl/chembl_20.0_grouptarget_targetcmpt_ls.ttl uses related match, maybe that would be applicable here as well? The justification there is 'has member' http://semanticscience.org/resource/SIO_000059.rdf (if I read the void correctly). Maybe using the same justification would make it easier to combine them in a lens?

AlasdairGray commented 8 years ago

Is a protein group the same as a protein complex or is it a family of proteins?

danidi commented 8 years ago

I think drugbank includes both complexes and families as protein groups. E.g. http://www.drugbank.ca/biodb/bio_entities/BE0004863 and http://www.drugbank.ca/biodb/bio_entities/BE0004888 look like families to me, others like http://www.drugbank.ca/biodb/bio_entities/BE0004924 contain proteins that seem to be part of a complex.

AlasdairGray commented 8 years ago

I assume there is no way of distinguishing this without manual inspection of the data. They all seem to claim to be protein groups.

Is anyone aware of an ontological term to capture this? I've had a quick look around and can't see anything that would fit.

stain commented 8 years ago

We can use skos:narrowMatch as the relation, which I think is better than skos:relatedMatch in this case, since there is a simple hierarchical relation. Would the justification then be OK to stay as protein?

stain commented 8 years ago

Or should we put the justification back to cross reference (for the protein group linkset), as that's what it initially was in drugbank?

AlasdairGray commented 8 years ago

Going back to the original example here because I am losing the thread slightly.

http://purl.uniprot.org/uniprot/Q9Y5Y9 (Sodium channel protein type 10 subunit alpha) should be a skos:exactMatch http://www.drugbank.ca/biodb/bio_entities/BE0000177 (Protein: Sodium channel protein type 10 subunit alpha) with the justification that they are the same protein.

http://purl.uniprot.org/uniprot/Q9Y5Y9 (Sodium channel protein type 10 subunit alpha) should be linked by the inverse of obo:has_part to http://www.drugbank.ca/biodb/bio_entities/BE0004901 (Protein Group: Sodium channel protein). I guess we could keep the justification as protein.
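As a rough illustration of the two statements above, the Python sketch below writes them out as N-Triples-style lines. The helper names are invented, the DrugBank URL form is the one used in this comment, and the group predicate is left as a parameter because the thread has not settled on one. The example passes http://purl.obolibrary.org/obo/BFO_0000050 ("part of") purely as an assumed stand-in for "the inverse of has_part".

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
UNIPROT = "http://purl.uniprot.org/uniprot/"
# URL form as written in this comment; the linkset itself may use other URIs.
DRUGBANK = "http://www.drugbank.ca/biodb/bio_entities/"


def ntriple(subject, predicate, obj):
    """Format one triple of IRIs as an N-Triples line."""
    return f"<{subject}> <{predicate}> <{obj}> ."


def protein_and_group_links(uniprot_id, protein_id, group_id, group_predicate):
    """Return (exact-match line, group line) for one UniProt protein."""
    exact = ntriple(UNIPROT + uniprot_id, SKOS + "exactMatch", DRUGBANK + protein_id)
    group = ntriple(UNIPROT + uniprot_id, group_predicate, DRUGBANK + group_id)
    return exact, group


# Assumption: BFO_0000050 ("part of") stands in for the inverse of has_part.
PART_OF = "http://purl.obolibrary.org/obo/BFO_0000050"

exact_line, group_line = protein_and_group_links(
    "Q9Y5Y9", "BE0000177", "BE0004901", PART_OF
)
print(exact_line)
print(group_line)
```

Keeping the predicate as a parameter makes it easy to swap in a SKOS mapping relation or another relation once one is agreed.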

stain commented 8 years ago

What URL do you mean exactly when you say obo:has_part? Do you mean http://purl.obolibrary.org/obo/BFO_0000051 ?

Is this appropriate for protein group/family vs proteins? What common compatible class do they have?

editor note: Parthood requires the part and the whole to have compatible classes: only an occurrent can have an occurrent as part; only a process can have a process as part; only a continuant can have a continuant as part; only an independent continuant can have an independent continuant as part; only a specifically dependent continuant can have a specifically dependent continuant as part; only a generically dependent continuant can have a generically dependent continuant as part. (This list is not exhaustive.)

A continuant cannot have an occurrent as part: use 'participates in'. An occurrent cannot have a continuant as part: use 'has participant'. An immaterial entity cannot have a material entity as part: use 'location of'. An independent continuant cannot have a specifically dependent continuant as part: use 'bearer of'. A specifically dependent continuant cannot have an independent continuant as part: use 'inheres in'.

Everything has itself as a part. Any part of any part of a thing is itself part of that thing. Two distinct things cannot have each other as a part.

Occurrents are not subject to change and so parthood between occurrents holds for all the times that the part exists. Many continuants are subject to change, so parthood between continuants will only hold at certain times, but this is difficult to specify in OWL. See https://code.google.com/p/obo-relations/wiki/ROAndTime

example of usage: my body has part my brain (continuant parthood, two material entities); this year has part this day (occurrent parthood); my stomach has part my stomach cavity (continuant parthood, material entity has part immaterial entity)

In the old RSC linksets we had dul:expresses obo2:has_part - http://data.openphacts.org/1.5/ims/linksets/RSC/void_2013-11-12.ttl -- expands to http://purl.obolibrary.org/obo#has_part

That seems odd to me, as the URLs don't match up. I never understand these OBO ontologies that have so many identifiers for the same thing. There's also in http://www.obofoundry.org/ro/pre/ro.owl the property http://purl.org/obo/owl/OBO_REL#has_part which, if you click the URL, actually doesn't define that URL, but http://www.obofoundry.org/ro/ro.owl#has_part instead. HEEEELP!

Besides this - in the RSC example we used it with dul:expresses and used skos:relatedMatch in the actual linkset.

I am not sure if we would need to modify the lenses, or if the All lens would be happy with this whichever way we go.

danidi commented 8 years ago

For the lenses, it would be good if we could in the end distinguish between complexes and families (especially for ChEMBL, where we have different linksets for these), as users might want to retrieve data for complexes, but not for families. Where would we retrieve the DrugBank data here? Is that easy to distinguish if the justification is protein in all cases? I also saw http://purl.obolibrary.org/obo/RO_0002351 (has member), which might be useful for families ("has member is a mereological relation between a collection and an item").

batchelorc commented 8 years ago

Hello all,

For being a member of a family you just want the subclass relation, I’m pretty sure.

Best wishes, Colin.

Chris-Evelo commented 8 years ago

Herman Hermjakob told me they are working on a new IntAct resource for complexes: https://www.ebi.ac.uk/intact/complex/ That should become the central place for complexes used in EBI resources. We might be able to map complexes (not groups of related proteins, but real interacting proteins in a complex) from various resources to this as a central ID provider.

AlasdairGray commented 8 years ago

@stain I would model them with the same predicate and justifications that we have used for the new ChEMBL linksets