obophenotype / cell-ontology

An ontology of cell types
https://obophenotype.github.io/cell-ontology/
Creative Commons Attribution 4.0 International

Map all PRO terms used in CL to uniprot (where possible). #2293

Open dosumis opened 5 months ago

dosumis commented 5 months ago

We need to be able to map PRO terms used by CL to something the rest of the world can use. I think that means UniProt. Xrefs to UniProt are rare:

https://api.triplydb.com/s/tuAThwx4i

We mostly have xrefs to

Where we can't map based on ID, I think we may need to resort to lexical mapping. One option for this is GILDA.

@addiehl - any other suggestions based on your prior work on these + other linked resources?

dosumis commented 5 months ago

@cmungall - any suggestions for strategy?

addiehl commented 5 months ago

It might be useful to ask Darren @nataled

nataled commented 5 months ago

I'll overlook the "something the rest of the world can use" comment ;)

The results of that SPARQL query fall into two types:

1) The xref points to a protein family. These are cases where the PRO term was created on the basis of the indicated xref at the time the term was created. Prefixes include:

PIRSF: https://proteininformationresource.org/cgi-bin/ipcSF?id=
PANTHER: http://www.pantherdb.org/panther/family.do?clsAccession=
IUPHARfam: http://www.guidetopharmacology.org/GRAC/FamilyDisplayForward?familyId=
IUPHARobj: http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=

2) The xref points to a specific protein or proteoform. For all these, the DTO and Reactome xrefs are superfluous in that they also have a UniProtKB xref. Prefixes include:

UniProtKB: http://purl.uniprot.org/uniprot/
DTO: http://www.drugtargetontology.org/dto/DTO_
Reactome: http://www.reactome.org/content/detail/

For the first set, no single UniProtKB mapping is appropriate. Are you trying to obtain all the possible UniProtKB entries pertinent to those xrefs?
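The two xref types described above can be told apart by prefix. A minimal Python sketch, assuming xrefs arrive as `prefix:local_id` strings; the function name `classify_xref` and the "family"/"protein" labels are illustrative, and the URL prefixes are copied from the lists above:

```python
# URL prefixes copied from the comment above.
XREF_URL_PREFIXES = {
    "PIRSF": "https://proteininformationresource.org/cgi-bin/ipcSF?id=",
    "PANTHER": "http://www.pantherdb.org/panther/family.do?clsAccession=",
    "IUPHARfam": "http://www.guidetopharmacology.org/GRAC/FamilyDisplayForward?familyId=",
    "IUPHARobj": "http://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=",
    "UniProtKB": "http://purl.uniprot.org/uniprot/",
    "DTO": "http://www.drugtargetontology.org/dto/DTO_",
    "Reactome": "http://www.reactome.org/content/detail/",
}

# Prefixes listed under type 1 (protein family) in the comment above.
FAMILY_PREFIXES = {"PIRSF", "PANTHER", "IUPHARfam", "IUPHARobj"}


def classify_xref(xref):
    """Return (kind, url) for an xref such as 'PIRSF:PIRSF016630'.

    kind is 'family', 'protein', or 'unknown' (hypothetical labels).
    """
    prefix, sep, local_id = xref.partition(":")
    if not sep or prefix not in XREF_URL_PREFIXES:
        return ("unknown", None)
    kind = "family" if prefix in FAMILY_PREFIXES else "protein"
    return (kind, XREF_URL_PREFIXES[prefix] + local_id)
```

Only xrefs classified as "protein" with the UniProtKB prefix give a direct mapping; the "family" ones need a further lookup of member proteins.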

dosumis commented 4 months ago

@nataled - many thanks for the details.

Various uses. In general, including IDs that bioinformaticians are familiar with opens up more possibilities for them to use markers recorded in CL in their analyses.

More specifically, we're working on a Cell Type knowledge base with a focus on cell markers in human and mouse. We have other sources of known and potential markers - curated and computed. I'd like to find some way to fold in curated cell surface markers from CL.

It looks to me like in most cases 'family' here means a general term for the gene across species.

| i | pro_label | PRO ID | xref |
|---|---|---|---|
| 1 | "CD19 molecule" | obo:PR_000001002 | "IUPHARobj:2764" |
| 2 | "CD19 molecule" | obo:PR_000001002 | "PIRSF:PIRSF016630" |

It also looks like we could pull the mouse and human UniProt IDs from the PIR pages: https://proteininformationresource.org/cgi-bin/ipcSF?id=PIRSF016630. Is there an API option? If not, we will scrape. This will work for our KB plans. I think it would also be useful to include these IDs in CL under some annotation property (AP).
 

dosumis commented 4 months ago

Seems we can use the structure of PRO to extract many of these, e.g.

https://api.triplydb.com/s/WGSZidIVe

| PRO - CL Marker | Mouse specific subclass | mouse xref |
|---|---|---|
| ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 | ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (mouse) | UniProtKB:P56528 |
| B-cell lymphoma 6 protein | B-cell lymphoma 6 protein homolog (mouse) | UniProtKB:P41183 |
| B-cell receptor CD22 | B-cell receptor CD22 (mouse) | UniProtKB:P35329 |
| C-C chemokine receptor type 1 | C-C chemokine receptor type 1 (mouse) | UniProtKB:P51675 |
| C-C chemokine receptor type 2 | C-C chemokine receptor type 2 (mouse) | UniProtKB:P51683 |

The subclasses are not (currently) in the import, and even if they were, we should still find some way to better support bioinformatician users. From looking at the numbers, this won't work in every case, but it is a good start.
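As a rough illustration of using the PRO structure this way, here is a Python sketch that collects UniProtKB accessions per CL marker from query rows like those listed above. It assumes each row is a (marker label, subclass label, xref) tuple and that species-specific subclasses carry a "(mouse)" style label suffix, as the rows shown do; the function name `uniprot_by_marker` is hypothetical.

```python
from collections import defaultdict

# Three of the rows shown above: (PRO marker label, mouse subclass label, xref).
ROWS = [
    ("ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1",
     "ADP-ribosyl cyclase/cyclic ADP-ribose hydrolase 1 (mouse)",
     "UniProtKB:P56528"),
    ("B-cell lymphoma 6 protein",
     "B-cell lymphoma 6 protein homolog (mouse)",
     "UniProtKB:P41183"),
    ("B-cell receptor CD22",
     "B-cell receptor CD22 (mouse)",
     "UniProtKB:P35329"),
]


def uniprot_by_marker(rows, species="mouse"):
    """Group UniProtKB accessions by generic marker label.

    Assumption: species-specific subclasses are recognised by a label
    ending in '(<species>)', which holds for the rows shown above.
    """
    suffix = f"({species})"
    result = defaultdict(list)
    for marker, subclass_label, xref in rows:
        if subclass_label.endswith(suffix) and xref.startswith("UniProtKB:"):
            result[marker].append(xref.split(":", 1)[1])
    return dict(result)
```

Matching on the label suffix is fragile; a taxon restriction on the subclass, if available in the query results, would be a more reliable filter.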

Suggested mechanism to extract:

For all PRO terms used as markers for CL terms:

TBD: Accessible representation in CL.

dosumis commented 4 months ago

CC @AvolaAmg

cmungall commented 4 months ago

Yes, I believe most of the PR terms used in CL are category=gene and follow a stereotypical text definition marking them as the product of the reflexive ortholog of the human gene.

Eg https://www.ebi.ac.uk/ols4/ontologies/pr/classes/http%253A%252F%252Fpurl.obolibrary.org%252Fobo%252FPR_000001408?lang=en

Ideally PRO would have logical definitions for these, which would make tracing back easier. It should be easy to do this via string matching, but ideally this would be done upstream in PRO.

Another idea would be for PRO to release SSSOM with inferred downward mappings for all category=gene terms.


nataled commented 4 months ago

The file containing PIRSF membership can be found at https://proteininformationresource.org/projects/pirsf/. Note that the identifiers in this file don't contain 'PIR' (so, 'SF001234' instead of 'PIRSF001234'). This file goes beyond human and mouse, if that's what you need. If you only want human and mouse, then you can use our 'descendants' API for PRO:

https://lod.proconsortium.org/api.html#/DAG/getDescendantByProIDs

which is part of a larger set of APIs given here:

https://lod.proconsortium.org/api.html

You'll want to focus on the terms with local IDs that have UniProtKB accessions without a dash.
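The two identifier conventions mentioned above can be handled with small helpers. This Python sketch assumes PRO term IDs arrive as CURIEs such as `PR:P56528` and that IDs with a dash (for example `PR:P56528-1`) are the ones to skip; both helper names are hypothetical, and the response format of the descendants API is not modelled here. The accession pattern is the one given in the UniProt documentation.

```python
import re

# UniProtKB accession pattern from the UniProt documentation.
UNIPROT_ACCESSION = re.compile(
    r"[OPQ][0-9][A-Z0-9]{3}[0-9]|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
)


def to_pirsf_file_id(xref_id):
    """Drop the leading 'PIR' so the ID matches the PIRSF membership file.

    Example from the comment above: 'PIRSF001234' becomes 'SF001234'.
    """
    return xref_id[3:] if xref_id.startswith("PIRSF") else xref_id


def uniprot_from_pro_id(pro_id):
    """Return the UniProtKB accession embedded in a PRO ID, or None.

    Assumption: local IDs containing a dash are not plain UniProtKB
    accessions and are skipped, as suggested in the comment above.
    """
    local_id = pro_id.split(":", 1)[-1]
    if "-" in local_id:
        return None
    return local_id if UNIPROT_ACCESSION.fullmatch(local_id) else None
```

Numeric PRO IDs such as `PR:000001002` return `None`, so the same filter separates generic terms from UniProtKB-based descendants.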