saezlab / OmnipathR

R client for the OmniPath web service
https://r.omnipathdb.org/

How to get data on the interactions of all receptor ligands? #75

Closed qijt123 closed 1 year ago

qijt123 commented 1 year ago

Hi,

Thank you very much for the OmniPath database. I am interested in intercellular communication, so I would like to retrieve the interactions of all receptors and ligands in the OmniPath database. How should I do that?

In addition, I would also like to know the pathway information for receptor-ligand pairs. Can this be obtained from OmniPath?

Thanks.

deeenes commented 1 year ago

Hi @qijt123,

The intercell annotations in OmniPath aim to cover the broadest available information, which means they contain a number of false positives. You can filter the intercell network by localization (e.g. ligands must be secreted, receptors must be located in the plasma membrane) and also by consensus across resources, as shown here. This function provides the greatest flexibility, though some arguments of this function also provide basic filtering.
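As a rough illustration of the two filtering ideas above (localization and consensus), here is a minimal sketch on a toy data frame. It does not call OmnipathR; the column names (`ligand_secreted`, `receptor_transmembrane`, `ligand_consensus`, `receptor_consensus`) and the threshold are hypothetical stand-ins, so check the actual column names of your intercell network before adapting it.

```r
# Toy stand-in for an intercell network. All column names and values
# here are hypothetical; they are not the actual OmnipathR output.
net <- data.frame(
    ligand = c("ANGPTL2", "GENE_A", "GENE_B", "GENE_C"),
    receptor = c("REC_1", "REC_2", "REC_3", "REC_4"),
    ligand_secreted = c(TRUE, FALSE, TRUE, TRUE),
    receptor_transmembrane = c(TRUE, TRUE, FALSE, TRUE),
    ligand_consensus = c(5, 4, 3, 1),
    receptor_consensus = c(4, 4, 3, 3),
    stringsAsFactors = FALSE
)

# Hypothetical threshold: at least 2 resources must agree on each role.
min_consensus <- 2

# Keep pairs where the ligand is secreted, the receptor is a plasma
# membrane transmembrane protein, and both annotations reach the
# consensus threshold.
filtered <- subset(
    net,
    ligand_secreted &
        receptor_transmembrane &
        ligand_consensus >= min_consensus &
        receptor_consensus >= min_consensus
)

print(filtered[, c("ligand", "receptor")])
```

Stricter thresholds give a smaller network with fewer false positives, so the right value depends on the downstream method.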

Alternatively, you can use interactions that have been curated in a cell-cell communication context.

About the pathways, see the answer here. Pathways are available in the OmniPath Annotations database. Please note that the concept of a pathway differs greatly between resources: a pathway in SignaLink has a completely different meaning than a pathway in KEGG or SIGNOR. Pathways are ultimately functional annotations, i.e. they only tell you that some genes or proteins have something to do with a common biological function. This means you can consider other functional annotations too, e.g. MSigDB, HGNC, UniProt. You can explore these and more resources in the OmniPath Annotations database:

library(OmnipathR)
get_annotation_resources()
 [1] "Adhesome"             "Almen2009"            "Baccin2019"           "CancerDrugsDB"        "CancerGeneCensus"     "CancerSEA"            "CellCall"             "CellCellInteractions" "CellChatDB"           "CellChatDB_complex"   "Cellinker"            "Cellinker_complex"   
[13] "CellPhoneDB"          "CellPhoneDB_complex"  "CellTalkDB"           "CellTypist"           "ComPPI"               "connectomeDB2020"     "CORUM_Funcat"         "CORUM_GO"             "CSPA"                 "CSPA_celltype"        "CytoSig"              "DGIdb"               
[25] "DisGeNet"             "EMBRACE"              "Exocarta"             "GO_Intercell"         "GPCRdb"               "Guide2Pharma"         "HGNC"                 "HPA_secretome"        "HPA_subcellular"      "HPA_tissue"           "HPMR"                 "HumanCellMap"        
[37] "ICELLNET"             "ICELLNET_complex"     "Integrins"            "InterPro"             "IntOGen"              "iTALK"                "KEGG-PC"              "kinase.com"           "Kirouac2010"          "Lambert2018"          "LOCATE"               "LRdb"                
[49] "Matrisome"            "MatrixDB"             "MCAM"                 "Membranome"           "MSigDB"               "NetPath"              "OPM"                  "PanglaoDB"            "Phobius"              "Phosphatome"          "PROGENy"              "Ramilowski_location" 
[61] "Ramilowski2015"       "scConnect"            "scConnect_complex"    "SignaLink_function"   "SignaLink_pathway"    "SIGNOR"               "Surfaceome"           "talklr"               "TCDB"                 "TFcensus"             "TopDB"                "UniProt_family"      
[73] "UniProt_keyword"      "UniProt_location"     "UniProt_tissue"       "UniProt_topology"     "Vesiclepedia"         "Wang"                 "Zhong2015"

Then you can access the resources that interest you; using wide = TRUE gives a more convenient format:

library(OmnipathR)
kpc <- import_omnipath_annotations(resources = 'KEGG-PC', wide = TRUE)
kpc
# A tibble: 2,904 × 4
   uniprot genesymbol entity_type pathway                                    
   <chr>   <chr>      <chr>       <chr>                                      
 1 A8K7J7  A8K7J7     protein     Galactose metabolism                       
 2 A8K7J7  A8K7J7     protein     Fructose and mannose metabolism            
 3 A8K7J7  A8K7J7     protein     Starch and sucrose metabolism              
 4 A8K7J7  A8K7J7     protein     Amino sugar and nucleotide sugar metabolism
 5 A8K7J7  A8K7J7     protein     Metabolic pathways                         
 6 A8K7J7  A8K7J7     protein     Butirosin and neomycin biosynthesis        
 7 A8K7J7  A8K7J7     protein     Glycolysis / Gluconeogenesis               
 8 B4DDQ8  B4DDQ8     protein     Glycolysis / Gluconeogenesis               
 9 B4DDQ8  B4DDQ8     protein     Pentose phosphate pathway                  
10 B4DDQ8  B4DDQ8     protein     Starch and sucrose metabolism              
# ℹ 2,894 more rows
# ℹ Use `print(n = ...)` to see more rows

The pathway annotations can be added to the network data frame using this function. It is enough to provide either the name of the annotation resource or the annotation data frame. Some interaction annotations might be useful too; you can explore them by following the vignette. Another question is which network datasets to use: see the description of the datasets here. I would recommend using omnipath and ligrecextra, and, if you need even more interactions, perhaps also pathwayextra. The optimal size of the network depends on your downstream methods.
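To make the idea of adding pathway annotations to a network concrete, here is a minimal base R sketch on toy data. It is not the OmnipathR function referred to above; it only assumes that the network has `source` and `target` identifier columns and that the annotation table has `uniprot` and `pathway` columns, as in the KEGG-PC output shown earlier. The identifiers and pathway names are made up.

```r
# Toy network: one row per interaction (hypothetical identifiers).
network <- data.frame(
    source = c("P1", "P2"),
    target = c("P2", "P3"),
    stringsAsFactors = FALSE
)

# Toy pathway annotations: one row per protein-pathway pair.
pathways <- data.frame(
    uniprot = c("P1", "P2", "P2"),
    pathway = c("Pathway A", "Pathway A", "Pathway B"),
    stringsAsFactors = FALSE
)

# Annotate the source partner; interactions without an annotation
# are kept (all.x = TRUE) and receive NA.
annotated <- merge(
    network,
    setNames(pathways, c("source", "pathway_source")),
    by = "source", all.x = TRUE
)

# Annotate the target partner in the same way.
annotated <- merge(
    annotated,
    setNames(pathways, c("target", "pathway_target")),
    by = "target", all.x = TRUE
)

print(annotated)
```

Note that a protein with several pathways multiplies the rows of its interactions, so the annotated network is usually longer than the original one.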

The ligand/receptor annotations are also available at a finer granularity, specific subclasses from specific resources:

library(OmnipathR)
library(dplyr)

ic_spec <- import_omnipath_intercell(
    aspect = 'functional',
    scope = 'specific',
    source = 'resource_specific'
)
ic_spec %>% filter(database == 'HGNC')
# A tibble: 3,609 × 15
   category     parent database scope    aspect     source            uniprot        genesymbol      entity_type consensus_score transmitter receiver secreted plasma_membrane_transmembrane plasma_membrane_peripheral
   <chr>        <chr>  <chr>    <chr>    <chr>      <chr>             <chr>          <chr>           <chr>                 <dbl> <lgl>       <lgl>    <lgl>    <lgl>                         <lgl>                     
 1 angiopoietin ligand HGNC     specific functional resource_specific Q9UKU9         ANGPTL2         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 2 angiopoietin ligand HGNC     specific functional resource_specific COMPLEX:Q9Y5C1 COMPLEX:ANGPTL3 complex                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 3 angiopoietin ligand HGNC     specific functional resource_specific Q86XS5         ANGPTL5         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 4 angiopoietin ligand HGNC     specific functional resource_specific Q6UXH0         ANGPTL8         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 5 angiopoietin ligand HGNC     specific functional resource_specific COMPLEX:Q9UKU9 COMPLEX:ANGPTL2 complex                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 6 angiopoietin ligand HGNC     specific functional resource_specific Q8NI99         ANGPTL6         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 7 angiopoietin ligand HGNC     specific functional resource_specific Q9BY76         ANGPTL4         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 8 angiopoietin ligand HGNC     specific functional resource_specific Q9Y5C1         ANGPTL3         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
 9 angiopoietin ligand HGNC     specific functional resource_specific O43827         ANGPTL7         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
10 angiopoietin ligand HGNC     specific functional resource_specific O95841         ANGPTL1         protein                   0 TRUE        FALSE    TRUE     FALSE                         FALSE                     
# ℹ 3,599 more rows
# ℹ Use `print(n = ...)` to see more rows

For example, HGNC contains many specific subclasses, e.g. above "angiopoietin" is a class of ligands. These are not exactly pathways, but families sharing a common structure, origin and function.

I hope this helps. Please let me know if you have further questions. I see you opened the same issue at the Python client; I'm answering it here and closing the other one.

Best,

Denes

qijt123 commented 1 year ago

Hi, @deeenes

Thank you very much for your timely reply. Your answer has solved most of my questions, but I still have a few basic ones.

What do 'category' and 'parent' mean, and what is the difference between the two? I would also like to know the difference between 'n_references' and 'n_resources'.

Best

qijt

deeenes commented 1 year ago

Specific categories have generic categories as parents, while each generic category is the parent of itself. All these categories are defined here. The definitions of the terminology are in the EV10 table of our latest paper. The arguments of this function correspond to the attributes included in the table above. As an example, ligand is a generic category: its scope is generic, and its aspect is functional, because acting as a ligand is a molecular or biological function. Its source can be resource_specific, for example "all ligands from UniProt", or composite, if it is the combination of ligands from multiple resources. Categories with specific scope might have ligand as their parent; these are specific subclasses of ligands, e.g. interleukin. Such specific categories are almost always resource_specific regarding their source. See also the Intercellular signaling roles section under the Methods. The same is true for all other categories, such as receptors, transporters, etc.
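The relationship between category and parent can be sketched on a toy table. The rows below are hand-written for illustration, not downloaded from OmniPath; only the `category`, `parent` and `scope` column names follow the output shown earlier in this thread.

```r
# Hand-written toy categories (not actual OmniPath data).
categories <- data.frame(
    category = c("ligand", "interleukin", "angiopoietin", "receptor"),
    parent = c("ligand", "ligand", "ligand", "receptor"),
    scope = c("generic", "specific", "specific", "generic"),
    stringsAsFactors = FALSE
)

# A generic category is its own parent.
is_generic <- categories$category == categories$parent

# Specific subclasses of ligands: parent is "ligand", but the
# category itself is not the generic "ligand".
ligand_subclasses <- categories$category[
    categories$parent == "ligand" & !is_generic
]

print(ligand_subclasses)
```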

n_references and n_resources are simply the counts of unique literature references and resources for each interaction record. These might be indicators of the likelihood that the interaction is correct (but not of the actual strength of the interaction). If your methods require a small network, setting a threshold on these variables might be a way to create a smaller, higher confidence network. These fields are created automatically in OmnipathR after downloading the data, by simply counting the unique values in the sources and references columns.
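The counting described above can be sketched in base R as follows. This is an illustration on toy data, not the OmnipathR implementation: it assumes that the `sources` and `references` columns hold semicolon-separated values, and the helper `count_unique` is a hypothetical name.

```r
# Toy interaction records; the separator and all values are assumptions.
interactions <- data.frame(
    source = c("P1", "P2"),
    target = c("P2", "P3"),
    sources = c("ResA;ResB;ResC", "ResA"),
    references = c("111;222;111", ""),
    stringsAsFactors = FALSE
)

# Hypothetical helper: number of unique non-empty items in each
# delimited string.
count_unique <- function(x, sep = ";") {
    vapply(
        strsplit(x, sep, fixed = TRUE),
        function(items) length(unique(items[nzchar(items)])),
        integer(1)
    )
}

interactions$n_resources <- count_unique(interactions$sources)
interactions$n_references <- count_unique(interactions$references)

# Smaller, higher confidence network: require at least one reference.
high_conf <- interactions[interactions$n_references >= 1, ]

print(interactions[, c("source", "target", "n_resources", "n_references")])
```

The threshold of one reference is only an example; the appropriate cut-off depends on how small a network the downstream method needs.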

qijt123 commented 1 year ago

Hi, @deeenes ,

Thank you very much for your patient answer. Your answer has solved my problem. Thank you very much.

Best

qjt