saezlab / decoupler-py

Python package to perform enrichment analysis from omics data.
https://decoupler-py.readthedocs.io/
GNU General Public License v3.0

Mouse genes #5

Closed laurie-tonon closed 1 year ago

laurie-tonon commented 2 years ago

Hello,

I am trying to analyse a dataset of mouse cells and would like to perform over-representation analyses and trajectory inference with decoupler. However, I can't download resources for mouse (only PROGENy), and the functions complain that the gene identifiers are not the same. Is there a way to use these functions with mouse data?

Thanks a lot

deeenes commented 2 years ago

Hi,

PROGENy and other annotation resources are not yet available for organisms other than human. However, you can easily translate them by orthology. The first run might take a long time because it requires many downloads from Ensembl, HomoloGene and UniProt. Subsequent runs work from cache and take only a few seconds. Here we use the modules pypath and omnipath, which can be installed with pip:

pip install 'git+https://github.com/saezlab/omnipath.git'
pip install 'git+https://github.com/saezlab/pypath.git'

import omnipath
from pypath.utils import homology, mapping

progeny = omnipath.requests.Annotations.get(resources = 'PROGENy', wide = True)
progeny['mouse_uniprot'] = [homology.translate(u, 10090) for u in progeny.uniprot]
progeny = progeny.explode('mouse_uniprot')
progeny['mouse_genesymbol'] = [mapping.label(u, ncbi_tax_id = 10090) for u in progeny.mouse_uniprot]

progeny
#        uniprot genesymbol entity_type   p_value   pathway    weight mouse_uniprot mouse_genesymbol
# 0       P35250       RFC2     protein  0.624086     Trail -0.800677        Q9WUK4             Rfc2
# 1       P35250       RFC2     protein  0.000704   Hypoxia -2.049501        Q9WUK4             Rfc2
# 2       P35250       RFC2     protein  0.001655      EGFR  1.470647        Q9WUK4             Rfc2
# 3       P35250       RFC2     protein  0.833456      TNFa -0.124993        Q9WUK4             Rfc2
# 4       P35250       RFC2     protein  0.630460      TGFb -0.430508        Q9WUK4             Rfc2
# ...        ...        ...         ...       ...       ...       ...           ...              ...
# 233402  Q96A11    GAL3ST3     protein  0.236295      PI3K -0.228038        P61315          Gal3st3
# 233403  Q96A11    GAL3ST3     protein  0.705764  JAK-STAT  0.052601        P61315          Gal3st3
# 233404  Q96A11    GAL3ST3     protein  0.575544      EGFR  0.070407        P61315          Gal3st3
# 233405  Q96A11    GAL3ST3     protein  0.988972     Trail -0.005215        P61315          Gal3st3
# 233406  Q96A11    GAL3ST3     protein  0.607089   Hypoxia  0.063501        P61315          Gal3st3
# 
# [237671 rows x 8 columns]
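
To feed the translated table to decoupler, it still has to be reduced to a long-format network. The block below is a minimal sketch of that step on a toy table, assuming decoupler's usual `source`, `target` and `weight` column names. The `human_to_mouse` dictionary and the `to_mouse_net` helper are hypothetical stand-ins for the pypath lookups above, and the `HUMAN_ONLY` row is invented to show what happens to genes without an ortholog.

```python
import pandas as pd

# Toy stand-in for the human PROGENy table; the first three rows reuse
# values from the output above, the last row is invented for illustration.
progeny = pd.DataFrame({
    'genesymbol': ['RFC2', 'RFC2', 'GAL3ST3', 'HUMAN_ONLY'],
    'pathway': ['Hypoxia', 'EGFR', 'PI3K', 'EGFR'],
    'weight': [-2.049501, 1.470647, -0.228038, 0.5],
})

# Hypothetical orthology lookup; in the real workflow this comes from pypath.
# A gene may map to zero, one or several mouse genes.
human_to_mouse = {'RFC2': ['Rfc2'], 'GAL3ST3': ['Gal3st3']}


def to_mouse_net(table, orthologs):
    """Translate gene symbols and return a source/target/weight network."""
    out = table.copy()
    # One list of orthologs per row; genes without an ortholog get [].
    out['mouse_genesymbol'] = [orthologs.get(g, []) for g in out['genesymbol']]
    # explode() creates one row per ortholog; empty lists become NaN rows,
    # which are dropped here.
    out = out.explode('mouse_genesymbol').dropna(subset=['mouse_genesymbol'])
    net = out.rename(columns={'pathway': 'source', 'mouse_genesymbol': 'target'})
    # Keep a single weight per pathway-gene pair.
    net = net[['source', 'target', 'weight']].drop_duplicates(['source', 'target'])
    return net.reset_index(drop=True)


mouse_net = to_mouse_net(progeny, human_to_mouse)
print(mouse_net)
```

Dropping duplicates keeps the first weight when several human genes map to the same mouse gene; averaging the weights would be another reasonable choice.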

I hope this helps.

Best,

Denes

laurie-tonon commented 2 years ago

Thanks a lot, that should indeed help me. I tried to run your example but it throws an error. I installed pypath-omnipath via pip without problems, but when I try to import with:

from pypath.utils import homology,mapping

I get an error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
Input In [2], in <module>
      1 import omnipath
----> 2 from pypath.utils import homology,mapping

File ~/opt/miniconda3/envs/scanpy/lib/python3.9/site-packages/pypath/utils/homology.py:41, in <module>
     37 import pickle
     39 import timeloop
---> 41 import pypath.utils.mapping as mapping
     42 import pypath.share.common as common
     43 import pypath.internals.intera as intera

File ~/opt/miniconda3/envs/scanpy/lib/python3.9/site-packages/pypath/utils/mapping.py:73, in <module>
     71 import pypath.inputs.uniprot as uniprot_input
     72 import pypath.inputs.pro as pro_input
---> 73 import pypath.inputs.biomart as biomart_input
     74 import pypath.inputs.unichem as unichem_input
     75 import pypath.internals.input_formats as input_formats

File ~/opt/miniconda3/envs/scanpy/lib/python3.9/site-packages/pypath/inputs/biomart.py:36, in <module>
     34 import pypath.share.curl as curl
     35 import pypath.resources.urls as urls
---> 36 import pypath.utils.taxonomy as taxonomy
     38 _logger = session_mod.Logger(name = 'biomart_input')
     41 # for mouse homologues: Filter name = "with_mmusculus_homolog"

File ~/opt/miniconda3/envs/scanpy/lib/python3.9/site-packages/pypath/utils/taxonomy.py:88, in <module>
     49 # XXX: Shouldn't we keep all functions and variables separated
     50 #      (together among them)?
     51 taxids = {
     52     9606: 'human',
     53     10090: 'mouse',
   (...)
     80     9544: 'rhesus macaque',
     81 }
     83 taxids2 = dict(
     84     (
     85         t.taxon_id,
     86         t.common_name.lower()
     87     )
---> 88     for t in ensembl_input.ensembl_organisms()
     89 )
     91 taxa = common.swap_dict_simple(taxids)
     92 taxa2 = common.swap_dict_simple(taxids2)

File ~/opt/miniconda3/envs/scanpy/lib/python3.9/site-packages/pypath/inputs/ensembl.py:52, in ensembl_organisms()
     49 c = curl.Curl(url)
     50 soup = bs4.BeautifulSoup(c.result, 'html.parser')
---> 52 for r in soup.find('table').find_all('tr'):
     54     if not record:
     56         record = collections.namedtuple(
     57             'EnsemblOrganism',
     58             [c.text.lower().replace(' ', '_') for c in r] +
     59             ['ensembl_name']
     60         )

AttributeError: 'NoneType' object has no attribute 'find_all'

I tried clearing the cache in ~/.pypath and rerunning, but it had no effect.

Have you seen this error before?

Thanks

laurie-tonon commented 2 years ago

It seems the problem comes from the Ensembl website. The server at https://www.ensembl.org/info/about/species.html is down, so the package cannot be loaded.

deeenes commented 2 years ago

Yes, the Ensembl server is having issues today. It's up again now but still slow. Unfortunately, without Ensembl the homology translation won't work, but this kind of error doesn't happen often. Ensembl has 4 mirrors, so I may later add an option for choosing a mirror.
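
Until such an option exists, one possible workaround for a short outage is to retry the import after a pause. This is only a sketch: `retry` and `load_pypath_translation` are hypothetical helpers, not part of pypath, and it assumes that a retried import re-runs the module that failed.

```python
import importlib
import time


def retry(action, attempts=3, delay=30.0,
          errors=(AttributeError, OSError), sleep=time.sleep):
    """Call action() until it succeeds or the attempts are used up."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except errors:
            # Re-raise the original error once the last attempt has failed.
            if attempt == attempts:
                raise
            sleep(delay)


def load_pypath_translation():
    """Import the two pypath modules used for orthology translation."""
    homology = importlib.import_module('pypath.utils.homology')
    mapping = importlib.import_module('pypath.utils.mapping')
    return homology, mapping


# Usage (needs pypath installed and network access):
# homology, mapping = retry(load_pypath_translation)
```

The AttributeError in the traceback above is included in `errors` because that is how the unreachable Ensembl page surfaced in this thread; other failure modes may need other exception types.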

chrarnold commented 1 year ago

Hi, is there also an R-only solution to create all necessary databases and networks for the rat genome? pypath seems to have an issue and cannot be installed via Python, and I generally like to stay within R ;). Thanks!

deeenes commented 1 year ago

Hi @chrarnold ,

In R it could look like this:

library(OmnipathR)
library(dplyr)

progeny <- import_omnipath_annotations(resources = 'PROGENy', wide = TRUE)
human_rat <- homologene_uniprot_orthology(target = 10116L, by = genesymbol)

progeny_rat <-
    progeny %>%
    inner_join(human_rat, by = c('uniprot' = 'source')) %>%
    mutate(uniprot = target) %>%
    select(-target, -genesymbol) %>%
    translate_ids(uniprot, genesymbol, organism = 10116L) %>%
    relocate(genesymbol, .after = uniprot)

progeny_rat
# A tibble: 86,038 × 6
   uniprot genesymbol entity_type pathway    weight  p_value
   <chr>   <chr>      <chr>       <chr>       <dbl>    <dbl>
 1 Q641W4  Rfc2       protein     Hypoxia  -2.05    7.04e- 4
 2 Q641W4  Rfc2       protein     TGFb     -0.431   6.30e- 1
 3 Q641W4  Rfc2       protein     NFkB     -0.410   3.72e- 1
 4 Q641W4  Rfc2       protein     p53      -3.35    9.86e- 4
 5 Q641W4  Rfc2       protein     TNFa     -0.125   8.33e- 1
 6 Q641W4  Rfc2       protein     EGFR      1.47    1.66e- 3
 7 Q641W4  Rfc2       protein     Trail    -0.801   6.24e- 1
 8 Q641W4  Rfc2       protein     JAK-STAT  0.00122 9.98e- 1
 9 Q641W4  Rfc2       protein     MAPK      2.28    3.32e-11
10 Q641W4  Rfc2       protein     VEGF     -0.157   8.48e- 1
# … with 86,028 more rows

If you already have OmnipathR installed, please update it to the most recent version (3.4.3 or 3.5.6): due to the recent UniProt URL and API update, the above example won't work with earlier versions.

Best,

Denes

chrarnold commented 1 year ago

Thanks a lot for this, Denes, I think it is helpful for the whole community! In the Bioconductor release, the newest version is currently only 3.4.0, while you mentioned 3.4.3, so I think installation from GitHub via devtools::install_github('saezlab/OmnipathR') is necessary. It worked like a charm.

livyring commented 1 year ago

Is there a way to add the mouse MSigDB database as a resource that the dc.get_resource function can pull from? It exists on the GSEA site but is not built into the wrapper. If this is not possible, how can I load it myself? I am trying to run a functional enrichment analysis of biological terms. Thanks!

deeenes commented 1 year ago

Most of the mouse database knowledge is translated from human by orthology, and I believe MSigDB is no different. They write:

an orthology converted version of these sets is being provided here to allow analysis in the mouse gene-space alongside other, mouse-native, sets

However, they don't say which ones are the mouse-native sets. I think M1 and M8 definitely are, but the rest are more likely to be orthology translated, either by MSigDB or by its primary resources.

There are two options here:

1) Load the human MSigDB and translate to mouse by orthology as shown in my first comment.

2) Use our database builder module pypath to process the MSigDB mouse data, and write a little custom code to extract the desired data frame from the dictionaries provided by pypath. Something like this:

from pypath.inputs import msigdb
import pandas as pd

msigdb_mouse = msigdb.msigdb_annotations(organism = 'mouse')

msigdb_mouse_df = pd.DataFrame(
    [(k,) + a for k, v in msigdb_mouse.items() for a in v],
    columns = ['uniprot', 'collection', 'geneset']
)

msigdb_mouse_df
       uniprot                          collection                                           geneset
0       Q9WVC6                 mirna_targets_mirdb                                        MIR_322_5P
1       Q9WVC6                 mirna_targets_mirdb                                       MIR_497A_5P
2       Q9WVC6  chemical_and_genetic_perturbations                        CADWELL_ATG16L1_TARGETS_DN
3       Q9WVC6  chemical_and_genetic_perturbations                      LEIN_OLIGODENDROCYTE_MARKERS
4       Q9WVC6  chemical_and_genetic_perturbations      GRAESSMANN_APOPTOSIS_BY_SERUM_DEPRIVATION_UP
...        ...                                 ...                                               ...
568859  Q80T03                   reactome_pathways                   REACTOME_O_LINKED_GLYCOSYLATION
568860  Q80T03                   reactome_pathways     REACTOME_TERMINATION_OF_O_GLYCAN_BIOSYNTHESIS
568861  Q80T03                   reactome_pathways         REACTOME_O_LINKED_GLYCOSYLATION_OF_MUCINS
568862  Q80T03                   reactome_pathways  REACTOME_POST_TRANSLATIONAL_PROTEIN_MODIFICATION
568863  Q80T03                   reactome_pathways                   REACTOME_METABOLISM_OF_PROTEINS

[568864 rows x 3 columns]

If you prefer gene symbols instead of UniProts, use the msigdb_download_collections function:

from pypath.inputs import msigdb
import pandas as pd

msigdb_mouse_raw = msigdb.msigdb_download_collections(organism = 'mouse')

msigdb_mouse_raw_df = pd.DataFrame(
    [
        (collname, collcode, gset, gene)
        for (collname, collcode), coll in msigdb_mouse_raw.items()
        for gset, genes in coll.items()
        for gene in genes
    ],
    columns = ['collection', 'code', 'geneset', 'genesymbol']
)

msigdb_mouse_raw_df
                  collection    code                                            geneset genesymbol
0                   hallmark  mh.all                   HALLMARK_TNFA_SIGNALING_VIA_NFKB      Dusp1
1                   hallmark  mh.all                   HALLMARK_TNFA_SIGNALING_VIA_NFKB    Tnfaip3
2                   hallmark  mh.all                   HALLMARK_TNFA_SIGNALING_VIA_NFKB     Sqstm1
3                   hallmark  mh.all                   HALLMARK_TNFA_SIGNALING_VIA_NFKB      Rcan1
4                   hallmark  mh.all                   HALLMARK_TNFA_SIGNALING_VIA_NFKB       Egr2
...                      ...     ...                                                ...        ...
667940  cell_type_signatures  m8.all  TABULA_MURIS_SENISTRACHEA_SMOOTH_MUSCLE_CELL_O...     S100a1
667941  cell_type_signatures  m8.all  TABULA_MURIS_SENISTRACHEA_SMOOTH_MUSCLE_CELL_O...       Jund
667942  cell_type_signatures  m8.all  TABULA_MURIS_SENISTRACHEA_SMOOTH_MUSCLE_CELL_O...        Msn
667943  cell_type_signatures  m8.all  TABULA_MURIS_SENISTRACHEA_SMOOTH_MUSCLE_CELL_O...       Tle5
667944  cell_type_signatures  m8.all  TABULA_MURIS_SENISTRACHEA_SMOOTH_MUSCLE_CELL_O...       Dcxr

[667945 rows x 4 columns]
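
For the enrichment analysis itself, the data frame usually needs to be narrowed to one collection and reshaped into a network. The following is a minimal sketch on a few toy rows, assuming decoupler's usual `source` and `target` column names. `collection_to_net` is a hypothetical helper, the repeated Dusp1 row is added only to show deduplication, and the truncated gene set name is kept exactly as printed above.

```python
import pandas as pd

# Toy rows in the same layout as msigdb_mouse_raw_df above.
msigdb_mouse_raw_df = pd.DataFrame(
    [
        ('hallmark', 'mh.all', 'HALLMARK_TNFA_SIGNALING_VIA_NFKB', 'Dusp1'),
        ('hallmark', 'mh.all', 'HALLMARK_TNFA_SIGNALING_VIA_NFKB', 'Tnfaip3'),
        # Repeated row, added for illustration only.
        ('hallmark', 'mh.all', 'HALLMARK_TNFA_SIGNALING_VIA_NFKB', 'Dusp1'),
        # Gene set name is truncated in the printed output; left as-is.
        ('cell_type_signatures', 'm8.all',
         'TABULA_MURIS_SENISTRACHEA_SMOOTH_MUSCLE_CELL_O...', 'Jund'),
    ],
    columns=['collection', 'code', 'geneset', 'genesymbol'],
)


def collection_to_net(df, collection):
    """Select one collection and return unique geneset-gene pairs."""
    subset = df[df['collection'] == collection]
    net = subset.rename(columns={'geneset': 'source', 'genesymbol': 'target'})
    # Gene sets are unweighted, so only the two key columns are kept.
    net = net[['source', 'target']].drop_duplicates()
    return net.reset_index(drop=True)


hallmark_net = collection_to_net(msigdb_mouse_raw_df, 'hallmark')
print(hallmark_net)
```

Working with one collection at a time keeps the network small and avoids mixing gene sets of very different sizes in a single test.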

Note: by default the c5 and m5 gene set collections (mostly Gene Ontology) are excluded; see the exclude argument. MSigDB recently changed a few things on their web page, and until now the pypath.inputs.msigdb module didn't explicitly support mouse, so I had to update the code in pypath. For this reason, the example above works only with the current head of the master branch (v0.14.17):

pip3 install 'git+https://github.com/saezlab/pypath.git'

livyring commented 1 year ago

When installing pypath with the command above, I received an error:

ERROR: Package 'pypath-omnipath' requires a different Python: 3.8.13 not in '<4.0,>=3.9' Note: you may need to restart the kernel to use updated packages.

Is there a way to use this package without having to downgrade my Python?

deeenes commented 1 year ago

What's your Python version? Not a downgrade, but an upgrade should be necessary. Currently 3.9 is the minimum required version for pypath.
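
A quick way to check the interpreter before installing is sketched below. The 3.9 minimum comes from the error message above; `meets_minimum` is a hypothetical helper written for this example.

```python
import sys

# Minimum Python version required by pypath according to the pip error above.
MINIMUM_PYTHON = (3, 9)


def meets_minimum(version, minimum=MINIMUM_PYTHON):
    """Compare the (major, minor) part of a version against the minimum."""
    return tuple(version[:2]) >= tuple(minimum)


if meets_minimum(sys.version_info):
    print('This Python is new enough for pypath.')
else:
    print('Python %d.%d or later is needed; found %d.%d.' % (
        MINIMUM_PYTHON + tuple(sys.version_info[:2])
    ))
```

Run it with the same interpreter or notebook kernel that pip installs into, since an environment can contain several Python versions.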

PauBadiaM commented 1 year ago

Closing this issue since this is now implemented as the function translate_net in decoupler-1.3.0. Here is a vignette showing how to use it: https://decoupler-py.readthedocs.io/en/latest/notebooks/translate.html