igordot / msigdbr

MSigDB gene sets for multiple organisms in a tidy data format
https://igordot.github.io/msigdbr
Other
70 stars 14 forks source link

No gene sets from KEGG, REACTOME or BIOCARTA #4

Closed enricoferrero closed 5 years ago

enricoferrero commented 5 years ago

It looks like it's no longer possible to get gene sets from KEGG, REACTOME or BIOCARTA:

c2_reactome <- msigdbr(category = "C2", subcategory = "REACTOME") %>%
  split(x = .$gene_symbol, f = .$gs_name)
> length(c2_reactome)
[1] 0

Can these be restored? Thank you.

igordot commented 5 years ago

Those gene sets should be there. The subcategory needs to be CP:REACTOME, not REACTOME, which is how they are labeled by MSigDB: image

enricoferrero commented 5 years ago

Thank you! You're right.

One more thing though: when selecting C2 CP, I only get 252 gene sets instead of 1329, with the ones from KEGG, Reactome and Biocarta missing:

c2_cp <- msigdbr(category = "C2", subcategory = "CP") %>%
  split(x = .$gene_symbol, f = .$gs_name)
length(c2_cp)
[1] 252

Is this by design? I think C2 CP should contain all canonical pathways, not only those that are not part of KEGG, Reactome and Biocarta.

igordot commented 5 years ago

That's a good point. All the subcategories except for CP match the official counts from the MSigDB website:

> msigdbr(category = "C2") %>% distinct(gs_subcat, gs_name) %>% count(gs_subcat)
# A tibble: 5 x 2
  gs_subcat       n
  <chr>       <int>
1 CGP          3433
2 CP            252
3 CP:BIOCARTA   217
4 CP:KEGG       186
5 CP:REACTOME   674

The MSigDB website shows all the CP* gene sets as part of CP. However, when you download the data, each gene set only has a single subcategory listed without a tree-like structure. That is the classification used by msigdbr.

The msigdbr output is a data frame. To capture all the CP* gene sets, you could filter the C2 table manually:

> msigdbr(category = "C2") %>% filter(gs_subcat != "CGP") %>% split(x = .$gene_symbol, f = .$gs_name) %>% length()
[1] 1329
> msigdbr(category = "C2") %>% filter(grepl("CP", gs_subcat)) %>% split(x = .$gene_symbol, f = .$gs_name) %>% length()
[1] 1329
> msigdbr(category = "C2") %>% filter(stringr::str_detect(gs_subcat, "CP")) %>% split(x = .$gene_symbol, f = .$gs_name) %>% length()
[1] 1329
rauldiul commented 5 years ago

Hi,

sorry to expand on this question. Could you provide some clarification on the clean up made for human genes with 1:many orthologs?

In the Details section of the vignette, a statement says:

For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.

However, if I load for example the hallmark categories for human and mouse, msigdbr(species = "Homo sapiens", category = "H"); msigdbr(species = "Mus musculus", category = "H"), there are some pathways with differing number of genes. For example, HALLMARK_ALLOGRAFT_REJECTION has 205 genes for Mouse and 200 for Human.

Can this be because some human genes had two (or more) equally supported mouse orthologs in HUGO/HCOP, so both are present in the mouse gene set? or is there another reason?

thanks for your help, and for the package

igordot commented 5 years ago

Can this be because some human genes had two (or more) equally supported mouse orthologs in HUGO/HCOP, so both are present in the mouse gene set?

Yes.

The output data frame contains both the original human symbol and the ortholog, so you could always check:

> mm = msigdbr(species = "Mus musculus", category = "H")
> mm %>%
  filter(gs_name == "HALLMARK_ALLOGRAFT_REJECTION") %>%
  select(gs_name, human_gene_symbol, gene_symbol, sources) %>%
  add_count(human_gene_symbol) %>%
  arrange(-n)
# A tibble: 205 x 5
   gs_name        human_gene_symb… gene_symbol sources                          n
   <chr>          <chr>            <chr>       <chr>                        <int>
 1 HALLMARK_ALLO… HLA-A            H2-D1       PhylomeDB,PhylomeDB,HomoloG…     6
 2 HALLMARK_ALLO… HLA-A            H2-Q1       PhylomeDB,PhylomeDB,HomoloG…     6
 3 HALLMARK_ALLO… HLA-A            H2-Q10      PhylomeDB,PhylomeDB,HomoloG…     6
 4 HALLMARK_ALLO… HLA-A            H2-Q2       PhylomeDB,PhylomeDB,HomoloG…     6
 5 HALLMARK_ALLO… HLA-A            H2-Q4       PhylomeDB,PhylomeDB,HomoloG…     6
 6 HALLMARK_ALLO… HLA-A            H2-Q7       PhylomeDB,PhylomeDB,HomoloG…     6
 7 HALLMARK_ALLO… AARS             Aars        Inparanoid,PhylomeDB,Homolo…     1
 8 HALLMARK_ALLO… ABCE1            Abce1       Inparanoid,PhylomeDB,Homolo…     1
 9 HALLMARK_ALLO… ABI1             Abi1        Inparanoid,PhylomeDB,Homolo…     1
10 HALLMARK_ALLO… ACHE             Ache        Inparanoid,PhylomeDB,Homolo…     1
# … with 195 more rows

For this example, you can see that human HLA-A has multiple mouse equivalents.