Closed enricoferrero closed 5 years ago
Those gene sets should be there. The subcategory needs to be CP:REACTOME
, not REACTOME
, which is how they are labeled by MSigDB:
Thank you! You're right.
One more thing though: when selecting C2 CP, I only get 252 gene sets instead of 1329, with the ones from KEGG, Reactome and Biocarta missing:
c2_cp <- msigdbr(category = "C2", subcategory = "CP") %>%
split(x = .$gene_symbol, f = .$gs_name)
length(c2_cp)
[1] 252
Is this by design? I think C2 CP should contain all canonical pathways, not only those that are not part of KEGG, Reactome and Biocarta.
That's a good point. All the subcategories except for CP
match the official counts from the MSigDB website:
> msigdbr(category = "C2") %>% distinct(gs_subcat, gs_name) %>% count(gs_subcat)
# A tibble: 5 x 2
gs_subcat n
<chr> <int>
1 CGP 3433
2 CP 252
3 CP:BIOCARTA 217
4 CP:KEGG 186
5 CP:REACTOME 674
The MSigDB website shows all the CP*
gene sets as part of CP
. However, when you download the data, each gene set only has a single subcategory listed without a tree-like structure. That is the classification used by msigdbr
.
The msigdbr
output is a data frame. To capture all the CP*
gene sets, you could filter the C2
table manually:
> msigdbr(category = "C2") %>% filter(gs_subcat != "CGP") %>% split(x = .$gene_symbol, f = .$gs_name) %>% length()
[1] 1329
> msigdbr(category = "C2") %>% filter(grepl("CP", gs_subcat)) %>% split(x = .$gene_symbol, f = .$gs_name) %>% length()
[1] 1329
> msigdbr(category = "C2") %>% filter(stringr::str_detect(gs_subcat, "CP")) %>% split(x = .$gene_symbol, f = .$gs_name) %>% length()
[1] 1329
Hi,
sorry to expand on this question. Could you provide some clarification on the clean up made for human genes with 1:many orthologs?
In the Details section of the vignette, a statement says:
For each human equivalent within each species, only the ortholog supported by the largest number of databases is used.
However, if I load for example the hallmark categories for human and mouse,
msigdbr(species = "Homo sapiens", category = "H"); msigdbr(species = "Mus musculus", category = "H")
, there are some pathways with differing number of genes. For example, HALLMARK_ALLOGRAFT_REJECTION
has 205 genes for Mouse and 200 for Human.
Can this be because some human genes had two (or more) equally supported mouse orthologs in HUGO/HCOP, so both are present in the mouse gene set? or is there another reason?
thanks for your help, and for the package
Can this be because some human genes had two (or more) equally supported mouse orthologs in HUGO/HCOP, so both are present in the mouse gene set?
Yes.
The output data frame contains both the original human symbol and the ortholog, so you could always check:
> mm = msigdbr(species = "Mus musculus", category = "H")
> mm %>%
filter(gs_name == "HALLMARK_ALLOGRAFT_REJECTION") %>%
select(gs_name, human_gene_symbol, gene_symbol, sources) %>%
add_count(human_gene_symbol) %>%
arrange(-n)
# A tibble: 205 x 5
gs_name human_gene_symb… gene_symbol sources n
<chr> <chr> <chr> <chr> <int>
1 HALLMARK_ALLO… HLA-A H2-D1 PhylomeDB,PhylomeDB,HomoloG… 6
2 HALLMARK_ALLO… HLA-A H2-Q1 PhylomeDB,PhylomeDB,HomoloG… 6
3 HALLMARK_ALLO… HLA-A H2-Q10 PhylomeDB,PhylomeDB,HomoloG… 6
4 HALLMARK_ALLO… HLA-A H2-Q2 PhylomeDB,PhylomeDB,HomoloG… 6
5 HALLMARK_ALLO… HLA-A H2-Q4 PhylomeDB,PhylomeDB,HomoloG… 6
6 HALLMARK_ALLO… HLA-A H2-Q7 PhylomeDB,PhylomeDB,HomoloG… 6
7 HALLMARK_ALLO… AARS Aars Inparanoid,PhylomeDB,Homolo… 1
8 HALLMARK_ALLO… ABCE1 Abce1 Inparanoid,PhylomeDB,Homolo… 1
9 HALLMARK_ALLO… ABI1 Abi1 Inparanoid,PhylomeDB,Homolo… 1
10 HALLMARK_ALLO… ACHE Ache Inparanoid,PhylomeDB,Homolo… 1
# … with 195 more rows
For this example, you can see that human HLA-A has multiple mouse equivalents.
It looks like it's no longer possible to get gene sets from KEGG, REACTOME or BIOCARTA:
Can these be restored? Thank you.