Closed: kotliary closed this issue 3 years ago
By the way, the mdf data frame can easily be converted to a list with the base split function:
mdf <- msigdbr(species, category, subcategory) %>%
dplyr::select(gs_name, gene_symbol) %>%
distinct()
gsets <- split(mdf$gene_symbol, mdf$gs_name)
Note that the gene set names will become list names automatically.
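For anyone who wants to check the naming behaviour without downloading anything, here is a minimal sketch using a toy data frame. The set names and gene symbols are made up and only stand in for the real msigdbr output; the column names follow the snippet above.

```r
# Toy stand-in for the msigdbr output; set names and symbols are made up.
mdf <- data.frame(
  gs_name     = c("SET_A", "SET_A", "SET_B"),
  gene_symbol = c("TP53", "EGFR", "MYC"),
  stringsAsFactors = FALSE
)

# split() groups the first vector by the second and names the
# resulting list elements after the groups.
gsets <- split(mdf$gene_symbol, mdf$gs_name)

names(gsets)
#> [1] "SET_A" "SET_B"
gsets$SET_A
#> [1] "TP53" "EGFR"
```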
Thanks, good catch! Will fix ASAP. Fortunately this bug ends up getting canceled out during the enrichment steps where both the signature and genesets are reduced to unique elements.
signature <- unique(signature)
genesets <- lapply(genesets, unique)
Yes, it's true. The reduce method also removes duplicates.
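To illustrate why the duplicates cancel out, here is a small sketch with made-up inputs. It only mirrors the two unique() calls quoted above and is not the package's actual enrichment code.

```r
# Made-up inputs containing a repeated symbol in both the signature
# and the gene set.
signature <- c("GENE1", "GENE1", "GENE2")
genesets  <- list(SET_A = c("GENE1", "GENE1", "GENE3"))

# The same reduction steps quoted above.
signature <- unique(signature)
genesets  <- lapply(genesets, unique)

# Set size and overlap are now counted on unique symbols only.
lengths(genesets)
#> SET_A 
#>     2
length(intersect(signature, genesets$SET_A))
#> [1] 1
```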
What version of msigdb are you using?
My version doesn't seem to have duplicated gene symbols in any of the geneset collections I've tested.
hypeR::msigdb_version()
#> [1] "v7.2.1"
g <- hypeR::msigdb_download("Homo sapiens", category="H")
table(duplicated(g$HALLMARK_APICAL_JUNCTION))
#>
#> FALSE
#> 200
Created on 2021-08-23 by the reprex package (v2.0.0)
I use 7.4.1. I believe they added Ensembl ID output in the latest version, around May this year.
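A rough sketch of how the extra ID column can produce duplicated symbols, using hypothetical rows. The Ensembl IDs and symbols are made up, the ensembl_gene column name is an assumption about the msigdbr output, and base unique() is used here in place of dplyr::distinct() so that the example runs without extra packages.

```r
# Hypothetical rows: one symbol mapped to two Ensembl IDs (IDs are made up).
mdf <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_A"),
  gene_symbol  = c("GENE1", "GENE1", "GENE2"),
  ensembl_gene = c("ENSG_X1", "ENSG_X2", "ENSG_X3"),
  stringsAsFactors = FALSE
)

# Keeping only the set name and symbol columns leaves the repeated
# symbol in place.
sym <- mdf[, c("gs_name", "gene_symbol")]
sum(duplicated(sym))
#> [1] 1

# Base R counterpart of dplyr::distinct(): drop repeated rows.
sym_unique <- unique(sym)
nrow(sym_unique)
#> [1] 2
```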
Fixed, thank you.
Currently the msigdbr package includes Ensembl IDs in the output data frame of gene sets. Since some genes have multiple corresponding Ensembl IDs, your msigdb_download function returns duplicated genes in some gene sets. For example, check HALLMARK_APICAL_JUNCTION (Human, H category).
You just need to add distinct() %>% after the second line below in db_msig.R (lines 148-151):