montilab / hypeR

An R Package for Geneset Enrichment Workflows
https://montilab.github.io/hypeR-docs/
GNU General Public License v3.0

Duplicated gene symbols in the output of msigdb_download #34

Closed kotliary closed 3 years ago

kotliary commented 3 years ago

Currently the msigdbr package includes Ensembl IDs in the output data frame of gene sets. Since there are multiple Ensembl IDs corresponding to some genes, your msigdb_download function returns duplicated genes in some gene sets. For example, check HALLMARK_APICAL_JUNCTION (Human, H category).

You just need to add distinct() %>% after the second line of the snippet below, from db_msig.R (lines 148-151):

    mdf <- msigdbr(species, category, subcategory) %>%
           dplyr::select(gs_name, gene_symbol) %>%
           as.data.frame() %>%
           stats::aggregate(gene_symbol ~ gs_name, data=., c)
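
To illustrate the effect on a small scale, here is a minimal sketch in base R. The data frame `toy`, its gene symbols and its Ensembl IDs are made up for illustration and only stand in for the msigdbr output; `unique()` on the two selected columns plays the role that `distinct()` would play in the pipeline above.

```r
# Hypothetical stand-in for msigdbr output: GENE1 maps to two Ensembl IDs,
# so it appears twice once the Ensembl column is dropped.
toy <- data.frame(
  gs_name      = c("SET_A", "SET_A", "SET_A", "SET_B"),
  gene_symbol  = c("GENE1", "GENE1", "GENE2", "GENE3"),
  ensembl_gene = c("ENSG01", "ENSG02", "ENSG03", "ENSG04"),
  stringsAsFactors = FALSE
)

# Equivalent of dplyr::select(gs_name, gene_symbol)
sel <- toy[, c("gs_name", "gene_symbol")]

# Number of rows per gene set before and after dropping repeated rows
n_raw   <- sum(sel$gs_name == "SET_A")          # GENE1 counted twice
n_dedup <- sum(unique(sel)$gs_name == "SET_A")  # each symbol counted once

n_raw    # 3
n_dedup  # 2
```

The deduplication has to happen after the Ensembl column is dropped, because the rows only become identical at that point.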

kotliary commented 3 years ago

By the way, the mdf data frame can easily be converted to a list with the base split function:

mdf <- msigdbr(species, category, subcategory) %>%
    dplyr::select(gs_name, gene_symbol) %>%
    distinct()
gsets <- split(mdf$gene_symbol, mdf$gs_name)

Note that the gene set names will become list names automatically.
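
A small sketch of that behavior, using a made-up data frame in place of the real msigdbr result (the set names and gene symbols here are hypothetical):

```r
# Hypothetical, already-deduplicated two-column data frame
mdf <- data.frame(
  gs_name     = c("SET_A", "SET_A", "SET_B"),
  gene_symbol = c("GENE1", "GENE2", "GENE3"),
  stringsAsFactors = FALSE
)

# split() groups the gene symbols by gene set name
gsets <- split(mdf$gene_symbol, mdf$gs_name)

names(gsets)  # "SET_A" "SET_B"
gsets$SET_A   # "GENE1" "GENE2"
```

The result is a plain named list of character vectors, one element per gene set.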

anfederico commented 3 years ago

Thanks, good catch! Will fix ASAP. Fortunately this bug ends up getting canceled out during the enrichment steps where both the signature and genesets are reduced to unique elements.

signature <- unique(signature)
genesets <- lapply(genesets, unique)
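
As a rough sketch of why the duplicates cancel out, the toy example below applies the same two lines to made-up data and then counts sizes and overlaps. The signature, gene sets and the overlap computation are assumptions for illustration, not the actual hypeR internals.

```r
# Hypothetical inputs containing duplicated symbols
signature <- c("GENE1", "GENE1", "GENE2", "GENE5")
genesets <- list(
  SET_A = c("GENE1", "GENE1", "GENE2"),
  SET_B = c("GENE3")
)

# Same reduction as in the two lines above
signature <- unique(signature)
genesets <- lapply(genesets, unique)

# Sizes and overlaps are now based on unique symbols only
set_sizes <- lengths(genesets)
overlap <- vapply(
  genesets,
  function(gs) length(intersect(signature, gs)),
  integer(1)
)

length(signature)  # 3
set_sizes          # SET_A = 2, SET_B = 1
overlap            # SET_A = 2, SET_B = 0
```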

kotliary commented 3 years ago

Yes, it's true. The reduce method also removes duplicates.

anfederico commented 3 years ago

What version of msigdb are you using?

My version doesn't seem to have duplicated gene symbols in any of the geneset collections I've tested.

hypeR::msigdb_version()
#> [1] "v7.2.1"
g <- hypeR::msigdb_download("Homo sapiens", category="H")
table(duplicated(g$HALLMARK_APICAL_JUNCTION))
#> 
#> FALSE 
#>   200

Created on 2021-08-23 by the reprex package (v2.0.0)
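
To check every gene set in a collection rather than a single one, something along these lines might work. It is a sketch on a made-up list `g`; with real data, `g` would be the list returned by the msigdb_download call in the reprex above.

```r
# Hypothetical collection: one clean gene set, one with a repeated symbol
g <- list(
  HALLMARK_X = c("GENE1", "GENE2"),
  HALLMARK_Y = c("GENE3", "GENE3", "GENE4")
)

# anyDuplicated() returns 0 when a vector has no repeated elements
has_dups <- vapply(g, function(gs) anyDuplicated(gs) > 0, logical(1))

# Names of the gene sets that contain duplicated symbols
names(which(has_dups))  # "HALLMARK_Y"
```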

kotliary commented 3 years ago

I use 7.4.1. I believe they added the Ensembl ID output in the latest version, around May this year.

anfederico commented 3 years ago

Fixed, thank you.