Closed apcamargo closed 3 years ago
Could you provide an example of your input data and code? The weights should have an effect on the results.
Sure!
I must say that I'm not working with genes and pathways. Instead, I'm using ASVs (taxonomic marker sequences) and taxonomies.
# Creates a list associating ASVs (the analogous for genes in this dataset) to their taxonomies (the analogous for pathway)
taxonomy <- read_tsv("asvtax_16S.txt") %>%
rename(ASV = `#OTU ID`) %>%
select(ASV, Family) %>%
deframe()
# Loads the differential abundance result and adds a column containing the taxonomy of the row's ASV
aldex2_result <- read_tsv("differential_abundance_16S.txt") %>%
mutate(taxonomy = as.vector(taxonomy[ASV]))
# Creates the geneset as a list containing vectors of the ASVs belonging to each family
genesets <- list()
taxa_vector <- unique(pull(aldex2_result, taxonomy))
for (taxon in taxa_vector) {
genesets[[taxon]] <- filter(aldex2_result, taxonomy == taxon)$ASV
}
# Creates the signatures and hypeR objects
signature_weighted <- aldex2_result %>% arrange(desc(effect)) %>% select(ASV, effect) %>% deframe()
signature_unweighted <- aldex2_result %>% arrange(desc(effect)) %>% pull(ASV)
hyp_obj_weighted <- hypeR(signature_weighted, genesets, test="kstest", fdr=0.01, do_plots=TRUE)
hyp_obj_unweighted <- hypeR(signature_unweighted, genesets, test="kstest", fdr=0.01, do_plots=TRUE)
print(hyp_obj_weighted$data)
label pval fdr gset.size genes.found score
Spirosomaceae Spirosomaceae 5.1e-07 4.3e-05 19 19 0.62
Hymenobacteraceae Hymenobacteraceae 3.6e-06 1.5e-04 20 20 0.57
Nostocaceae Nostocaceae 3.9e-05 1.1e-03 22 22 0.48
Burkholderiaceae Burkholderiaceae 1.3e-04 2.7e-03 41 41 0.34
print(hyp_obj_unweighted$data)
label pval fdr gset.size genes.found score
Spirosomaceae Spirosomaceae 5.1e-07 4.3e-05 19 19 0.62
Hymenobacteraceae Hymenobacteraceae 3.6e-06 1.5e-04 20 20 0.57
Nostocaceae Nostocaceae 3.9e-05 1.1e-03 22 22 0.48
Burkholderiaceae Burkholderiaceae 1.3e-04 2.7e-03 41 41 0.34
The plots, however, are different:
Weighted:
Unweighted:
There are some other cases where the plots are very different, but the the data in hyp_obj$data
will always correspond to the unweighted version.
sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS/LAPACK: /strg1/users/antoniop.camargo/.environments/community-analysis/lib/libopenblasp-r0.3.10.so
locale:
[1] LC_CTYPE=pt_BR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=pt_BR.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=pt_BR.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=pt_BR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] hypeR_1.2.0 forcats_0.5.0 stringr_1.4.0 dplyr_1.0.2
[5] purrr_0.3.4 readr_1.3.1 tidyr_1.1.2 tibble_3.0.3
[9] ggplot2_3.3.2 tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lubridate_1.7.9 visNetwork_2.0.9 assertthat_0.2.1
[5] digest_0.6.25 ggforce_0.2.2 IRdisplay_0.7.0 R6_2.4.1
[9] cellranger_1.1.0 repr_1.1.0 backports_1.1.10 evaluate_0.14
[13] httr_1.4.2 pillar_1.4.6 rlang_0.4.7 uuid_0.1-4
[17] readxl_1.3.1 rstudioapi_0.11 DT_0.15 rmarkdown_2.4
[21] labeling_0.3 webshot_0.5.2 htmlwidgets_1.5.1 igraph_1.2.5
[25] polyclip_1.10-0 munsell_0.5.0 broom_0.7.0 compiler_3.6.3
[29] modelr_0.1.8 xfun_0.18 pkgconfig_2.0.3 base64enc_0.1-3
[33] htmltools_0.5.0 tidyselect_1.1.0 fansi_0.4.1 viridisLite_0.3.0
[37] crayon_1.3.4 withr_2.3.0 MASS_7.3-53 grid_3.6.3
[41] jsonlite_1.7.1 gtable_0.3.0 lifecycle_0.2.0 magrittr_1.5
[45] scales_1.1.1 zip_2.1.1 cli_2.0.2 stringi_1.5.3
[49] msigdbr_7.1.1 farver_2.0.3 xml2_1.3.2 ellipsis_0.3.1
[53] generics_0.0.2 vctrs_0.3.4 openxlsx_4.2.2 gh_1.1.0
[57] IRkernel_1.1.1 kableExtra_1.2.1 tools_3.6.3 glue_1.4.2
[61] tweenr_1.0.1 hms_0.5.3 colorspace_1.4-1 rvest_0.3.6
[65] pbdZMQ_0.3-3 knitr_1.30 haven_2.3.1
I experimented changing weights_pwr
. The plots are altered, but the results remain the same.
The p-value calculation does not use the weights, since it is based on the standard ks.test
function, Although the score should be weighted, but wasn't being copied over to the hyp
data component. I just pushed a fix. Try to install the latest version of hypeR through Github.
library(hypeR)
testdat <- readRDS(file.path(system.file("extdata", package="hypeR"), "testdat.rds"))
genesets <- testdat$gsets$genesets
signature <- names(testdat$weighted_signature)
head(signature)
#> [1] "A" "B" "C" "D" "E" "F"
ks_unweighted <- hypeR(signature, genesets, test="kstest", plotting=TRUE)
ks_unweighted$data
#> label pval fdr signature geneset overlap score
#> Leaf Geneset 8 Leaf Geneset 8 0.46 1 26 12 12 0.22
#> Leaf Geneset 1 Leaf Geneset 1 0.64 1 26 10 10 0.18
#> Leaf Geneset 9 Leaf Geneset 9 0.76 1 26 11 11 0.13
#> Leaf Geneset 7 Leaf Geneset 7 0.88 1 26 14 14 -0.19
#> Leaf Geneset 3 Leaf Geneset 3 0.92 1 26 10 10 -0.13
#> Leaf Geneset 6 Leaf Geneset 6 0.95 1 26 10 10 -0.22
#> Leaf Geneset 10 Leaf Geneset 10 0.96 1 26 12 12 -0.12
#> Leaf Geneset 5 Leaf Geneset 5 0.98 1 26 15 15 -0.12
#> Leaf Geneset 2 Leaf Geneset 2 1.00 1 26 15 15 -0.11
#> Leaf Geneset 4 Leaf Geneset 4 1.00 1 26 13 13 -0.15
ks_unweighted$plots$`Leaf Geneset 8`
signature <- testdat$weighted_signature
head(signature)
#> A B C D E F
#> 1.4559884 1.2993123 1.2560188 1.2540831 1.1009691 0.9969869
ks_weighted <- hypeR(signature, genesets, test="kstest", plotting=TRUE)
ks_weighted$data
#> label pval fdr signature geneset overlap score
#> Leaf Geneset 8 Leaf Geneset 8 0.46 1 26 12 12 0.48
#> Leaf Geneset 1 Leaf Geneset 1 0.64 1 26 10 10 0.49
#> Leaf Geneset 9 Leaf Geneset 9 0.76 1 26 11 11 0.42
#> Leaf Geneset 7 Leaf Geneset 7 0.88 1 26 14 14 -0.42
#> Leaf Geneset 3 Leaf Geneset 3 0.92 1 26 10 10 -0.24
#> Leaf Geneset 6 Leaf Geneset 6 0.95 1 26 10 10 -0.45
#> Leaf Geneset 10 Leaf Geneset 10 0.96 1 26 12 12 -0.25
#> Leaf Geneset 5 Leaf Geneset 5 0.98 1 26 15 15 -0.34
#> Leaf Geneset 2 Leaf Geneset 2 1.00 1 26 15 15 -0.30
#> Leaf Geneset 4 Leaf Geneset 4 1.00 1 26 13 13 -0.44
ks_weighted$plots$`Leaf Geneset 8`
Created on 2020-10-01 by the reprex package (v0.3.0)
Solved :)
A somewhat unrelated question: do you think it's "safe" to order/filter pathways based on the score? I have lots pathways with a large number of genes, so even a slight deviation from the null hypothesis is significant and I get a long list of pathways. If I set a more stringent FDR cut-off (1e-10, for example), some pathways with a high score but less genes are filtered out. Therefore, I'm considering using both the score and FDR to select enriched pathways. It would be good to know your thoughts on this approach.
I'm executing
hyperR
with a custom geneset and"kstest"
weighted by the effect size of a differential expression test. I've noticed, however, that the results I get inhyp_obj$data
and the dots plot remains the same whether or not my signature has gene weights.The only thing that changes if I use an weighted input are the "Running Enrichment Score vs. Position in Ranked List of Genes" plot.
Is this behaviour expected? Shouldn't the gene weights affect the results of the statistical test?