Weighted and unweighted "kstest" giving the same results

apcamargo commented 4 years ago

I'm executing hyperR with a custom geneset and "kstest" weighted by the effect size of a differential expression test. I've noticed, however, that the results I get in hyp_obj$data and the dots plot remains the same whether or not my signature has gene weights.

The only thing that changes if I use an weighted input are the "Running Enrichment Score vs. Position in Ranked List of Genes" plot.

Is this behaviour expected? Shouldn't the gene weights affect the results of the statistical test?

anfederico commented 4 years ago

Could you provide an example of your input data and code? The weights should have an effect on the results.

apcamargo commented 4 years ago

Sure!

I must say that I'm not working with genes and pathways. Instead, I'm using ASVs (taxonomic marker sequences) and taxonomies.

# Creates a list associating ASVs (the analogous for genes in this dataset) to their taxonomies (the analogous for pathway)
taxonomy <- read_tsv("asvtax_16S.txt") %>%
  rename(ASV = `#OTU ID`) %>%
  select(ASV, Family) %>%
  deframe()

# Loads the differential abundance result and adds a column containing the taxonomy of the row's ASV
aldex2_result <- read_tsv("differential_abundance_16S.txt") %>%
  mutate(taxonomy = as.vector(taxonomy[ASV]))

# Creates the geneset as a list containing vectors of the ASVs belonging to each family
genesets <- list()
taxa_vector <- unique(pull(aldex2_result, taxonomy))
for (taxon in taxa_vector) {
    genesets[[taxon]] <- filter(aldex2_result, taxonomy == taxon)$ASV
}

# Creates the signatures and hypeR objects
signature_weighted <- aldex2_result %>% arrange(desc(effect)) %>% select(ASV, effect) %>% deframe()
signature_unweighted <- aldex2_result %>% arrange(desc(effect)) %>% pull(ASV)
hyp_obj_weighted <- hypeR(signature_weighted, genesets, test="kstest", fdr=0.01, do_plots=TRUE)
hyp_obj_unweighted <- hypeR(signature_unweighted, genesets, test="kstest", fdr=0.01, do_plots=TRUE)

print(hyp_obj_weighted$data)
                              label    pval     fdr gset.size genes.found score
Spirosomaceae         Spirosomaceae 5.1e-07 4.3e-05        19          19  0.62
Hymenobacteraceae Hymenobacteraceae 3.6e-06 1.5e-04        20          20  0.57
Nostocaceae             Nostocaceae 3.9e-05 1.1e-03        22          22  0.48
Burkholderiaceae   Burkholderiaceae 1.3e-04 2.7e-03        41          41  0.34

print(hyp_obj_unweighted$data)
                              label    pval     fdr gset.size genes.found score
Spirosomaceae         Spirosomaceae 5.1e-07 4.3e-05        19          19  0.62
Hymenobacteraceae Hymenobacteraceae 3.6e-06 1.5e-04        20          20  0.57
Nostocaceae             Nostocaceae 3.9e-05 1.1e-03        22          22  0.48
Burkholderiaceae   Burkholderiaceae 1.3e-04 2.7e-03        41          41  0.34

The plots, however, are different:

Weighted:

Unweighted:

There are some other cases where the plots are very different, but the the data in hyp_obj$data will always correspond to the unweighted version.

sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-conda_cos6-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS/LAPACK: /strg1/users/antoniop.camargo/.environments/community-analysis/lib/libopenblasp-r0.3.10.so

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] hypeR_1.2.0     forcats_0.5.0   stringr_1.4.0   dplyr_1.0.2    
 [5] purrr_0.3.4     readr_1.3.1     tidyr_1.1.2     tibble_3.0.3   
 [9] ggplot2_3.3.2   tidyverse_1.2.1

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        lubridate_1.7.9   visNetwork_2.0.9  assertthat_0.2.1 
 [5] digest_0.6.25     ggforce_0.2.2     IRdisplay_0.7.0   R6_2.4.1         
 [9] cellranger_1.1.0  repr_1.1.0        backports_1.1.10  evaluate_0.14    
[13] httr_1.4.2        pillar_1.4.6      rlang_0.4.7       uuid_0.1-4       
[17] readxl_1.3.1      rstudioapi_0.11   DT_0.15           rmarkdown_2.4    
[21] labeling_0.3      webshot_0.5.2     htmlwidgets_1.5.1 igraph_1.2.5     
[25] polyclip_1.10-0   munsell_0.5.0     broom_0.7.0       compiler_3.6.3   
[29] modelr_0.1.8      xfun_0.18         pkgconfig_2.0.3   base64enc_0.1-3  
[33] htmltools_0.5.0   tidyselect_1.1.0  fansi_0.4.1       viridisLite_0.3.0
[37] crayon_1.3.4      withr_2.3.0       MASS_7.3-53       grid_3.6.3       
[41] jsonlite_1.7.1    gtable_0.3.0      lifecycle_0.2.0   magrittr_1.5     
[45] scales_1.1.1      zip_2.1.1         cli_2.0.2         stringi_1.5.3    
[49] msigdbr_7.1.1     farver_2.0.3      xml2_1.3.2        ellipsis_0.3.1   
[53] generics_0.0.2    vctrs_0.3.4       openxlsx_4.2.2    gh_1.1.0         
[57] IRkernel_1.1.1    kableExtra_1.2.1  tools_3.6.3       glue_1.4.2       
[61] tweenr_1.0.1      hms_0.5.3         colorspace_1.4-1  rvest_0.3.6      
[65] pbdZMQ_0.3-3      knitr_1.30        haven_2.3.1

apcamargo commented 4 years ago

I experimented changing weights_pwr. The plots are altered, but the results remain the same.

anfederico commented 4 years ago

The p-value calculation does not use the weights, since it is based on the standard ks.test function, Although the score should be weighted, but wasn't being copied over to the hyp data component. I just pushed a fix. Try to install the latest version of hypeR through Github.

library(hypeR)

testdat <- readRDS(file.path(system.file("extdata", package="hypeR"), "testdat.rds"))
genesets <- testdat$gsets$genesets

signature <- names(testdat$weighted_signature)
head(signature)
#> [1] "A" "B" "C" "D" "E" "F"
ks_unweighted <- hypeR(signature, genesets, test="kstest", plotting=TRUE)
ks_unweighted$data
#>                           label pval fdr signature geneset overlap score
#> Leaf Geneset 8   Leaf Geneset 8 0.46   1        26      12      12  0.22
#> Leaf Geneset 1   Leaf Geneset 1 0.64   1        26      10      10  0.18
#> Leaf Geneset 9   Leaf Geneset 9 0.76   1        26      11      11  0.13
#> Leaf Geneset 7   Leaf Geneset 7 0.88   1        26      14      14 -0.19
#> Leaf Geneset 3   Leaf Geneset 3 0.92   1        26      10      10 -0.13
#> Leaf Geneset 6   Leaf Geneset 6 0.95   1        26      10      10 -0.22
#> Leaf Geneset 10 Leaf Geneset 10 0.96   1        26      12      12 -0.12
#> Leaf Geneset 5   Leaf Geneset 5 0.98   1        26      15      15 -0.12
#> Leaf Geneset 2   Leaf Geneset 2 1.00   1        26      15      15 -0.11
#> Leaf Geneset 4   Leaf Geneset 4 1.00   1        26      13      13 -0.15
ks_unweighted$plots$`Leaf Geneset 8`


signature <- testdat$weighted_signature
head(signature)
#>         A         B         C         D         E         F 
#> 1.4559884 1.2993123 1.2560188 1.2540831 1.1009691 0.9969869
ks_weighted <- hypeR(signature, genesets, test="kstest", plotting=TRUE)
ks_weighted$data
#>                           label pval fdr signature geneset overlap score
#> Leaf Geneset 8   Leaf Geneset 8 0.46   1        26      12      12  0.48
#> Leaf Geneset 1   Leaf Geneset 1 0.64   1        26      10      10  0.49
#> Leaf Geneset 9   Leaf Geneset 9 0.76   1        26      11      11  0.42
#> Leaf Geneset 7   Leaf Geneset 7 0.88   1        26      14      14 -0.42
#> Leaf Geneset 3   Leaf Geneset 3 0.92   1        26      10      10 -0.24
#> Leaf Geneset 6   Leaf Geneset 6 0.95   1        26      10      10 -0.45
#> Leaf Geneset 10 Leaf Geneset 10 0.96   1        26      12      12 -0.25
#> Leaf Geneset 5   Leaf Geneset 5 0.98   1        26      15      15 -0.34
#> Leaf Geneset 2   Leaf Geneset 2 1.00   1        26      15      15 -0.30
#> Leaf Geneset 4   Leaf Geneset 4 1.00   1        26      13      13 -0.44
ks_weighted$plots$`Leaf Geneset 8`

^{Created on 2020-10-01 by the reprex package (v0.3.0)}

apcamargo commented 4 years ago

Solved :)

A somewhat unrelated question: do you think it's "safe" to order/filter pathways based on the score? I have lots pathways with a large number of genes, so even a slight deviation from the null hypothesis is significant and I get a long list of pathways. If I set a more stringent FDR cut-off (1e-10, for example), some pathways with a high score but less genes are filtered out. Therefore, I'm considering using both the score and FDR to select enriched pathways. It would be good to know your thoughts on this approach.

montilab / hypeR

Weighted and unweighted "kstest" giving the same results #26