jtleek / sva-devel

28 stars 45 forks source link

f.pvalue function gives absurd results on zero-variance rows #34

Open DarwinAwardWinner opened 6 years ago

DarwinAwardWinner commented 6 years ago

I would expect the following to return all 1s, since each row is constant:

x <- matrix(rep(c(0, 0.01, 0.1, 1, 10, 100, 1000), 10), ncol=10)
group <- rep(letters[1:2], each=5)
mod <- model.matrix(~group)
mod0 <- cbind(rep(1, 10))
f.pvalue(x, mod, mod0)

However, the result I get is:

> f.pvalue(x, mod, mod0)
[1]         NaN 1.000000000 0.000210317 0.052042817 1.000000000 1.000000000
[7]         NaN

This can cause problems and errors when running functions the use f.pvalue, such as sva.

DarwinAwardWinner commented 6 years ago

Here is my session info, although I have also observed this on another machine running R 3.4.4:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-apple-darwin17.5.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libLAPACK.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] grDevices datasets  parallel  graphics  stats4    stats     utils    
[8] methods   base     

other attached packages:
 [1] sva_3.28.0           BiocParallel_1.14.1  genefilter_1.62.0   
 [4] mgcv_1.8-23          nlme_3.1-137         BiocInstaller_1.30.0
 [7] tidyr_0.8.0          devtools_1.13.5      openxlsx_4.0.17     
[10] magrittr_1.5         dplyr_0.7.4          foreach_1.4.4       
[13] plyr_1.8.4           glue_1.2.0           stringr_1.3.1       
[16] GenomicRanges_1.32.2 GenomeInfoDb_1.16.0  IRanges_2.14.8      
[19] ggplot2_2.2.1        S4Vectors_0.18.1     BiocGenerics_0.26.0 

loaded via a namespace (and not attached):
 [1] purrr_0.2.4            splines_3.5.0          lattice_0.20-35       
 [4] colorspace_1.3-2       blob_1.1.1             XML_3.98-1.11         
 [7] survival_2.42-3        rlang_0.2.0            pillar_1.2.2          
[10] withr_2.1.2            DBI_1.0.0              bit64_0.9-7           
[13] bindrcpp_0.2.2         matrixStats_0.53.1     GenomeInfoDbData_1.1.0
[16] bindr_0.1.1            zlibbioc_1.26.0        munsell_0.4.3         
[19] gtable_0.2.0           codetools_0.2-15       memoise_1.1.0         
[22] Biobase_2.40.0         AnnotationDbi_1.42.1   Rcpp_0.12.16          
[25] xtable_1.8-2           scales_0.5.0           limma_3.36.1          
[28] annotate_1.58.0        XVector_0.20.0         bit_1.1-12            
[31] digest_0.6.15          stringi_1.2.2          grid_3.5.0            
[34] tools_3.5.0            bitops_1.0-6           lazyeval_0.2.1        
[37] RCurl_1.95-4.10        tibble_1.4.2           RSQLite_2.1.1         
[40] pkgconfig_2.0.1        Matrix_1.2-14          assertthat_0.2.0      
[43] iterators_1.0.9        R6_2.2.2               compiler_3.5.0        
DarwinAwardWinner commented 6 years ago

For what it's worth, I think that NaN is probably the technically correct p-value for an F-test on a zero-variance row. But other functions that use f.pvalue such as sva should also be modified to handle NaN p-values (most likely by just throwing away those genes).