haowulab / DSS

14 stars 13 forks source link

Singular matrix error in solve() when running waldTest() #9

Open aedanr opened 4 years ago

aedanr commented 4 years ago

Hi,

I'm encountering an error when running waldTest() on some simulated RNA-seq data when I supply normalisation factors. The error doesn't occur when using normalisation factors computed using DSS even though the normalisation factors are very similar.

The error occurs in solve(G0), which is called via loccov2(), and seems to be caused by the matrix G0 being singular.

The error can be reproduced using the data and code here: https://github.com/aedanr/PhD-Project-2.3/blob/master/raw.counts.DE50.1.rds https://github.com/aedanr/PhD-Project-2.3/blob/master/troubleshooting_DSS_issue.R:

library(DSS)
library(edgeR)
rawdata <- readRDS('raw.counts.DE50.1.rds')
group <- factor(c(rep(1,50), rep(2,50)))
libsizes <- colSums(rawdata)
nf.tmm <- calcNormFactors(rawdata, method="TMM")
els.tmm <- nf.tmm * libsizes
sf.tmm <- els.tmm / exp(mean(log(libsizes)))

dat.DSS.tmm <- newSeqCountSet(counts=matrix(rawdata, ncol=length(group)), 
                              designs=as.numeric(group), 
                              normalizationFactor=sf.tmm)
dat.DSS.tmm <- estDispersion(dat.DSS.tmm)
res.DSS.tmm <- waldTest(dat.DSS.tmm, 1, 2) # error in solve.default(G0)

dat.DSS.default <- newSeqCountSet(counts=matrix(rawdata, ncol=length(group)), 
                                  designs=as.numeric(group))
dat.DSS.default <- estNormFactors(dat.DSS.default)
dat.DSS.default <- estDispersion(dat.DSS.default)
res.DSS.default <- waldTest(dat.DSS.default, 1, 2) # runs with warnings

Do you have any ideas on how to avoid this?

Thanks.

> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18362)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252    LC_MONETARY=English_Australia.1252 LC_NUMERIC=C                      
[5] LC_TIME=English_Australia.1252    

attached base packages:
 [1] splines   stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] edgeR_3.26.8                limma_3.40.6                DSS_2.32.0                  bsseq_1.20.0                SummarizedExperiment_1.14.1
 [6] DelayedArray_0.10.0         BiocParallel_1.18.1         matrixStats_0.55.0          GenomicRanges_1.36.1        GenomeInfoDb_1.20.0        
[11] IRanges_2.18.3              S4Vectors_0.22.1            Biobase_2.44.0              BiocGenerics_0.30.0        

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.3               compiler_3.6.1           XVector_0.24.0           R.methodsS3_1.7.1        R.utils_2.9.2            bitops_1.0-6            
 [7] tools_3.6.1              DelayedMatrixStats_1.6.1 zlibbioc_1.30.0          lifecycle_0.1.0          rhdf5_2.28.1             lattice_0.20-38         
[13] BSgenome_1.52.0          rlang_0.4.2              Matrix_1.2-18            GenomeInfoDbData_1.2.1   rtracklayer_1.44.4       Biostrings_2.52.0       
[19] gtools_3.8.1             locfit_1.5-9.1           grid_3.6.1               data.table_1.12.6        R6_2.4.1                 HDF5Array_1.12.3        
[25] XML_3.98-1.20            Rhdf5lib_1.6.3           GenomicAlignments_1.20.1 scales_1.1.0             Rsamtools_2.0.3          permute_0.9-5           
[31] colorspace_1.4-1         RCurl_1.95-4.12          munsell_0.5.0            R.oo_1.23.0             
haowulab commented 4 years ago

This is very strange. The error is caused by locfdr (we used local FDR method to adjust for multiple testing). Apparently the only difference is the way of computing normalization factor. Using TMM caused this problem.

I debugged into the code. The normalization factors from TMM and DSS are highly correlated (>0.999). Also the test statistics from using either factors are also highly correlated:

cor(stat.TMM, stat.default) [1] 0.9999965

It's mysterious for me why this little difference will cause problem in locfdr. To be specific, using TMM normalization factor result in following error in locfdr:

3: In locfdr(stat.TMM, plot = 0) : CM estimation failed, middle of histogram non-normal

At this moment, I don't know how to fix it because I will have to look into locfdr. In the meantime, I suggest you use the "default", since the results are essentially the same.

Hao

aedanr commented 4 years ago

Hi Hao,

Thanks for looking into this so quickly. I can use the built-in normalisation in DSS instead of supplying TMM factors for now, but I'm also trying to compare different normalisation methods, e.g. using calcNormFactors(rawdata, method="RLE"), which also gives the same error but doesn't have an equivalent built in to DSS as far as I know, so it would be good to be able to supply these normalisation factors.

Thanks, Aedan

haowulab commented 4 years ago

Aedan,

DSS does take other normalization factors, like what you did in newSeqCountSet. It usually works fine based on my previous tests. For whatever reason, it generates this mysterious error from this particular dataset. As I said, if you dive into the code you'll find that using TMM or DSS normalization gives very similar test statistics, cor= 0.9999965.

One possible solution on my side is to change the way to compute FDR. Currently DSS uses local FDR, which seems to be unstable sometimes. I can switch to tradition FDR if that fails. But this will take me a little time. I'm currently too busy with all other responsibilities including the grant deadline, phd admission, faculty search, etc. I probably can work on this after two weeks. In the meantime, maybe you can try another dataset?

Hao

aedanr commented 4 years ago

No problem, I can work on other datasets for now and come back to this one later.

aedanr commented 4 years ago

Hi Hao,

An update on this issue - it's now happening with one of my simulated datasets even when using the normalisation built in to DSS:

https://github.com/aedanr/PhD-Project-2.3/blob/master/raw.counts.DEDD50.3.rds https://github.com/aedanr/PhD-Project-2.3/blob/master/troubleshooting_DSS_issue.R

haowulab commented 4 years ago

Yeah this comes from localFDR computation, which is something I cannot fix. As I said, I'll change the way to compute FDR. Will let you know once I have it. I'm currently very busy in many other responsibilities.