kstreet13 / scry

Collection of analysis methods for small count data generated by rafalab members (the methods, not the data).

Experiences from trying to get GLMPCA to run with the book #7

Open · LTLA opened this issue 4 years ago

LTLA commented 4 years ago

Consider the following stretch of code:

```r
library(scRNAseq)
sce <- ZeiselBrainData()

library(scuttle)
sce <- logNormCounts(sce)  # computes size factors and log-normalized counts

library(scran)
dec <- modelGeneVar(sce)
hvgs <- getTopHVGs(dec, n=1000)  # top 1000 highly variable genes

library(scry)
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce))
```

The most obvious issue is that I can't get it to work; GLMPCA conks out with:

```
Error in glmpca(Y = assay(object, assay), L = L, fam = fam, ctl = ctl,  :
  Numerical divergence (deviance no longer finite), try increasing the penalty to improve stability of optimization.
```

Increasing the penalty to 100 causes it to run for a while: it's been at least 10 minutes so far. By comparison, running the usual approximate PCA on the equivalent log-expression matrix would take 1 second, maybe 2. I know it's doing more computational work than a vanilla PCA, but this seems even slower than zinbwave. It's only a 3k x 1k matrix here; anything above 30 seconds would be too much.
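For reference, a sketch of the call with the larger penalty (this assumes `penalty=` is forwarded through scry's `...` to `glmpca::glmpca()`, which accepted a `penalty` argument in v0.1.0):

```r
out <- GLMPCA(sce[hvgs,], L=50, sz=sizeFactors(sce), penalty=100)
```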

There are also some other quality-of-life issues that you might consider addressing.

Session information

```
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4  stats  graphics  grDevices  utils  datasets
[8] methods   base

other attached packages:
 [1] scRNAseq_2.3.2              scuttle_0.99.8
 [3] scran_1.17.0                SingleCellExperiment_1.11.2
 [5] SummarizedExperiment_1.19.4 DelayedArray_0.15.1
 [7] matrixStats_0.56.0          Biobase_2.49.0
 [9] GenomicRanges_1.41.1        GenomeInfoDb_1.25.0
[11] IRanges_2.23.5              S4Vectors_0.27.7
[13] BiocGenerics_0.35.2         scry_1.1.0

loaded via a namespace (and not attached):
 [1] httr_1.4.1                    edgeR_3.31.0
 [3] BiocSingular_1.5.0            bit64_0.9-7
 [5] AnnotationHub_2.21.0          DelayedMatrixStats_1.11.0
 [7] shiny_1.4.0.2                 assertthat_0.2.1
 [9] interactiveDisplayBase_1.27.2 statmod_1.4.34
[11] BiocManager_1.30.10           BiocFileCache_1.13.0
[13] dqrng_0.2.1                   blob_1.2.1
[15] GenomeInfoDbData_1.2.3        yaml_2.2.1
[17] BiocVersion_3.12.0            pillar_1.4.4
[19] RSQLite_2.2.0                 lattice_0.20-41
[21] glue_1.4.1                    limma_3.45.0
[23] digest_0.6.25                 promises_1.1.0
[25] XVector_0.29.1                htmltools_0.4.0
[27] httpuv_1.5.2                  Matrix_1.2-18
[29] pkgconfig_2.0.3               zlibbioc_1.35.0
[31] purrr_0.3.4                   xtable_1.8-4
[33] glmpca_0.1.0                  later_1.0.0
[35] BiocParallel_1.23.0           tibble_3.0.1
[37] ellipsis_0.3.1                magrittr_1.5
[39] crayon_1.3.4                  mime_0.9
[41] memoise_1.1.0                 tools_4.0.0
[43] lifecycle_0.2.0               locfit_1.5-9.4
[45] irlba_2.3.3                   AnnotationDbi_1.51.0
[47] compiler_4.0.0                rsvd_1.0.3
[49] rlang_0.4.6                   grid_4.0.0
[51] RCurl_1.98-1.2                BiocNeighbors_1.7.0
[53] rappdirs_0.3.1                igraph_1.2.5
[55] bitops_1.0-6                  ExperimentHub_1.15.0
[57] DBI_1.1.0                     curl_4.3
[59] R6_2.4.1                      dplyr_0.8.5
[61] fastmap_1.0.1                 bit_1.1-15.2
[63] Rcpp_1.0.4.6                  vctrs_0.3.0
[65] dbplyr_1.4.3                  tidyselect_1.1.0
```
willtownes commented 4 years ago

Hi Aaron, thanks for the feedback. The numerical-stability problems at large L, the penalty-tuning issues, and the computational speed are all known issues with glmpca that we are actively working on. I am pretty close to having something ready to release that scales better for larger L values, and I am now testing whether it can handle large numbers of cells as well. I will post something here when I release it to CRAN; any improvements should pass through to the scry interface automatically. I also agree with all of your other suggestions. Sorry for the frustrations!

willtownes commented 4 years ago

By the way, we would recommend using the deviance for gene filtering rather than HVGs.

LTLA commented 4 years ago

On what basis?

willtownes commented 4 years ago

It's more consistent with the multinomial model behind GLM-PCA. It also gave better empirical results for, e.g., clustering (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6/figures/5); see also the more comprehensive comparison by Germain et al., who recommend deviance for filtering: https://www.biorxiv.org/content/10.1101/2020.02.02.930578v3

However, this is totally optional and doesn't detract from your main points. You can use any filtering you choose as input to GLM-PCA. It's a separate issue from the speed and numerical problems.
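For concreteness, a minimal sketch of deviance-based filtering with scry's `devianceFeatureSelection()`, feeding the result into `GLMPCA()` as in the original post; the rowData column name `binomial_deviance` is an assumption based on the default binomial family:

```r
library(scry)
sce <- devianceFeatureSelection(sce, assay="counts")  # adds per-gene deviances
dev <- rowData(sce)$binomial_deviance                 # assumed column name
hvgs.dev <- rownames(sce)[rank(-dev) <= 1000]         # 1000 most deviant genes
out <- GLMPCA(sce[hvgs.dev,], L=50, sz=sizeFactors(sce))
```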

LTLA commented 4 years ago

Seems like it would be tricky to choose an appropriate dispersion for the deviance. If you used a Poisson GLM, you'd just end up with a strong mean-dependent trend if the true dispersion was a constant non-zero value across all genes. Unless you're doing your own dispersion estimation, but then how would you do that before you've accounted for the structure in the GLM-PCA?
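A quick illustrative simulation of this concern (not from the thread; all parameters are arbitrary): negative binomial counts with a constant dispersion of 0.1 across genes still yield a per-gene Poisson deviance that trends strongly with the mean.

```r
set.seed(1)
ngenes <- 2000; ncells <- 200
mu <- 2^runif(ngenes, -5, 5)  # per-gene means spanning several orders of magnitude
y <- matrix(rnbinom(ngenes*ncells, mu=mu, size=1/0.1), nrow=ngenes)
muhat <- rowMeans(y)          # intercept-only Poisson fit per gene
dev <- rowSums(2 * (ifelse(y > 0, y * log(y/muhat), 0) - (y - muhat)))
plot(muhat, dev, log="xy")    # deviance increases with the mean despite constant dispersion
```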

Yes, I did read that paper too. But they recommended imputation as well, and I didn't know what to make of that, so I just left it alone and moved on.

willtownes commented 4 years ago

Hi Aaron, glmpca 0.2.0 has now been released. It should be faster and more numerically stable. Please try running your experiments again and let me know if things improve or if there are further problems. Thanks!

LTLA commented 4 years ago

Excellent. I will give it a spin sometime during the week.

LTLA commented 4 years ago

It finishes, which is good. The example in my original post took 262 seconds, which is... tolerable, for the purposes of putting it in the book. I haven't actually tested the output yet to see whether it solves the problem I was hoping it would.

I'll test out the current state, which is definitely better than what it was before. Hopefully it turns out to do what I wanted; even if it does, though, it'll need to be a fair bit faster for mainstream use. Something like a minute for 10k cells sounds pretty reasonable.

There's probably some low-hanging fruit with BiocParallel-ization and DelayedArray-ing, especially if you can make use of the parallelized crossprod provided by DelayedArray (via the Matrix generic). Or maybe there's some minibatching or some other approximation that I'm missing? I did notice a glmpca:::create_minibatches when I was poking around in there.

willtownes commented 4 years ago

Yes, minibatching is now supported. You can set minibatch='memoized' to use cached sufficient statistics to compute gradients, or minibatch='stochastic' for stochastic gradients. The advantage of those methods is to save memory, not necessarily to be faster, although anecdotally I have seen the stochastic gradient converge faster than the full gradient (the default, minibatch='none'). The problem with the stochastic gradient is that it's hard to assess convergence due to the noise in the estimator, which often causes it to run for many excess iterations after it has already effectively converged. To set the minibatch size, use ctl=list(batch_size=1000).
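A sketch of these options in a single call, assuming scry's `GLMPCA()` forwards `minibatch=` and `ctl=` through `...` to `glmpca::glmpca()`:

```r
out <- GLMPCA(sce[hvgs,], L=20, sz=sizeFactors(sce),
    minibatch="stochastic",     # or "memoized" for cached sufficient statistics
    ctl=list(batch_size=1000))  # minibatch size, as described above
```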

Also, for the Zeisel dataset it might be worth trying fam='nb' instead of the default fam='poi', because a couple of genes have really huge counts compared to all the others, and the negative binomial likelihood is a bit more numerically stable than the Poisson in that scenario.

Adding support for DelayedArray/HDF5Array is on my to-do list (willtownes/glmpca#18). Thanks for the suggestion about parallelizing the crossprod operation (now included as willtownes/glmpca#22).

LTLA commented 4 years ago

This is the call I've currently got in the book:

```r
sce.scry <- GLMPCA(sce.zeisel[top.hvgs,], fam="nb",
    L=20, sz=sizeFactors(sce.zeisel))
```

One improvement would be to offer a subsetting argument so that I don't have to subset the SCE on input. This would ensure that the output SCE has the same dimensionality as the input, with the only difference being a new reducedDims entry.

Another improvement would be to provide an SCE method where sz=sizeFactors(object) by default.
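A hypothetical sketch of how both suggestions could look together; the `subset_row=` argument name (borrowed from scater conventions) and the method body are illustrative assumptions, not the current scry API, and it assumes scry's GLMPCA is an S4 generic:

```r
library(SingleCellExperiment)

setMethod("GLMPCA", "SingleCellExperiment",
    function(object, L, ..., subset_row=NULL, sz=sizeFactors(object)) {
        x <- if (is.null(subset_row)) object else object[subset_row,]
        fit <- glmpca::glmpca(assay(x, "counts"), L=L, sz=sz, ...)
        # The output keeps the input's dimensions; the only change is
        # a new reducedDim entry holding the per-cell factors.
        reducedDim(object, "GLMPCA") <- as.matrix(fit$factors)
        object
    })
```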

The function is still achingly slow, though; I wonder whether more can be done to speed it up. For example, when init= is not supplied, using the PCA on the log-values to provide the initial U and V might start you closer to the target space. Or sharing information across runs with different learning rates.

(I don't really understand why multiple learning rates are even required; I'm assuming the learning rate is somehow related to the step size per iteration, but surely this could be adjusted dynamically within a run rather than restarting the run with a different rate? We use a similar approach in edgeR to guarantee that each step decreases the deviance.)
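A sketch of the warm-start idea, assuming glmpca's documented `init=list(factors=..., loadings=...)` format and using BiocSingular for the log-expression PCA:

```r
library(BiocSingular)
# PCA on the log-values to get starting coordinates (cells x 20 scores).
pcs <- runPCA(t(logcounts(sce.zeisel[top.hvgs,])), rank=20)
sce.scry <- GLMPCA(sce.zeisel[top.hvgs,], fam="nb", L=20,
    sz=sizeFactors(sce.zeisel),
    init=list(factors=pcs$x))  # warm start; loadings left to the default
```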

willtownes commented 4 years ago

Thanks for these good suggestions.