Open: LTLA opened this issue 4 years ago
Hi Aaron, thanks for the feedback. The problems with numerical stability for large L and tuning issues of the penalty (+ computational speed) are all known issues with glmpca we are actively working on. I am pretty close to having something ready to release that scales better for larger L values. I am now testing it to see if it can work with large numbers of cells as well. I will post something here when I release it to CRAN, and any improvements should pass through to the scry interface automatically. Also I agree with all of your other suggestions. Sorry for the frustrations!
By the way, we would recommend using the deviance for gene filtering rather than HVGs.
On what basis?
It's more consistent with the multinomial model behind GLM-PCA. Also it gave better empirical results for e.g. clustering: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1861-6/figures/5 ; see also the more comprehensive comparison by Germain et al., where they recommend deviance for filtering: https://www.biorxiv.org/content/10.1101/2020.02.02.930578v3
However, this is totally optional and doesn't detract from your main points. You can use any filtering you choose as input to GLM-PCA. It's a separate issue from the speed and numerical problems.
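For reference, a minimal sketch of deviance-based filtering with `scry::devianceFeatureSelection()`; `sce.zeisel` here is a SingleCellExperiment with a counts assay, and the rowData column name is an assumption that may vary across scry versions:

```r
library(scry)

# Rank genes by binomial deviance from a constant-expression null model,
# then keep the top 2000 as input features for GLM-PCA.
sce.zeisel <- devianceFeatureSelection(sce.zeisel, assay = "counts", fam = "binomial")
dev <- rowData(sce.zeisel)$binomial_deviance   # column name assumed here
top.dev <- rownames(sce.zeisel)[head(order(dev, decreasing = TRUE), 2000)]
```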
Seems like it would be tricky to choose an appropriate dispersion for the deviance. If you used a Poisson GLM, you'd just end up with a strong mean-dependent trend if the true dispersion was a constant non-zero value across all genes. Unless you're doing your own dispersion estimation, but then how would you do that before you've accounted for the structure in the GLM-PCA?
Yes, I did read that paper too. But they recommended imputation as well, and I didn't know what to make of that, so I just left it alone and moved on.
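To illustrate the dispersion concern above with a small self-contained simulation (illustrative only, not from any package): with a constant negative binomial dispersion across all genes, the per-gene Poisson deviance against a constant-mean null largely tracks the mean.

```r
set.seed(1)
ngenes <- 2000; ncells <- 500
mu <- rexp(ngenes, rate = 1/5)                            # gene-specific means
y <- matrix(rnbinom(ngenes * ncells, mu = mu, size = 2),  # same dispersion for every gene
            nrow = ngenes)

# Per-gene Poisson deviance against a constant-mean (intercept-only) model.
m <- rowMeans(y)
dev <- rowSums(2 * (ifelse(y > 0, y * log(y / m), 0) - (y - m)))
cor(dev, m, method = "spearman")   # strongly positive: deviance follows the mean
```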
Hi Aaron, glmpca 0.2.0 has now been released. It should be faster and more numerically stable. Please try running your experiments again and let me know if things improve or if there are further problems. Thanks!
Excellent. I will give it a spin sometime during the week.
It finishes, which is good. The example in my original post took 262 seconds, which is... tolerable, for the purposes of putting it in the book. I haven't actually tested the output yet to see whether it solves the problem that I was hoping it would solve.
I'll test out the current state, which is definitely better than what it was before. Hopefully it turns out to do what I wanted; but if it does, it'll need to be a fair bit faster for mainstream use. Maybe a minute for 10k cells, which sounds pretty reasonable.
There's probably some low-hanging fruit with BiocParallelization and DelayedArraying, especially if you can make use of the parallelized `crossprod` provided by DelayedArray (via the Matrix generic). Or maybe there's some minibatching or some other approximation that I'm missing? I did notice a `glmpca:::create_minibatches` when I was poking around in there.
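A rough sketch of what I mean, assuming the `crossprod` method for DelayedMatrix objects respects the BiocParallel backend set via `setAutoBPPARAM()` (details to be checked):

```r
library(DelayedArray)
library(BiocParallel)

# Toy genes x cells count matrix wrapped as a DelayedMatrix.
Y <- DelayedArray(matrix(rpois(3000 * 1000, lambda = 1), nrow = 3000))

setAutoBPPARAM(MulticoreParam(workers = 4))  # workers used for block processing
xtx <- crossprod(Y)                          # t(Y) %*% Y via the Matrix generic
```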
Yes, minibatch is now supported. You can do `minibatch='memoized'` to use cached sufficient statistics to compute gradients, or `minibatch='stochastic'` for stochastic gradients. The advantage of those methods is to save on memory, not necessarily to be faster, but anecdotally I have seen the stochastic gradient converge faster than the full gradient (the default, `minibatch='none'`). The problem with the stochastic gradient is that it's hard to assess convergence due to the noise in the estimator, which often causes it to run for many excess iterations after it has effectively converged. To set the minibatch size, use `ctl=list(batch_size=1000)`.
Also, for the Zeisel dataset it might be worth trying `fam='nb'` instead of the default `fam='poi'`, because there are a couple of genes that have really huge counts compared to all the others, and negative binomial likelihood is a bit more numerically stable than Poisson in that scenario.
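Putting those two suggestions together for the Zeisel example (a sketch only; `sce.zeisel` and `top.hvgs` are the objects used elsewhere in this thread, and I'm assuming the extra arguments pass through scry's `GLMPCA` to `glmpca::glmpca`):

```r
sce.scry <- GLMPCA(sce.zeisel[top.hvgs, ], L = 20, fam = "nb",
                   sz = sizeFactors(sce.zeisel),
                   minibatch = "stochastic",
                   ctl = list(batch_size = 1000))
```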
Adding support for DelayedArray/HDF5Array is on my to-do list (willtownes/glmpca#18). Thanks for the suggestion about parallelizing the crossprod operation (now included as willtownes/glmpca#22).
This is the call I've currently got in the book:
```r
sce.scry <- GLMPCA(sce.zeisel[top.hvgs,], fam="nb",
    L=20, sz=sizeFactors(sce.zeisel))
```
One improvement would be to offer a subsetting argument so that I don't have to subset the SCE as input. This would ensure that the output SCE has the same dimensions as the input, with the only difference being a new `reducedDim` entry.
Another improvement would be to provide an SCE method where `sz=sizeFactors(object)` by default.
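To make the two suggestions concrete, hypothetical sketches (neither exists in scry at the moment; the argument name `subset=` and the S4 dispatch details are my assumptions):

```r
# Hypothetical subset= argument: fit on the selected rows but keep the
# full-dimension SCE, only adding a new reducedDim entry.
sce.scry <- GLMPCA(sce.zeisel, L = 20, fam = "nb", subset = top.hvgs)

# Hypothetical SingleCellExperiment method that defaults sz to the stored
# size factors; assumes GLMPCA is an S4 generic with a parent method to call.
setMethod("GLMPCA", "SingleCellExperiment",
    function(object, L, ..., sz = sizeFactors(object)) {
        callNextMethod(object, L = L, ..., sz = sz)
    })
```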
The function is still achingly slow, though. I wonder whether there is more that can be done to speed it up. For example, using the PCA on the log-values to provide the initial `U` and `V` when `init=` is not supplied; maybe that would start you off closer to the target space. Or sharing information across iterations with different learning rates.
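For the initialization idea, something along these lines, assuming `init=` accepts a list with `factors` (cells x L) and `loadings` (genes x L) matching glmpca's output slots:

```r
library(glmpca)
library(irlba)

Y <- as.matrix(counts(sce.zeisel[top.hvgs, ]))         # genes x cells counts
logY <- log1p(1e4 * t(t(Y) / colSums(Y)))              # quick log-normalization
pc <- prcomp_irlba(t(logY), n = 20, center = TRUE)     # PCA on the cells

fit <- glmpca(Y, L = 20, fam = "nb",
              init = list(factors = pc$x, loadings = pc$rotation))
```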
(I don't really understand why multiple learning rates are even required; I'm assuming the learning rate is somehow related to the step size per iteration, but surely this could be dynamically adjusted within a run rather than restarting the run with a different rate? We use a similar approach in edgeR to guarantee that each step decreases deviance.)
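As a self-contained toy of the within-run adjustment I have in mind (illustrative only, not glmpca or edgeR code): damp the step whenever it would increase the deviance, instead of restarting with a smaller learning rate.

```r
set.seed(1)
y <- rpois(100, lambda = 5)
pois_dev <- function(mu) sum(2 * (ifelse(y > 0, y * log(y / mu), 0) - (y - mu)))

eta <- log(mean(y)) + 1                 # deliberately poor start (log link)
for (it in 1:50) {
    mu <- exp(eta)
    grad <- sum(mu - y)                 # d(deviance)/d(eta), up to a factor of 2
    step <- 0.001
    repeat {                            # halve the step until the deviance decreases
        eta_new <- eta - step * grad
        if (pois_dev(exp(eta_new)) <= pois_dev(mu)) break
        step <- step / 2
    }
    eta <- eta_new
}
exp(eta)   # recovers mean(y)
```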
Thanks for these good suggestions.
Consider the following stretch of code:
The most obvious issue is that I can't get it to work; `GLMPCA` conks out with:

Increasing the `penalty` to 100 causes it to run for a while. It's been 10 minutes at least; by comparison, running the usual approximate PCA on the equivalent log-expression matrix would take 1 second, maybe 2. I know it's doing more computational work than a vanilla PCA, but this seems even slower than zinbwave. It's only a 3k x 1k matrix here; anything above 30 seconds would be too much.

There are also some other quality-of-life issues that you might consider addressing:

- A `subset=` argument to `GLMPCA()` so that we don't have to row-subset the entire SCE in the input, especially when only one of the matrices is of interest.
- A `SingleCellExperiment` method that defaults `sz=sizeFactors(x)`.

Session information
```
R version 4.0.0 Patched (2020-05-01 r78341)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.4 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-0-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-0-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] scRNAseq_2.3.2              scuttle_0.99.8
 [3] scran_1.17.0                SingleCellExperiment_1.11.2
 [5] SummarizedExperiment_1.19.4 DelayedArray_0.15.1
 [7] matrixStats_0.56.0          Biobase_2.49.0
 [9] GenomicRanges_1.41.1        GenomeInfoDb_1.25.0
[11] IRanges_2.23.5              S4Vectors_0.27.7
[13] BiocGenerics_0.35.2         scry_1.1.0

loaded via a namespace (and not attached):
 [1] httr_1.4.1                    edgeR_3.31.0
 [3] BiocSingular_1.5.0            bit64_0.9-7
 [5] AnnotationHub_2.21.0          DelayedMatrixStats_1.11.0
 [7] shiny_1.4.0.2                 assertthat_0.2.1
 [9] interactiveDisplayBase_1.27.2 statmod_1.4.34
[11] BiocManager_1.30.10           BiocFileCache_1.13.0
[13] dqrng_0.2.1                   blob_1.2.1
[15] GenomeInfoDbData_1.2.3        yaml_2.2.1
[17] BiocVersion_3.12.0            pillar_1.4.4
[19] RSQLite_2.2.0                 lattice_0.20-41
[21] glue_1.4.1                    limma_3.45.0
[23] digest_0.6.25                 promises_1.1.0
[25] XVector_0.29.1                htmltools_0.4.0
[27] httpuv_1.5.2                  Matrix_1.2-18
[29] pkgconfig_2.0.3               zlibbioc_1.35.0
[31] purrr_0.3.4                   xtable_1.8-4
[33] glmpca_0.1.0                  later_1.0.0
[35] BiocParallel_1.23.0           tibble_3.0.1
[37] ellipsis_0.3.1                magrittr_1.5
[39] crayon_1.3.4                  mime_0.9
[41] memoise_1.1.0                 tools_4.0.0
[43] lifecycle_0.2.0               locfit_1.5-9.4
[45] irlba_2.3.3                   AnnotationDbi_1.51.0
[47] compiler_4.0.0                rsvd_1.0.3
[49] rlang_0.4.6                   grid_4.0.0
[51] RCurl_1.98-1.2                BiocNeighbors_1.7.0
[53] rappdirs_0.3.1                igraph_1.2.5
[55] bitops_1.0-6                  ExperimentHub_1.15.0
[57] DBI_1.1.0                     curl_4.3
[59] R6_2.4.1                      dplyr_0.8.5
[61] fastmap_1.0.1                 bit_1.1-15.2
[63] Rcpp_1.0.4.6                  vctrs_0.3.0
[65] dbplyr_1.4.3                  tidyselect_1.1.0
```