reconstrue / single_cell

Single cell analysis tools built to run on Jupyter, especially Colab
http://reconstrue.com
Apache License 2.0
1 stars 0 forks source link

PCA issues #18

Open JohnTigue opened 4 years ago

JohnTigue commented 4 years ago

DROP: A Workload-Aware Optimizer for Dimensionality Reduction

Abstract:

Dimensionality reduction (DR) is critical in scaling machine learning pipelines: by reducing input dimensionality in exchange for a preprocessing overhead, DR enables faster endto-end runtime. Principal component analysis (PCA) is a DR standard, but can be computationally expensive: classically O(dn2 + n3) for an n-dimensional dataset of d points. Theoretical work has optimized PCA via iterative, sample-based stochastic methods. However, these methods execute for a fixed number of iterations or to convergence, sampling too many or too few datapoints for end-to-end runtime improvements. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to efficiently terminate after processing small (e.g., 1%) samples of data. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5× over Singular-ValueDecomposition (SVD)-based PCA, and 16× over conventional DR methods in end-to-end nearest neighbor workloads.

JohnTigue commented 4 years ago

Rapids cuML has PCA on GPU: https://devblogs.nvidia.com/gpu-accelerated-analytics-rapids/