DROP: A Workload-Aware Optimizer for Dimensionality Reduction

Abstract:

Dimensionality reduction (DR) is critical in scaling machine learning pipelines: by reducing input dimensionality in exchange for a preprocessing overhead, DR enables faster end-to-end runtime. Principal component analysis (PCA) is a DR standard, but can be computationally expensive: classically O(dn² + n³) for an n-dimensional dataset of d points. Theoretical work has optimized PCA via iterative, sample-based stochastic methods. However, these methods execute for a fixed number of iterations or to convergence, sampling too many or too few data points for end-to-end runtime improvements. We show how accounting for downstream analytics operations during DR via PCA allows stochastic methods to efficiently terminate after processing small (e.g., 1%) samples of data. Leveraging this, we propose DROP, a DR optimizer that enables speedups of up to 5× over Singular Value Decomposition (SVD)-based PCA, and 16× over conventional DR methods in end-to-end nearest neighbor workloads.
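
To make the early-termination idea concrete, the following is a minimal Python/NumPy sketch of sample-based PCA that stops as soon as a quality target sufficient for the downstream workload is met on a small sample. This is an illustration only, not DROP's algorithm: the variance-retained quality metric, the 1%-then-double sampling schedule, and names such as target_quality are assumptions introduced here.

    # Sketch: progressive, sample-based PCA with workload-aware early termination.
    # Assumptions (not from the paper): quality = fraction of variance retained,
    # sample schedule doubles from 1% of the data until the target is met.
    import numpy as np

    def sample_based_pca(X, k, target_quality=0.99, initial_frac=0.01, growth=2.0, seed=0):
        """Fit k principal components on progressively larger samples of X,
        returning as soon as the retained variance on a validation sample
        meets the downstream quality target."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        frac = initial_frac
        while True:
            # Draw a sample of the data and center it.
            idx = rng.choice(n, size=min(n, max(k + 1, int(frac * n))), replace=False)
            sample = X[idx] - X[idx].mean(axis=0)
            # Top-k principal components of the sample via thin SVD.
            _, _, Vt = np.linalg.svd(sample, full_matrices=False)
            components = Vt[:k]
            # Estimate quality as variance retained on a fresh validation sample.
            val = X[rng.choice(n, size=min(n, 1000), replace=False)]
            val = val - val.mean(axis=0)
            retained = np.linalg.norm(val @ components.T) ** 2 / np.linalg.norm(val) ** 2
            if retained >= target_quality or frac >= 1.0:
                return components, retained
            frac = min(1.0, frac * growth)  # sample more data and retry

In a workload-aware setting, target_quality would be derived from the downstream operation (e.g., the accuracy needed for nearest neighbor queries) rather than fixed in advance; the point of the sketch is only that termination is driven by that requirement, not by a fixed iteration count.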