Out of core PCA

Running a PCA on the whole of the Ag3 dataset is currently fairly challenging. There are three main steps in the computation:

1. Scan genotypes to compute allele counts. This allows handling of the max_missing_an and min_minor_ac parameters.
2. Scan genotypes and prepare biallelic diplotypes. This is the data that the PCA will actually run on.
3. Run the SVD.
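
For reference, the three steps above can be sketched in-core with plain NumPy. This is only an illustration under stated assumptions, not the code in the library: the function names, the genotype layout (variants × samples × ploidy, with -1 for a missing call), the exact meaning given to max_missing_an and min_minor_ac, and the choice to code diplotypes as the number of minor-allele copies are all hypothetical. It also uses `np.linalg.svd` rather than the scipy call mentioned below, and omits any scaling of the diplotypes.

```python
import numpy as np


def count_alleles(gt, max_allele=3):
    """Step 1: ac[i, a] = number of calls of allele a at variant i.

    gt is an int array of shape (n_variants, n_samples, ploidy); -1 = missing.
    """
    counts = [(gt == a).sum(axis=(1, 2)) for a in range(max_allele + 1)]
    return np.stack(counts, axis=1)


def select_sites(ac, n_chroms, max_missing_an=0, min_minor_ac=1):
    """Boolean mask of sites to keep (assumed semantics for both parameters)."""
    an = ac.sum(axis=1)                      # number of non-missing allele calls
    n_alleles_seen = (ac > 0).sum(axis=1)    # how many distinct alleles occur
    minor_ac = np.sort(ac, axis=1)[:, -2]    # second-largest allele count
    return (
        ((n_chroms - an) <= max_missing_an)
        & (n_alleles_seen == 2)              # biallelic sites only
        & (minor_ac >= min_minor_ac)
    )


def to_diplotypes(gt, ac):
    """Step 2: copies of the minor allele per sample, shape (n_variants, n_samples)."""
    major = ac.argmax(axis=1)
    is_minor = (gt >= 0) & (gt != major[:, None, None])
    return is_minor.sum(axis=2)


def pca(x, n_components=2):
    """Step 3: centre each variant across samples, then run a full in-core SVD."""
    xc = x - x.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    coords = (vt[:n_components] * s[:n_components, None]).T
    return coords, s


# Toy data: 4 variants, 3 diploid samples.
gt = np.array(
    [
        [[0, 0], [0, 1], [1, 1]],    # biallelic, no missing calls
        [[0, 0], [0, 0], [0, 0]],    # not segregating
        [[0, 1], [-1, -1], [0, 0]],  # two missing allele calls
        [[0, 2], [0, 0], [0, 0]],    # biallelic singleton
    ],
    dtype="i1",
)
ac = count_alleles(gt)
keep = select_sites(ac, n_chroms=gt.shape[1] * gt.shape[2])
x = to_diplotypes(gt[keep], ac[keep])
coords, s = pca(x)  # coords has shape (n_samples, n_components)
```

Steps 1 and 2 are per-variant reductions, so they map naturally onto chunked arrays; only the final SVD needs the whole diplotype matrix at once.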

The first two steps can be run on a dask cluster, which helps to scale out the computation. However, step 3 currently runs an SVD via scipy, which is in-core. The computation is parallelised over threads via the linear algebra backend (e.g., BLAS), but I had to use a machine with 64 vCPUs to get the SVD to finish in a reasonable time (~13 minutes).

Dask has several out-of-core implementations of SVD, and sgkit implements some of these too, so we could add support for an out-of-core SVD as an alternative for step 3.
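
As a rough sketch of what that could look like, here is one of the Dask options, `dask.array.linalg.svd_compressed`, which computes an approximate truncated SVD by randomized compression. The wrapper name `pca_out_of_core`, its parameters, the default values, and the assumption that the diplotypes arrive as a dask array chunked along the variants axis only are all hypothetical. This does not use sgkit and is not a proposal for the final API.

```python
import dask
import dask.array as da
import numpy as np


def pca_out_of_core(x, n_components=10, n_power_iter=2, seed=42):
    """Approximate PCA on a dask array of diplotypes, shape (n_variants, n_samples).

    Assumes x is chunked along the variants axis only.
    """
    x = x.astype("f4")
    # Centre each variant across samples (lazy, evaluated chunk by chunk).
    xc = x - x.mean(axis=1, keepdims=True)
    # Randomized truncated SVD; power iterations improve accuracy at the
    # cost of extra passes over the data.
    u, s, vt = da.linalg.svd_compressed(
        xc, k=n_components, n_power_iter=n_power_iter, seed=seed
    )
    coords = (vt * s[:, None]).T  # (n_samples, n_components)
    # Compute both outputs in a single pass over the task graph.
    coords, s = dask.compute(coords, s)
    return coords, s
```

Because the result is approximate, it would be worth comparing the leading components against the existing in-core SVD on a smaller sample set before relying on it.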

malariagen / malariagen-data-python

Out of core PCA #607