malariagen / malariagen-data-python

Analyse MalariaGEN data from Python
https://malariagen.github.io/malariagen-data-python/latest/
MIT License

Out of core PCA #607

Open alimanfoo opened 2 months ago

alimanfoo commented 2 months ago

Running a PCA on the whole of the Ag3 dataset is currently fairly challenging. There are three main steps in the computation:

  1. Scan genotypes to compute allele counts - allows handling of max_missing_an and min_minor_ac parameters.
  2. Scan genotypes and prepare biallelic diplotypes - this is the data that the PCA will actually run on.
  3. Run the SVD.
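For illustration, the three steps above can be sketched on a toy in-memory array. This is a minimal sketch under stated assumptions, not the package's implementation: the function `toy_pca` is hypothetical, genotypes are assumed to be coded 0 (ref), 1 (alt) and -1 (missing), and the meanings given to `max_missing_an` and `min_minor_ac` are inferred from their names. It also uses `numpy.linalg.svd` in place of scipy to keep the example dependency-light.

```python
import numpy as np


def toy_pca(gt, max_missing_an=0, min_minor_ac=1, n_components=2):
    """Toy PCA on a genotype array of shape (n_variants, n_samples, 2).

    Assumed coding (hypothetical): 0 = ref, 1 = alt, -1 = missing call.
    """
    # Step 1: allele counts per variant, used to filter variants.
    ac_ref = (gt == 0).sum(axis=(1, 2))
    ac_alt = (gt == 1).sum(axis=(1, 2))
    an_missing = gt.shape[1] * 2 - (ac_ref + ac_alt)
    minor_ac = np.minimum(ac_ref, ac_alt)
    keep = (an_missing <= max_missing_an) & (minor_ac >= min_minor_ac)

    # Step 2: biallelic diplotypes, i.e. the number of alt alleles per
    # sample (0, 1 or 2), centred per variant.
    gn = (gt[keep] == 1).sum(axis=2).astype("f8")
    gn -= gn.mean(axis=1, keepdims=True)

    # Step 3: in-core SVD; sample coordinates are the right singular
    # vectors scaled by the singular values.
    u, s, vt = np.linalg.svd(gn, full_matrices=False)
    coords = vt[:n_components].T * s[:n_components]
    return coords, keep


# Four variants, three samples.
gt = np.array([
    [[0, 0], [0, 1], [1, 1]],    # segregating, kept
    [[0, 0], [0, 0], [0, 0]],    # monomorphic, dropped by min_minor_ac
    [[0, 1], [-1, -1], [1, 1]],  # missing calls, dropped by max_missing_an
    [[1, 1], [0, 1], [0, 0]],    # segregating, kept
])
coords, keep = toy_pca(gt)
```

In the real workflow, steps 1 and 2 scan the full genotype data, which is why they benefit from being distributed, whereas step 3 here operates on a single in-memory matrix.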

The first two steps can be run on a dask cluster, which helps to scale out the computation. However, step 3 currently runs an SVD via scipy, which is in-core. The computation is parallelised over threads via the linear algebra backend (e.g., BLAS), but I had to use a machine with 64 vCPUs to get the SVD to finish in a reasonable time (~13 minutes).

Dask also has several out-of-core implementations of SVD, and sgkit implements some of these too, so we could add support for this somehow.
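As a rough illustration of what is available on the dask side, here is a minimal sketch using two functions from `dask.array.linalg`: `svd` (exact, for tall-and-skinny chunking) and `svd_compressed` (approximate, randomized, truncated to `k` components). The random matrix `x` is a hypothetical stand-in for the centred diplotype matrix, and the chunk sizes and `k` are arbitrary choices for the example, not recommendations.

```python
import numpy as np
import dask.array as da

rng = np.random.default_rng(0)

# Hypothetical stand-in for a centred diplotype matrix (variants x samples).
x = rng.standard_normal((2_000, 50))

# Chunk along the variants axis only, so every block is tall and skinny.
gn = da.from_array(x, chunks=(500, 50))

# Exact SVD; with this chunking dask can process the matrix block by block.
u, s, vt = da.linalg.svd(gn)
s_exact = s.compute()

# Approximate randomized SVD, keeping only the top k components.
u_k, s_k, vt_k = da.linalg.svd_compressed(gn, k=5, n_power_iter=2, seed=0)
s_approx = s_k.compute()

# Sample coordinates on the top k components.
coords = (vt_k.T * s_k).compute()
```

Because these operate on chunked arrays, the same calls could in principle be pointed at a dask array backed by on-disk data and a distributed scheduler, although how well that performs on the full Ag3 data would need testing.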

alimanfoo commented 2 months ago

This is also related to the limit on the number of elements supported by the in-core SVD, described here. I.e., if the SVD is computed via dask, then this limit would probably not apply.