Running a PCA on the whole of the Ag3 dataset is currently fairly challenging. The computation has three main steps:
1. Scan genotypes to compute allele counts. This allows the max_missing_an and min_minor_ac parameters to be handled.
2. Scan genotypes and prepare biallelic diplotypes. This is the data that the PCA will actually run on.
3. Run the SVD.
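To make the three steps concrete, here is a minimal in-core sketch on a small genotype array. It is an illustration only, not the package's implementation: the function name `pca_sketch`, the array layout (variants, samples, ploidy) with -1 for missing calls, and the use of `numpy.linalg.svd` as a stand-in for the scipy call are all assumptions.

```python
import numpy as np


def pca_sketch(gt, max_missing_an=0, min_minor_ac=1, n_components=3):
    """Hypothetical helper. gt is an int array of shape
    (n_variants, n_samples, 2) holding allele indices, with -1 for missing."""
    # Step 1: allele counts per variant, used to apply the two filters.
    an_called = (gt >= 0).sum(axis=(1, 2))
    an_missing = gt.shape[1] * 2 - an_called
    ac_ref = (gt == 0).sum(axis=(1, 2))
    ac_alt = (gt == 1).sum(axis=(1, 2))
    biallelic = (ac_ref + ac_alt) == an_called  # only alleles 0 and 1 seen
    minor_ac = np.minimum(ac_ref, ac_alt)
    keep = biallelic & (an_missing <= max_missing_an) & (minor_ac >= min_minor_ac)

    # Step 2: biallelic diplotypes, i.e. alt allele count (0, 1 or 2) per sample.
    # Simplification: a missing call would be counted as 0 alt alleles here,
    # which is only harmless with the default max_missing_an=0.
    gn = (gt[keep] == 1).sum(axis=2).astype("f4")

    # Step 3: centre each variant and run an in-core SVD.
    gn -= gn.mean(axis=1, keepdims=True)
    u, s, vt = np.linalg.svd(gn, full_matrices=False)
    coords = vt[:n_components].T * s[:n_components]  # (n_samples, n_components)
    return coords, s[:n_components]
```

Steps 1 and 2 are simple reductions over the genotype array, which is why they distribute well; step 3 needs the whole filtered matrix in memory at once.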
The first two steps can be run on a dask cluster, which helps to scale out the computation. However, step 3 currently runs the SVD via scipy, which is in-core. The computation is parallelised over threads by the linear algebra backend (e.g., BLAS), but I had to use a machine with 64 vCPUs to get the SVD to finish in a reasonable time (~13 minutes).
Dask also has several out-of-core implementations of SVD, and sgkit implements some of these too, so we could add support for this somehow.
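As a rough idea of what this could look like, the sketch below uses `dask.array.linalg.svd_compressed`, one of dask's approximate (randomized) SVD routines. The helper name `dask_truncated_svd`, the chunk sizes and the number of power iterations are assumptions chosen for illustration, and this has not been tried on Ag3-scale data.

```python
import numpy as np
import dask.array as da


def dask_truncated_svd(x, k=3, chunks=(50, 50), seed=0):
    """Hypothetical helper: approximate rank-k SVD of a 2D array via dask."""
    # Wrap the input as a chunked dask array (a no-op cost for lazy inputs).
    x_da = da.asarray(x).rechunk(chunks)
    # Randomized, compressed SVD; returns lazy dask arrays (u, s, vt).
    u, s, vt = da.linalg.svd_compressed(x_da, k=k, n_power_iter=2, seed=seed)
    return u, s, vt


# Small usage example on a random matrix of exact rank 3.
rng = np.random.default_rng(42)
x = rng.normal(size=(200, 3)) @ rng.normal(size=(3, 50))
u, s, vt = dask_truncated_svd(x, k=3)
s_dask = s.compute()  # nothing is computed until this point
```

Because the result is approximate, it would be worth checking the leading components against the in-core SVD on a subset of the data before relying on it.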
This is also related to the limit on the number of elements supported by the in-core SVD, described here. If the SVD were computed via dask, this limit would probably not apply.