ENH: Principal components analysis

alimanfoo commented 5 years ago

Proposal: add principal components analysis (PCA) functions.

Implementation plan

Add skallel_stats.decomposition package.
Add skallel_stats.decomposition.api module.
Add pca() public API function.
Add randomized_pca() public API function.
Add dispatch functions dispatch_pca and dispatch_randomized_pca.
Add numpy_backend.
Add dask_backend.
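
As a rough illustration of how the public function, the dispatch function and the backends could fit together, here is a minimal sketch. It uses functools.singledispatch from the standard library purely as a stand-in; the dispatch mechanism skallel would actually use is not stated here. ChunkedArray is a hypothetical placeholder for a dask array so the sketch runs without dask, and the n_components parameter, the (n_variants, n_samples) input layout and the backend function names are assumptions.

```python
from dataclasses import dataclass
from functools import singledispatch

import numpy as np


@singledispatch
def dispatch_pca(x, n_components):
    # Reached only when no backend is registered for type(x).
    raise NotImplementedError(f"no PCA backend for {type(x).__name__}")


@dispatch_pca.register(np.ndarray)
def _numpy_pca(x, n_components):
    # Assumed layout: rows are variants, columns are samples.
    xt = x.T - x.T.mean(axis=0)
    u, s, vt = np.linalg.svd(xt, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    evr = (s ** 2 / np.sum(s ** 2))[:n_components]
    return coords, loadings, evr


@dataclass
class ChunkedArray:
    # Hypothetical stand-in for a dask array: a list of column blocks.
    blocks: list


@dispatch_pca.register(ChunkedArray)
def _chunked_pca(x, n_components):
    # A real dask backend would stay lazy; this one simply materialises
    # the blocks and delegates to the numpy backend.
    return dispatch_pca(np.concatenate(x.blocks, axis=1), n_components)


def pca(x, n_components=2):
    # Public API function: a thin wrapper over the dispatcher.
    return dispatch_pca(x, n_components)
```

With this shape, adding a backend means registering one more function on dispatch_pca; the public pca() does not change.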

Notes

This is a port and refactoring of functionality from scikit-allel version 1.x. See pca() and randomized_pca() there.

N.B., here it is proposed not to include the scaling preprocessing operation within the PCA implementation. Rather we leave that as a separate function (xref #9) which the user has to call themselves. E.g.:

x = ...  # data to analyse
x_scaled = skallel.scale_standard(x)
coords, loadings, evr = skallel.pca(x_scaled)

alimanfoo commented 5 years ago

Note that the scikit-allel v1.x implementation follows the scikit-learn approach internally, implementing classes with fit(), transform() and fit_transform() methods. An initial implementation here could perhaps drop those classes and do both the fitting and the transforming inside the main pca() function. I.e., have a signature like:

def pca(x):
    """
    Perform PCA.

    Parameters
    ----------
    x : array_like, 2 dimensional

    Returns
    -------
    coords
    loadings
    explained_variance_ratio
    """
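
For concreteness, a minimal numpy sketch of what the body of such a function might look like. It assumes x has shape (n_variants, n_samples) and has already been scaled by the caller, transposes internally, and uses a plain SVD. The n_components parameter is an addition that is not part of the signature above, and the shapes chosen for the return values are assumptions.

```python
import numpy as np


def pca(x, n_components=2):
    """Sketch of PCA via SVD; x is assumed to be (n_variants, n_samples)."""
    x = np.asarray(x, dtype=float)
    if x.ndim != 2:
        raise ValueError("x must be 2-dimensional")
    # Transpose so samples are rows, then centre each variant (column).
    xt = x.T
    xc = xt - xt.mean(axis=0)
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    # Sample coordinates on each retained component.
    coords = u[:, :n_components] * s[:n_components]
    # Per-variant weights of each retained component.
    loadings = vt[:n_components]
    # Fraction of total variance captured by each retained component.
    explained_variance_ratio = (s ** 2 / np.sum(s ** 2))[:n_components]
    return coords, loadings, explained_variance_ratio
```

Here coords has one row per sample and loadings has one row per component. The function only centres the data; any scaling is left to the caller, as proposed above.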

In particular, it's not immediately obvious how the separate fit() and transform() steps would work with dispatching to multiple backends. That said, we could possibly have a dispatch function for each step, i.e., dispatch_pca_fit_transform(), dispatch_pca_fit(), dispatch_pca_transform().
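
One way the split might work without classes is to have each step take and return plain arrays, so that each could sit behind its own dispatch function. The following is only a sketch under that assumption: the function names pca_fit, pca_transform and pca_fit_transform, the n_components parameter and the (n_variants, n_samples) layout are all hypothetical, and only a numpy version is shown.

```python
import numpy as np


def pca_fit(x, n_components=2):
    # x assumed (n_variants, n_samples); returns plain arrays, no object state.
    xt = np.asarray(x, dtype=float).T
    mean = xt.mean(axis=0)
    _, s, vt = np.linalg.svd(xt - mean, full_matrices=False)
    loadings = vt[:n_components]
    evr = (s ** 2 / np.sum(s ** 2))[:n_components]
    return mean, loadings, evr


def pca_transform(x, mean, loadings):
    # Project samples (columns of x) onto previously fitted components.
    return (np.asarray(x, dtype=float).T - mean) @ loadings.T


def pca_fit_transform(x, n_components=2):
    # Fit and project the same data in one call.
    mean, loadings, evr = pca_fit(x, n_components)
    return pca_transform(x, mean, loadings), loadings, evr
```

For example, `mean, loadings, evr = pca_fit(x_train)` followed by `pca_transform(x_new, mean, loadings)` projects new samples onto components fitted elsewhere, with the fitted state passed explicitly rather than stored on an object.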

alimanfoo commented 5 years ago

Here's a gist with the transposition worked out so we don't have to transpose the input array (as scikit-allel version 1 does).
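
The idea can be illustrated with a short numpy sketch. This is an independent illustration, not the content of the gist, and it assumes x has shape (n_variants, n_samples); the function names and the n_components parameter are hypothetical. If the centred input factors as xc = u @ diag(s) @ vt, then its transpose factors as vt.T @ diag(s) @ u.T, so the two sets of singular vectors simply swap roles and no transpose of the data is needed.

```python
import numpy as np


def pca_no_transpose(x, n_components=2):
    # x assumed (n_variants, n_samples); centre each variant across samples.
    xc = x - x.mean(axis=1, keepdims=True)
    # xc = u @ diag(s) @ vt, so xc.T = vt.T @ diag(s) @ u.T:
    # u and vt swap roles relative to the transposed problem.
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    coords = vt[:n_components].T * s[:n_components]
    loadings = u[:, :n_components].T
    evr = (s ** 2 / np.sum(s ** 2))[:n_components]
    return coords, loadings, evr


def pca_transposed(x, n_components=2):
    # Reference: transpose first, then run a samples-by-variants PCA.
    xt = x.T - x.T.mean(axis=0)
    u, s, vt = np.linalg.svd(xt, full_matrices=False)
    coords = u[:, :n_components] * s[:n_components]
    loadings = vt[:n_components]
    evr = (s ** 2 / np.sum(s ** 2))[:n_components]
    return coords, loadings, evr
```

The two functions should agree up to the sign of each component, which an SVD does not determine.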

scikit-allel / skallel-stats

ENH: Principal components analysis #10
