scikit-allel / skallel-stats

Statistical functions for genome variation data.
MIT License

ENH: Principal components analysis #10

Open alimanfoo opened 5 years ago

alimanfoo commented 5 years ago

Proposal: add principal components analysis (PCA) functions.

Implementation plan

Notes

This is a port and refactoring of functionality from scikit-allel version 1.x; see pca() and randomized_pca() there.

N.B., the proposal here is not to include the scaling preprocessing step within the PCA implementation. Instead, scaling stays a separate function (xref #9) that users call themselves before running PCA. E.g.:

x = ...  # data to analyse
x_scaled = skallel.scale_standard(x)
coords, loadings, evr = skallel.pca(x_scaled)

alimanfoo commented 5 years ago

Note that the scikit-allel v1.x implementation follows the scikit-learn approach internally, implementing classes with fit(), transform() and fit_transform() methods. Here, an initial implementation could perhaps drop that and simply do both the fitting and the transforming within the main pca() function. I.e., have a signature like:

def pca(x):
    """
    Perform PCA.

    Parameters
    ----------
    x : array_like, 2 dimensional

    Returns
    -------
    coords
    loadings
    explained_variance_ratio
    """
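
To make the single-function idea concrete, here is a minimal NumPy sketch that fits and transforms in one call via SVD. It is only an illustration, not the proposed implementation: the input orientation (n_variants, n_samples), the internal transpose mirroring what scikit-allel v1 does, and the n_components parameter are all assumptions, and the input is assumed to have been scaled already.

```python
import numpy as np


def pca(x, n_components=2):
    """Sketch only: fit and transform in a single call.

    Assumes x has shape (n_variants, n_samples) and was scaled beforehand.
    n_components is a hypothetical parameter, not part of the proposal.
    """
    x = np.asarray(x, dtype=float)

    # Transpose so that rows are samples, then centre each variant.
    xt = x.T
    xt = xt - xt.mean(axis=0)

    # Thin SVD: xt == u @ np.diag(s) @ vt
    u, s, vt = np.linalg.svd(xt, full_matrices=False)

    # Sample coordinates in PC space, shape (n_samples, n_components).
    coords = u[:, :n_components] * s[:n_components]

    # Variant loadings, shape (n_components, n_variants).
    loadings = vt[:n_components]

    # Fraction of the total variance explained by each retained component.
    explained_variance = s ** 2 / (xt.shape[0] - 1)
    evr = explained_variance[:n_components] / explained_variance.sum()

    return coords, loadings, evr
```

With this shape of API there is no fitted model object to keep around, which is the simplification being suggested; the cost is that new samples cannot be projected onto existing components without refitting.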

In particular, it's not immediately obvious how separate fit() and transform() steps would work with dispatching to multiple backends. That said, we could possibly have a dispatch function for each step, i.e., dispatch_pca_fit_transform(), dispatch_pca_fit() and dispatch_pca_transform().
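
As a rough sketch of what per-step dispatch might look like, the example below uses functools.singledispatch from the standard library as a stand-in for whatever dispatch mechanism the package actually uses. The function bodies, the dict used as a "model", the n_components parameter and the (n_variants, n_samples) orientation are all assumptions made for illustration. Only a NumPy backend is registered; another backend would register its own implementations against its array type.

```python
from functools import singledispatch

import numpy as np


@singledispatch
def dispatch_pca_fit(x, n_components):
    # Fallback when no backend is registered for this array type.
    raise NotImplementedError(f"no PCA fit backend for {type(x).__name__}")


@singledispatch
def dispatch_pca_transform(x, model):
    raise NotImplementedError(f"no PCA transform backend for {type(x).__name__}")


@dispatch_pca_fit.register(np.ndarray)
def _numpy_pca_fit(x, n_components):
    # x is assumed to have shape (n_variants, n_samples).
    mean = x.mean(axis=1, keepdims=True)
    _, s, vt = np.linalg.svd((x - mean).T, full_matrices=False)
    var = s ** 2
    # A plain dict stands in for a fitted model object (hypothetical).
    return {
        "mean": mean,
        "loadings": vt[:n_components],
        "evr": var[:n_components] / var.sum(),
    }


@dispatch_pca_transform.register(np.ndarray)
def _numpy_pca_transform(x, model):
    # Project centred samples onto the fitted components.
    return (x - model["mean"]).T @ model["loadings"].T


def pca(x, n_components=2):
    # Public entry point: fit then transform, each routed by array type.
    model = dispatch_pca_fit(x, n_components)
    coords = dispatch_pca_transform(x, model)
    return coords, model["loadings"], model["evr"]
```

A fused dispatch_pca_fit_transform() could be registered in the same way for backends where doing both steps in one pass is cheaper than composing the two.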

alimanfoo commented 5 years ago

Here's a gist with the transposition worked out so we don't have to transpose the input array (as scikit-allel version 1 does).
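
For readers without the gist to hand, here is one way the algebra can work out; this is a sketch under stated assumptions, not a copy of the gist. If the centred input xc (shape n_variants by n_samples) has SVD xc = U S Vt, then its transpose is V S Ut, so the sample coordinates are the columns of V scaled by S and the loadings are the rows of Ut. The function names and the n_components parameter below are hypothetical.

```python
import numpy as np


def pca_transposing(x, n_components=2):
    # Reference approach: transpose the input so rows are samples.
    xt = np.asarray(x, dtype=float).T
    xt = xt - xt.mean(axis=0)
    u, s, vt = np.linalg.svd(xt, full_matrices=False)
    return u[:, :n_components] * s[:n_components], vt[:n_components]


def pca_no_transpose(x, n_components=2):
    # Same decomposition, computed on the (n_variants, n_samples) array as given.
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=1, keepdims=True)  # centre each variant (row)
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    # xc.T == vt.T @ diag(s) @ u.T, so the roles of u and vt swap.
    coords = vt[:n_components].T * s[:n_components]
    loadings = u[:, :n_components].T
    return coords, loadings
```

The two functions should agree up to the sign of each component, since singular vectors are only defined up to sign.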