hredestig / pcaMethods

Perform PCA on data with missing values in R
GNU General Public License v2.0

WIP irlba method for sparse matrices #8

Open flying-sheep opened 5 years ago

flying-sheep commented 5 years ago

Needs docs, and a decision on whether this is the way to proceed or whether we need to make prep sparse-friendly.

Fixes #7

hredestig commented 5 years ago

Looks great so far! This is an old project and sadly lacks unit tests, but I will try it out over the weekend.

It's a very long time since I worked with sparse data, but I guess those who do already have good tools for it, so I wonder whether making prep sparse-friendly really adds value for anyone(?) I like your current solution.

flying-sheep commented 5 years ago

These days there’s a lot of sparse single-cell transcriptomics data, since current methods both produce huge amounts of data (e.g. 20k genes × 100k cells) and suffer from a lot of dropout (0 instead of small values).

Using PCA as a preprocessing step speeds up things and saves memory – if the PCA method can handle sparse data, that is.
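To illustrate why sparse handling matters here: subtracting column means explicitly turns almost every zero into a nonzero, so the centered matrix is dense. A minimal sketch of the alternative, implicit centering, is shown below in plain Python. The dict-per-row storage and the function names are assumptions made for illustration only; they are not part of pcaMethods or irlba.

```python
# Sketch only: a sparse matrix stored as one {column: value} dict per row.
# All names here are hypothetical and not taken from any package.

def column_means(rows, n_cols):
    """Column means computed from the nonzero entries only."""
    means = [0.0] * n_cols
    for row in rows:
        for j, x in row.items():
            means[j] += x
    return [m / len(rows) for m in means]

def centered_matvec(rows, means, v):
    """Compute (X - 1 * means^T) v without building the centered matrix.

    Uses the identity (X - 1 * means^T) v = X v - (means . v) * 1,
    so only the stored nonzeros of X are visited.
    """
    shift = sum(m * vj for m, vj in zip(means, v))
    return [sum(x * v[j] for j, x in row.items()) - shift for row in rows]

# Toy data: 4 observations, 3 variables, mostly zeros.
rows = [{0: 2.0}, {1: 4.0, 2: 1.0}, {}, {0: 2.0, 2: 3.0}]
means = column_means(rows, 3)
v = [1.0, 0.5, -1.0]
result = centered_matvec(rows, means, v)
```

An iterative SVD solver only needs products of this kind, which is why the centered matrix never has to be materialised.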

hredestig commented 5 years ago

After looking at this more carefully, I note that it is more complicated than it might first seem. Calling prep like you suggested isn't good, since the center and scale vectors are used later but are then not returned by prcomp_irlba. I fixed that (not entirely sure it's sparse-aware, but it is done the same way that irlba does it) in my irlba branch https://github.com/hredestig/pcaMethods/tree/irlba. But then I realized that fitted and predict must also be made sparse-aware :/

Wanna have a look at that?
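
As a rough illustration of why the center and scale vectors have to be kept around for prediction, the sketch below projects a new sparse observation onto existing loadings. It is plain Python under stated assumptions: the function names, the dict-based sparse row, and the toy numbers are all hypothetical, and this is not how pcaMethods or irlba implement predict.

```python
# Sketch only: scores for a new observation x are
#   score_k = sum_j ((x_j - center_j) / scale_j) * loadings[j][k]
# which splits into a sparse part (nonzeros of x) and a constant offset.

def component_offsets(center, scale, loadings):
    """Precompute offset_k = sum_j center_j / scale_j * loadings[j][k]."""
    n_comp = len(loadings[0])
    return [
        sum(c / s * w[k] for c, s, w in zip(center, scale, loadings))
        for k in range(n_comp)
    ]

def predict_scores(row, scale, loadings, offsets):
    """Project one sparse row ({column: value}) without densifying it."""
    scores = [-o for o in offsets]
    for j, x in row.items():
        for k in range(len(scores)):
            scores[k] += x / scale[j] * loadings[j][k]
    return scores

# Toy values (made up): 3 variables, 2 components.
center = [1.0, 2.0, 0.5]
scale = [2.0, 1.0, 0.5]
loadings = [[0.6, -0.8], [0.8, 0.6], [0.0, 0.0]]
offsets = component_offsets(center, scale, loadings)

new_row = {0: 4.0, 2: 1.0}  # variable 1 is an implicit zero
scores = predict_scores(new_row, scale, loadings, offsets)
```

Without the stored center and scale vectors the offset term cannot be computed, so scores for new data would not match those of the training data.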