scverse / scanpy

Single-cell analysis in Python. Scales to >1M cells.
https://scanpy.readthedocs.io
BSD 3-Clause "New" or "Revised" License
1.9k stars 597 forks source link

PCA with sparse data #393

Closed VolkerBergen closed 4 years ago

VolkerBergen commented 5 years ago

Right now we get different PCs when using sparse data (applying TruncatedSVD) compared to using dense data (applying PCA).

TruncatedSVD(X-X.mean(0)) would be equivalent to PCA(X). X-X.mean(0) would obviously not be sparse anymore, which is why it is currently implemented as TruncatedSVD(X). The first PC will be mainly representing the vector of means, thus be very different from zero-centered PCA. The following components would approximately resemble PCA. However, since all subsequent PCs are orthogonal to the first PC, we will never get to the exact solution. Hence, the PCs are questionable, in particular when the very first ones are quite misleading.

That's not desirable. I think we should obtain the same PCA representation regardless of the data type.

Don't we have to densify X at some point anyways, as we would have to compute X.dot(X.T). Thus it might be worth thinking of some EM approach?

Whatsoever, I think as long the data is manageable and fits into the RAM, we should just use the densified X.

Line 486 in preprocessing/simple I don't quite understand:

if zero_center is not None:
       zero_center = not issparse(adata_comp.X)

It doesn't depend on the actual value of the attribute zero_center anymore. Is that a bug, or what is the rationale behind this?

For now, we can change that into something like

zero_center = zero_center if zero_center is not None else False if issparse(adata.X) and adata.X.shape[0] > 1e4 else True
Koncopd commented 5 years ago

For pca of sparse matrices i think it should work this way 1) Create class for lazy evaluation of X - mean, i.e store X, mean separately and implement multiplication by some dense B as sparse.csr_matrix.dot(X, B) - mean.dot(B). 2) Pass instance of this class to randomized_svd

falexwolf commented 5 years ago

Most importantly: Other than stated in the docs, the default for zero_center is True and has been True since I believe, July 2017 (very early Scanpy, maybe 0.2..).

@VolkerBergen: I concur. One should obtain the same representation independent of the data type and that's what the default behavior of the function should give you.

@Koncopd: One can also talk about a proper implementation of PCA for sparse data, which I thought would require quite some custom code. Your solution seems like a really good solution if randomiced_svd is able to treat that lazy evaluation object efficiently.

@VolkerBergen: I've viewed TruncatedSVD as an alternative way of compressing the data. Of course, the first SVD component will then store all the information about the means. From component two on this alternative way should be similar to what you get from PCA, but yes, it's not equivalent... As I eventually didn't run into memory problems I never really investigated further... But I'm pretty sure that PCA is just one of 100 ways of compressing the data in a somewhat meaningful manner giving you somewhat meaningful results. That's already evident from the fact that all the autoencoder based latent space representations don't give you completely different results than PCA. My impression is that, in fact, the results are highly similar.

falexwolf commented 5 years ago

Ah, just saw that the docs are and have been correct. They state that the default is True. I'm just still not used to the fact that the default value is not displayed right next to the parameter value but only in the function header...

falexwolf commented 5 years ago

And, you're right: line 486 could be a bug. It should be zero_center is None and not zero_center is not None. Hm, @Koncopd, could it be that the function does the opposite as wished? Did this happen when introducing the chunked version a couple of months ago or was it present from the beginning? It would be quite a serious bug... And @VolkerBergen would be right in this observation... Damn.

falexwolf commented 5 years ago

The original line was

zero_center = zero_center if zero_center is not None else False if issparse(adata_comp.X) else True

which did the expected thing, @flying-sheep introduced the bug 22 days ago in https://github.com/theislab/scanpy/commit/ce10d02f58c3308b60c23c43a36949b6aeed3ea8.

Damn, I wouldn't have expected such a thing in a commit "improved docs". It went into release 1.3.4 and 1.3.5...

Of course, it's my fault. I should have written a test in the first place.

@Koncopd: can you write a test for PCA both for sparse and dense data?

Koncopd commented 5 years ago

Can't say i remember, but i don't see any reason for me to edit this block when i was adding the chunked computation...

Yes, i will.

falexwolf commented 5 years ago

I fixed the bug: https://github.com/theislab/scanpy/commit/15593d532fbaa696bf1ea328d1991d31b334e175.

And I'll immediately make a new release and put a warning on the webpage...

@Koncopd: Thank you for adding the tests!

falexwolf commented 5 years ago

Done with release 1.3.6 and the warning: https://github.com/theislab/scanpy/commit/35030d28bb4e1e4559449bfe41238523bee0e616

Koncopd commented 5 years ago

As for randomized_svd, looking at this it seems that is should work properly if a class for lazy evaluation inherits from standard sparse class and implements __mul__ and __rmul__.

VolkerBergen commented 5 years ago

Following @Koncopd 's idea, wouldn't it be sufficient to simply have line 340 in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py changed to

mu = M.mean(1).A1 if issparse(M) else M.mean(1)
B = safe_sparse_dot(Q.T, M) - safe_sparse_dot(Q.T, mu[:, None])
Koncopd commented 5 years ago

@VolkerBergen No, similar lines also should be changed in randomized_range_finder, which is used by randomized_svd function. Or the whole safe_sparse_dot function.

But copying the file extmath.py (or the part of it related to svd) and changing this lines would be sufficient, yes.

flying-sheep commented 5 years ago

which did the expected thing, @flying-sheep introduced the bug 22 days ago in ce10d02.

damn, the only thing I could have done wrong there…

It went into that commit because the previous code was too convoluted to understand, and I needed to understand that line to improve the docs! I ended up understanding it it but rewrote the line incorrectly. I’m sorry!

Did you add a test after 15593d5?

Koncopd commented 5 years ago

@VolkerBergen Also appropriately implemented power method (last section of this, for example) for svd should be fine.

falexwolf commented 5 years ago

@flying-sheep no worries! We'll steadily increase test coverage. I assume that almost no one should have run into the bug in the past 22 days. Among those that updated their version, only very few will have run the PCA with sparse data...

@Koncopd, I'm very happy if you move forward with a proper sparse implementation of PCA! :smile:

Koncopd commented 5 years ago

Ok, it will be here https://github.com/theislab/scanpy/tree/pca_sparse

ivirshup commented 4 years ago

Closed by #1066