Closed — VolkerBergen closed this 4 years ago
For PCA of sparse matrices, I think it should work this way:
1) Create a class for lazy evaluation of `X - mean`, i.e. store `X` and `mean` separately and implement multiplication by some dense `B` as `sparse.csr_matrix.dot(X, B) - mean.dot(B)`.
2) Pass an instance of this class to `randomized_svd`.
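A minimal sketch of such a lazy-evaluation wrapper (the class name `CenteredImplicit` and its interface are made up for illustration, not Scanpy code): it stores `X` and the column means separately, and a product with a dense `B` never forms the dense centered matrix.

```python
import numpy as np
from scipy import sparse

class CenteredImplicit:
    """Lazily represents X - mean (column means) without densifying X."""

    def __init__(self, X):
        self.X = sparse.csr_matrix(X).astype(np.float64)
        self.mean = np.asarray(self.X.mean(axis=0)).ravel()

    def dot(self, B):
        # (X - 1 mean^T) @ B == X @ B - mean @ B, broadcast over rows,
        # i.e. one sparse product plus a rank-one correction
        return self.X @ B - self.mean @ B
```

Multiplying by `B` then costs one sparse matvec plus a rank-one correction, which is exactly the kind of product randomized SVD performs repeatedly.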
Most importantly: contrary to what the docs state, the default for `zero_center` is `True`, and it has been `True` since, I believe, July 2017 (very early Scanpy, maybe 0.2).
@VolkerBergen: I concur. One should obtain the same representation independent of the data type and that's what the default behavior of the function should give you.
@Koncopd: One can also talk about a proper implementation of PCA for sparse data, which I thought would require quite some custom code. Your solution seems like a really good one if `randomized_svd` is able to treat that lazy-evaluation object efficiently.
@VolkerBergen: I've viewed `TruncatedSVD` as an alternative way of compressing the data. Of course, the first SVD component will then store all the information about the means. From component two on, this alternative should be similar to what you get from PCA, but yes, it's not equivalent... As I eventually didn't run into memory problems, I never really investigated further... But I'm pretty sure that PCA is just one of a hundred ways of compressing the data in a somewhat meaningful manner that give you somewhat meaningful results. That's already evident from the fact that the autoencoder-based latent-space representations don't give you completely different results than PCA. My impression is that, in fact, the results are highly similar.
Ah, just saw that the docs are and have been correct: they state that the default is `True`. I'm just still not used to the fact that the default value is not displayed right next to the parameter description but only in the function header...
And, you're right: line 486 could be a bug. It should be `zero_center is None` and not `zero_center is not None`. Hm, @Koncopd, could it be that the function does the opposite of what's intended? Did this happen when introducing the `chunked` version a couple of months ago, or was it present from the beginning? It would be quite a serious bug... and @VolkerBergen would be right in this observation... Damn.
The original line was `zero_center = zero_center if zero_center is not None else False if issparse(adata_comp.X) else True`, which did the expected thing; @flying-sheep introduced the bug 22 days ago in https://github.com/theislab/scanpy/commit/ce10d02f58c3308b60c23c43a36949b6aeed3ea8.
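For reference, the defaulting logic that nested ternary implements can be written out explicitly (the helper name `resolve_zero_center` is made up for illustration; it mirrors the intended behavior, not any Scanpy function):

```python
from scipy.sparse import issparse

def resolve_zero_center(zero_center, X):
    """Intended behavior of the original line: keep an explicit user
    choice; otherwise default to False for sparse input (to avoid
    densifying) and True for dense input."""
    if zero_center is not None:
        return zero_center
    return not issparse(X)
```

Flipping the `is not None` check to `is None`, as the buggy rewrite effectively did, makes an explicit user choice get ignored and the default get applied in the wrong branch.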
Damn, I wouldn't have expected such a thing in a commit titled "improved docs". It went into releases 1.3.4 and 1.3.5...
Of course, it's my fault. I should have written a test in the first place.
@Koncopd: can you write a test for PCA both for sparse and dense data?
Can't say I remember, but I don't see any reason why I would have edited this block when I was adding the chunked computation...
Yes, I will.
I fixed the bug: https://github.com/theislab/scanpy/commit/15593d532fbaa696bf1ea328d1991d31b334e175.
And I'll immediately make a new release and put a warning on the webpage...
@Koncopd: Thank you for adding the tests!
Done with release 1.3.6 and the warning: https://github.com/theislab/scanpy/commit/35030d28bb4e1e4559449bfe41238523bee0e616
As for `randomized_svd`, looking at this it seems that it should work properly if a class for lazy evaluation inherits from the standard sparse class and implements `__mul__` and `__rmul__`.
Following @Koncopd's idea, wouldn't it be sufficient to simply have line 340 in https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/utils/extmath.py changed to
`mu = M.mean(1).A1 if issparse(M) else M.mean(1)`
`B = safe_sparse_dot(Q.T, M) - safe_sparse_dot(Q.T, mu[:, None])`
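The algebra behind this proposed patch can be checked numerically without touching scikit-learn: centering `M` by its row means `mu` and then projecting with `Q.T` equals projecting the sparse `M` as-is and subtracting a rank-one term (a sketch with synthetic data; variable names follow the snippet above):

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)
M = sparse.random(15, 40, density=0.2, format="csr", random_state=0)
Q, _ = np.linalg.qr(rng.standard_normal((15, 5)))  # orthonormal test basis

mu = np.asarray(M.mean(axis=1)).ravel()            # row means of M
lhs = Q.T @ (M.toarray() - mu[:, None])            # explicit centering (dense)
rhs = (M.T @ Q).T - (Q.T @ mu)[:, None]            # implicit centering (sparse-friendly)
```

`rhs` only ever multiplies the sparse `M` by dense factors, which is the whole point of the patch.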
@VolkerBergen
No, similar lines would also have to be changed in `randomized_range_finder`, which is used by the `randomized_svd` function, or else the whole `safe_sparse_dot` function.
But copying the file extmath.py (or the part of it related to SVD) and changing these lines would be sufficient, yes.
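Rather than patching a copy of extmath.py, a similar effect can be had with `scipy.sparse.linalg.LinearOperator`, whose `matvec`/`rmatvec` can carry the mean correction implicitly. A hedged sketch (the function `pca_sparse` below is illustrative, not the eventual Scanpy implementation, and uses `svds` instead of `randomized_svd`):

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import LinearOperator, svds

def pca_sparse(X, n_comps):
    """PCA of sparse X via SVD of the implicitly centered matrix."""
    X = sparse.csr_matrix(X).astype(np.float64)
    n, p = X.shape
    mean = np.asarray(X.mean(axis=0)).ravel()
    ones = np.ones(n)
    # A = X - 1 mean^T, never formed explicitly: each product is a
    # sparse matvec plus a rank-one correction
    A = LinearOperator(
        (n, p),
        matvec=lambda v: X @ np.ravel(v) - (mean @ np.ravel(v)) * ones,
        rmatvec=lambda v: X.T @ np.ravel(v) - mean * np.ravel(v).sum(),
        dtype=np.float64,
    )
    U, s, Vt = svds(A, k=n_comps)
    order = np.argsort(-s)  # svds returns singular values in ascending order
    return U[:, order] * s[order], Vt[order]
```

The returned scores and components then match those of zero-centered PCA on the densified matrix, up to sign.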
> which did the expected thing, @flying-sheep introduced the bug 22 days ago in ce10d02.

damn, the only thing I could have done wrong there…
It went into that commit because the previous code was too convoluted to understand, and I needed to understand that line to improve the docs! I ended up understanding it but rewrote the line incorrectly. I’m sorry!
Did you add a test after 15593d5?
@VolkerBergen An appropriately implemented power method (last section of this, for example) for SVD should also be fine.
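The power method combines naturally with the implicit centering above, since each iteration only needs products with `A = X - 1 mean^T`. A sketch for the leading singular pair (the function name `top_singular_pair` is made up; no deflation for further components is shown):

```python
import numpy as np
from scipy import sparse

def top_singular_pair(X, n_iter=100, seed=0):
    """Power iteration for the leading singular pair of A = X - 1 mean^T,
    using only sparse products plus rank-one mean corrections (sketch)."""
    X = sparse.csr_matrix(X).astype(np.float64)
    mean = np.asarray(X.mean(axis=0)).ravel()
    n = X.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(X.shape[1])
    v /= np.linalg.norm(v)
    s = 0.0
    for _ in range(n_iter):
        u = X @ v - (mean @ v) * np.ones(n)  # u = A v
        u /= np.linalg.norm(u)
        v = X.T @ u - mean * u.sum()         # v = A^T u
        s = np.linalg.norm(v)                # converges to the top singular value
        v /= s
    return u, s, v
```

Convergence speed depends on the gap between the first and second singular values, as usual for power iteration.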
@flying-sheep no worries! We'll steadily increase test coverage. I assume that almost no one should have run into the bug in the past 22 days. Among those that updated their version, only very few will have run the PCA with sparse data...
@Koncopd, I'm very happy if you move forward with a proper sparse implementation of PCA! :smile:
Ok, it will be here https://github.com/theislab/scanpy/tree/pca_sparse
Closed by #1066
Right now we get different PCs when using sparse data (applying `TruncatedSVD`) compared to using dense data (applying `PCA`).

`TruncatedSVD(X - X.mean(0))` would be equivalent to `PCA(X)`. `X - X.mean(0)` would obviously not be sparse anymore, which is why it is currently implemented as `TruncatedSVD(X)`. The first PC will then mainly represent the vector of means and thus be very different from zero-centered PCA. The following components will approximately resemble PCA; however, since all subsequent PCs are orthogonal to the first PC, we will never get to the exact solution. Hence, the PCs are questionable, in particular since the very first ones are quite misleading.

That's not desirable. I think we should obtain the same PCA representation regardless of the data type.
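The effect described above is easy to reproduce with plain SVDs on synthetic data (`TruncatedSVD(X)` corresponds to the SVD of the uncentered `X`; the comparison below is a demonstration, not Scanpy code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 20))   # entries in [0, 1): a large mean offset
mean = X.mean(0)

# TruncatedSVD(X) ~ SVD of uncentered X; PCA(X) ~ SVD of X - mean
_, s_unc, Vt_unc = np.linalg.svd(X, full_matrices=False)
_, s_cen, Vt_cen = np.linalg.svd(X - mean, full_matrices=False)

# The leading uncentered component is almost parallel to the mean vector,
# so it mostly encodes the means rather than the variance structure.
cos = abs(Vt_unc[0] @ mean) / np.linalg.norm(mean)
```

Here `cos` comes out close to 1, and the top uncentered singular value dwarfs the top centered one, confirming that the first "PC" of `TruncatedSVD(X)` is dominated by the means.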
Don't we have to densify `X` at some point anyway, as we would have to compute `X.dot(X.T)`? Thus it might be worth thinking of some EM approach. Whatever the case, I think as long as the data is manageable and fits into RAM, we should just use the densified `X`.

Line 486 in preprocessing/simple I don't quite understand: it doesn't depend on the actual value of the attribute `zero_center` anymore. Is that a bug, or what is the rationale behind this? For now, we can change that into something like
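As a side note on the densification question above: the Gram or covariance matrix of sparse `X` can also be formed without ever densifying, since centering only contributes a rank-one correction (the function name `sparse_cov` is made up for illustration):

```python
import numpy as np
from scipy import sparse

def sparse_cov(X):
    """Feature covariance computed from sparse X directly, using
    sum_i (x_i - m)(x_i - m)^T = X^T X - n m m^T."""
    X = sparse.csr_matrix(X).astype(np.float64)
    n = X.shape[0]
    mean = np.asarray(X.mean(axis=0)).ravel()
    gram = (X.T @ X).toarray()  # p x p, dense but small for p << n
    return (gram - n * np.outer(mean, mean)) / (n - 1)
```

Only the small `p x p` Gram matrix is densified, never the `n x p` data matrix.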