Hi! Can you motivate this? What is it good for, why would we want it, would it be something super essential or something for scanpy.external?
It's a more central preprocessing method for selecting PCs for further analysis. It's an automated alternative to just using the elbow of the PC variance explained plot.
We’re just using n_pcs=50 by default, right? Not even the elbow…
yes... so this method would augment scanpy, but it's probably not crucial to make an analysis script work.
I’m a big fan of good heuristics providing defaults. Would it be a useful default for that or where would it be used?
Tbh I haven't compared enough to make a call on whether it should be default. It is definitely a more sophisticated approach than just taking top 50. But top 50 has worked okay for us so far. I would say: include it if it's not too much work. Let's discuss defaults later.
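For anyone skimming this later, the status quo being discussed (top 50 PCs plus eyeballing the elbow) looks roughly like this in scanpy; this is just a sketch and assumes `adata` is a normalized, log-transformed AnnData:

```python
import scanpy as sc

# assumes `adata` is a normalized, log-transformed AnnData (hypothetical example object)
sc.pp.scale(adata, max_value=10)
sc.pp.pca(adata, n_comps=50, svd_solver="arpack")

# eyeball the elbow of the variance-explained curve
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)

# the ratios themselves live here if you want to pick a cutoff programmatically
print(adata.uns["pca"]["variance_ratio"][:10])
```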
There is another related paper which focuses on how to choose the number of PCs and clusters. I have tried it, but failed to tell whether it makes sense: https://academic.oup.com/gigascience/article/8/10/giz121/5579995
This sounds interesting. If the performance is acceptable, it might make a good addition.
It's more along the lines of selecting the number of PCs for denoising, but MCV (https://github.com/czbiohub/molecular-cross-validation) is also interesting here. It helps with hyperparameter selection based on the reconstruction loss on the held-out "molecules".
Yeah, I assumed it might be useful for other algorithms that use a variable number of PCs as input as well.
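For context, the rough idea behind MCV (this is my own toy sketch of the approach, not the czbiohub package; the function name and normalization are made up for illustration): split each cell's molecules binomially into two halves, fit PCA with k components on one half, and pick the k that minimizes the reconstruction error on the held-out half.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

def mcv_pc_selection(counts, ks=range(2, 51, 2), split=0.5):
    """Toy molecular cross-validation; `counts` is a dense cells x genes array of raw counts."""
    train = rng.binomial(counts.astype(int), split)   # "training" molecules
    test = counts - train                             # held-out molecules
    # simple normalization; the real method is more careful about scaling
    train_n = np.log1p(train / split)
    test_n = np.log1p(test / (1 - split))
    losses = []
    for k in ks:
        pca = PCA(n_components=k).fit(train_n)
        recon = pca.inverse_transform(pca.transform(train_n))
        losses.append(((recon - test_n) ** 2).mean())  # reconstruction loss on the held-out half
    return list(ks), losses
```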
From the methods of the paper mentioned by @wangjiawen2013:
> our results were not sensitive to the default values of nPC_max
which reinforces my thinking that overshooting the number of PCs isn't a problem for typical clustering and visualization purposes. For interpreting the variable loadings, some selection might be helpful. I'd definitely be interested in having methods like these for use with other latent variable methods.
Also that MCV paper's Figure 2b should probably have the APOE axis share a scale, maybe by removing the cell that has ~twice the APOE log expression of any others. I'd be interested in seeing how different the plots look after that.
Guess there is no magic for deciding the right number of PCs.
I recently had a discussion with @LisaSikkema about how much you can overshoot here. The suggestion was made to use 100 PCs. Can we robustly compute that many or do the numerical methods break down when too little variance is represented by a PC? I recall @falexwolf mentioning this at some point.
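No definitive answer on the numerics from me, but one practical check: compute 100 components with the arpack solver and look at how small the trailing variance ratios actually are (a sketch; assumes `adata` is already preprocessed for PCA):

```python
import scanpy as sc

# assumes `adata` is already normalized, log-transformed and scaled
sc.pp.pca(adata, n_comps=100, svd_solver="arpack")
vr = adata.uns["pca"]["variance_ratio"]

print("cumulative variance, PCs 1-50:  ", vr[:50].sum())
print("cumulative variance, PCs 51-100:", vr[50:].sum())
print("smallest single variance ratio: ", vr.min())
```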
Another solution is to use Seurat to do JackStraw and clustering, and then import the results into Scanpy for other analyses.
This is one way where Scanpy and Seurat diverge strikingly. Even the 'basic' Seurat tutorial uses JackStraw and has a whole section on "Determine the ‘dimensionality’ of the dataset", while the Scanpy tutorial doesn't mention it. Can anyone explain/show literature on if/why the scanpy default of 50 PCs works well? Is it correct to say that each embedded PC is given equal weight in the neighbourhood graph?
@chris-rands my two cents here: Using too many PCs (50 or 100) might not be such a good idea, because most PCs above the first 15-20 (rough figure) are likely to represent non-biological variability (e.g., batch effects), so their inclusion in the analysis might lead to the identification of clusters that are not biologically relevant but rather technical noise. That's why it is recommended to use either a JackStraw plot or an elbow plot to identify the optimal number of PCs, to avoid including PCs that correspond to technical variation rather than biological heterogeneity.
Therefore, overshooting the number of PCs used from 10 to 20 is not that big a deal, but using 50 is NOT ideal by any means.
@chris-rands, that section of the Seurat tutorial also ends in these points:
- We encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.
- We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect results.
> Can anyone explain/show literature on if/why the scanpy default of 50 PCs works well?
I think there is enough to show that PCA works well. I'm not sure I can show you a paper that says either choosing a high cutoff or using jackstraw/elbow plots gives better downstream results. I'd note that the paper cited in the Seurat tutorial doesn't seem to evaluate this.
@wolf5996, I'm not sure I agree with your point that later PCs are more likely to contain non-biological variability. I don't think that a component which explains more variability in the dataset would necessarily represent biological variability. As an example, if we have a dataset with two evenly sized batches and a rare cell type which makes up ~1% of the population, wouldn't a PC representing the batch explain much more variability than a PC corresponding to the rare cell type?
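To make that concrete, here's a throwaway simulation along those lines (all numbers invented purely for illustration): two evenly sized batches with a shift on some genes, plus a ~1% subpopulation with its own signature, then check which PC each label correlates with most strongly.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n_cells, n_genes = 2000, 200

X = rng.normal(size=(n_cells, n_genes))
batch = np.repeat([0, 1], n_cells // 2)             # two evenly sized batches
X[:, :20] += 3 * batch[:, None]                     # batch shift on 20 genes
rare = np.zeros(n_cells, dtype=bool)
rare[rng.choice(n_cells, 20, replace=False)] = True # ~1% rare cell type
X[rare, 20:40] += 3                                 # rare-type signature on other genes

scores = PCA(n_components=20).fit_transform(X)
for name, label in [("batch", batch), ("rare type", rare)]:
    corr = [abs(np.corrcoef(scores[:, i], label)[0, 1]) for i in range(20)]
    print(f"{name}: best-correlated PC = {int(np.argmax(corr)) + 1}")
```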
Anecdotally, I can say batch effects can show up in high principal components.
Just to add to @ivirshup's points. There are several examples of lower PCs containing batch effects rather than higher PCs. I've seen this many times, but this has also been reported for e.g., ATAC data in the SCALE paper.
> Is it correct to say that each embedded PC is given equal weight in the neighbourhood graph?
I'm not entirely sure, but I don't think you can say this... higher PCs that explain less variance will contribute less to the total variance if you use them as input to e.g., UMAP, t-SNE, or a kNN graph-building algorithm. This is because the variance of the PC scores is proportional to the variance explained by that component (unless a rescaling is applied in scanpy by default?). Thus, the contribution of higher PCs to the distance calculations will be smaller and less discriminative between points.
Putting these two aspects together, you can see exactly why you need batch integration methods. These effects affect leading PCs, and therefore contribute a lot to any distance calculation based on an embedding. You can't just remove the effects by filtering for only leading PCs.
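That's easy to check directly: the variance of the scores along each PC equals the variance that PC explains, so squared Euclidean distances on the PC matrix are dominated by the leading components. A quick sanity check, assuming `adata` already has PCA results stored:

```python
import numpy as np

X_pca = adata.obsm["X_pca"]                 # cells x n_pcs score matrix
per_pc_var = X_pca.var(axis=0, ddof=1)

# per-PC score variance matches scanpy's stored variance for each component
print(np.allclose(per_pc_var, adata.uns["pca"]["variance"]))

# average share of squared Euclidean distance contributed by each PC
print(per_pc_var / per_pc_var.sum())
```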
Hi,
Has anyone tried implementing jackstraw in scanpy using AnnData? If anyone has written code to find the significant PCs in scanpy, please do share, or any guide on how to perform it would be greatly appreciated! Thanks so much
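As far as I know there's no built-in jackstraw in scanpy. Here's a rough sketch of the idea on an AnnData, in case it helps as a starting point (my own simplified version, not the Seurat/Chung & Storey implementation; `jackstraw_sketch` is a made-up name, and it assumes dense, scaled data in adata.X):

```python
import numpy as np
from sklearn.decomposition import PCA

def jackstraw_sketch(adata, n_comps=20, n_iter=100, frac=0.01, random_state=0):
    """Rough jackstraw: permute a small fraction of genes, re-run PCA, and build a
    null distribution of gene-PC loadings to compare the observed loadings against."""
    rng = np.random.default_rng(random_state)
    X = np.asarray(adata.X)                  # assumes dense, scaled cells x genes
    n_genes = X.shape[1]
    n_perm = max(1, int(frac * n_genes))

    observed = PCA(n_components=n_comps).fit(X).components_   # n_comps x genes
    null = []
    for _ in range(n_iter):
        Xp = X.copy()
        idx = rng.choice(n_genes, n_perm, replace=False)
        for j in idx:                        # permute the selected genes across cells
            Xp[:, j] = rng.permutation(Xp[:, j])
        comp = PCA(n_components=n_comps).fit(Xp).components_
        null.append(np.abs(comp[:, idx]))    # loadings of the permuted ("null") genes
    null = np.concatenate(null, axis=1)      # n_comps x (n_iter * n_perm)

    # empirical p-value per gene and PC: how often a null loading exceeds the observed one
    pvals = (null[:, None, :] >= np.abs(observed)[:, :, None]).mean(axis=2)
    return pvals                             # n_comps x genes
```

From the returned gene-level p-values you could then ask, per PC, whether significant genes are enriched, which is roughly what Seurat's JackStraw scoring does. Treat this purely as a starting point, not a validated implementation.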