Open LTLA opened 4 years ago
I've just made some updates to the package which enable (1) row subset from full sce and (4) user-provided column weights:
(1) The corral_sce function now accepts a subset_row
parameter which will preserve the original sce and compute dim reduction with provided subset.
(4) col.w
parameter can now be used to set column weights with the corral functions.
In correspondence analysis the row and column weights are used to determine the expected value for each cell/gene combination. It seems reasonable to use the size factors rather than just HVG gene counts to set the expectations for the cells. I checked for a couple test datasets and the results are comparable, though I could imagine scenarios where they would differ.
These are available now in the github repo, and should be in bioconductor soon once I finish setting up git.
Regarding (2) and (3), thanks for raising these points, and for the suggestions. We're focused on putting the manuscript together at the moment, and once that's done, planning to update the package to enable larger datasets from HDF5 files, and at that point incorporate DelayedArray and BiocSingular. I am new to bioconductor, and would appreciate any other thoughts/suggestions re: ways to optimize performance.
Correspondence analysis is appropriate for use on counts directly, so it does sidestep the log transform issue. The output from corral is similar to glmpca, and @aedin put together a short explainer on correspondence analysis with that comparison here: https://aedin.github.io/PCAworkshop/articles/c_COA.html
Great, thanks. Once it gets onto BioC-devel, I will give it a spin to see whether it fixes the log-transform issue; or, if you're feeling up to it, you could try doing it yourself, and if it does work well, I'll insert it into the chapter right away.
The updates are now in bioc-devel!
Excellent. I'll try it out tomorrow evening, depending on the success of our latest attempt to build the book.
Well, here's how it looks:
You can see how the uninteresting library size effect on PC2 is eliminated by corral.
However, this was quite unhappy:
sce.8qc <- corral_sce(sce.8qc, subset_row=hvgs.8qc, col.w=sizeFactors(sce.8qc))
## Error in sp_mat * sqrt(row.w) :
## length of 2nd arg does not match dimension of first
One also wonders whether col.w=sizeFactors(sce)
could perhaps be set as the default for corral_sce
.
I added a section to the book about this in Bioconductor/OrchestratingSingleCellAnalysis#38, you may wish to read it to see whether I accurately described the strengths and weakness of CA over PCA.
How does it compare to standard corral col.w. Is it dependent on the extent of the subset?
Sent from my iPhone
On Aug 15, 2020, at 4:07 AM, Aaron Lun notifications@github.com wrote:
External Email - Use Caution
Well, here's how it looks:
You can see how the library size effect is now eliminated.
However, this was quite unhappy:
sce.8qc <- corral_sce(sce.8qc, subset_row=hvgs.8qc, col.w=sizeFactors(sce.8qc))
Error in sp_mat * sqrt(row.w) :
length of 2nd arg does not match dimension of first
One wonders whether col.w=sizeFactors(sce) could perhaps be set as the default for corral_sce.
I added a section to the book about this in Bioconductor/OrchestratingSingleCellAnalysis#38, you may wish to read it to see whether I accurately described the strengths and weakness of CA over PCA.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.
Thanks for flagging the col.w issue, it should be corrected in the latest version now. Apologies for the inconvenience and the delayed response.
I read through the passage you drafted, and I agree it's an accurate description, thanks. One bit I might suggest adding to the part where you showed the overdispersed gene dominating the first PC, would be to include a plot of the gene embeddings (PCv) on pc1 (or a biplot) to highlight relationship between the gene/cell embeddings and to illustrate another approach for diagnosing such an issue.
Regarding sizeFactors as default col.w - I'm a little wary of setting that as the default as any given SCE may not necessarily have the slot filled, and I think it could be confusing to have different behavior from the base method (on a matrix). I'll add a mention to the argument descriptions suggesting the usage of sizeFactors if they are available in the SCE.
@aedin - I believe the output shown above is using standard corral col.w, as there was a bug in the corral_sce function in my last update.
One bit I might suggest adding to the part where you showed the overdispersed gene dominating the first PC, would be to include a plot of the gene embeddings (PCv) on pc1 (or a biplot) to highlight relationship between the gene/cell embeddings and to illustrate another approach for diagnosing such an issue.
Would this be better in your vignette? As you can see, I didn't make that particular code chunk visible (it's all commented out) because I didn't want to show a lot of simulated data in the book; it's mostly for myself to remember why I said it in the first place. I can use Biocpkg()
to link to specific sections in your vignette if you give me some easy-to-remember anchors.
It looks great, thanks!
Re what I mentioned before - I missed the echo/eval=F in that code chunk so I didn't realize that the simulated plot is not included. Since it's not, I agree this doesn't belong here. I'll send you the anchor names once I get around to adding it to the vignette.
To inform this discussion, I'll use the Zeisel dataset as an example.
Setting up the object
```r #--- loading ---# library(scRNAseq) sce.zeisel <- ZeiselBrainData() library(scater) sce.zeisel <- aggregateAcrossFeatures(sce.zeisel, id=sub("_loc[0-9]+$", "", rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL") #--- quality-control ---# stats <- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType=="mito")) qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent", "subsets_Mt_percent")) sce.zeisel <- sce.zeisel[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters <- quickCluster(sce.zeisel) sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel <- logNormCounts(sce.zeisel) #--- variance-modelling ---# dec.zeisel <- modelGeneVarWithSpikes(sce.zeisel, "ERCC") top.hvgs <- getTopHVGs(dec.zeisel, prop=0.1) ```There's a couple of points here:
subset.row=
argument would be useful to enable me to docorral_sce(sce.zeisel, subset.row=top.hvgs)
and return the full SCE but with the reduced dimensions computed from only the subset.irlba
,rsvd
andsvd
with a single parameter. Also provides some support for parallel processing during approximate SVD.corral.R
L71 will break sparsity; you might consider adding an option to wrap it in aDelayedArray
before subtraction so that large sparse matrices are never fully expanded.top.hvgs
. This has the effect of yielding a zero weight for some cells in which my HVGs are silent. Would the underlying procedure still be valid if I pass in size factors as the column weights?My actual aim is to try to see if corral can avoid the problems with PCA on the log-values mentioned here; I'd be interested in knowing if correspondence analysis provides any advantage for that scenario.