laurenhsu1 / corral

Dimensionality reduction and batch integration methods for single cell data
16 stars 2 forks source link

Running corral on the Zeisel dataset #2

Open LTLA opened 4 years ago

LTLA commented 4 years ago

To inform this discussion, I'll use the Zeisel dataset as an example.

Setting up the object ```r #--- loading ---# library(scRNAseq) sce.zeisel <- ZeiselBrainData() library(scater) sce.zeisel <- aggregateAcrossFeatures(sce.zeisel, id=sub("_loc[0-9]+$", "", rownames(sce.zeisel))) #--- gene-annotation ---# library(org.Mm.eg.db) rowData(sce.zeisel)$Ensembl <- mapIds(org.Mm.eg.db, keys=rownames(sce.zeisel), keytype="SYMBOL", column="ENSEMBL") #--- quality-control ---# stats <- perCellQCMetrics(sce.zeisel, subsets=list( Mt=rowData(sce.zeisel)$featureType=="mito")) qc <- quickPerCellQC(stats, percent_subsets=c("altexps_ERCC_percent", "subsets_Mt_percent")) sce.zeisel <- sce.zeisel[,!qc$discard] #--- normalization ---# library(scran) set.seed(1000) clusters <- quickCluster(sce.zeisel) sce.zeisel <- computeSumFactors(sce.zeisel, cluster=clusters) sce.zeisel <- logNormCounts(sce.zeisel) #--- variance-modelling ---# dec.zeisel <- modelGeneVarWithSpikes(sce.zeisel, "ERCC") top.hvgs <- getTopHVGs(dec.zeisel, prop=0.1) ```
out <- corral_sce(sce.zeisel[top.hvgs,])

There's a couple of points here:

My actual aim is to try to see if corral can avoid the problems with PCA on the log-values mentioned here; I'd be interested in knowing if correspondence analysis provides any advantage for that scenario.

laurenhsu1 commented 4 years ago

I've just made some updates to the package which enable (1) row subset from full sce and (4) user-provided column weights: (1) The corral_sce function now accepts a subset_row parameter which will preserve the original sce and compute dim reduction with provided subset.  (4) col.w parameter can now be used to set column weights with the corral functions. In correspondence analysis the row and column weights are used to determine the expected value for each cell/gene combination. It seems reasonable to use the size factors rather than just HVG gene counts to set the expectations for the cells. I checked for a couple test datasets and the results are comparable, though I could imagine scenarios where they would differ.

These are available now in the github repo, and should be in bioconductor soon once I finish setting up git.

Regarding (2) and (3), thanks for raising these points, and for the suggestions. We're focused on putting the manuscript together at the moment, and once that's done, planning to update the package to enable larger datasets from HDF5 files, and at that point incorporate DelayedArray and BiocSingular. I am new to bioconductor, and would appreciate any other thoughts/suggestions re: ways to optimize performance.

Correspondence analysis is appropriate for use on counts directly, so it does sidestep the log transform issue. The output from corral is similar to glmpca, and @aedin put together a short explainer on correspondence analysis with that comparison here: https://aedin.github.io/PCAworkshop/articles/c_COA.html 

LTLA commented 4 years ago

Great, thanks. Once it gets onto BioC-devel, I will give it a spin to see whether it fixes the log-transform issue; or, if you're feeling up to it, you could try doing it yourself, and if it does work well, I'll insert it into the chapter right away.

laurenhsu1 commented 4 years ago

The updates are now in bioc-devel!

LTLA commented 4 years ago

Excellent. I'll try it out tomorrow evening, depending on the success of our latest attempt to build the book.

LTLA commented 4 years ago

Well, here's how it looks:

test

You can see how the uninteresting library size effect on PC2 is eliminated by corral.

However, this was quite unhappy:

sce.8qc <- corral_sce(sce.8qc, subset_row=hvgs.8qc, col.w=sizeFactors(sce.8qc))
## Error in sp_mat * sqrt(row.w) : 
##   length of 2nd arg does not match dimension of first

One also wonders whether col.w=sizeFactors(sce) could perhaps be set as the default for corral_sce.

I added a section to the book about this in Bioconductor/OrchestratingSingleCellAnalysis#38, you may wish to read it to see whether I accurately described the strengths and weakness of CA over PCA.

aedin commented 4 years ago

How does it compare to standard corral col.w. Is it dependent on the extent of the subset?

Sent from my iPhone

On Aug 15, 2020, at 4:07 AM, Aaron Lun notifications@github.com wrote:

 External Email - Use Caution

Well, here's how it looks:

You can see how the library size effect is now eliminated.

However, this was quite unhappy:

sce.8qc <- corral_sce(sce.8qc, subset_row=hvgs.8qc, col.w=sizeFactors(sce.8qc))

Error in sp_mat * sqrt(row.w) :

length of 2nd arg does not match dimension of first

One wonders whether col.w=sizeFactors(sce) could perhaps be set as the default for corral_sce.

I added a section to the book about this in Bioconductor/OrchestratingSingleCellAnalysis#38, you may wish to read it to see whether I accurately described the strengths and weakness of CA over PCA.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

laurenhsu1 commented 4 years ago

Thanks for flagging the col.w issue, it should be corrected in the latest version now. Apologies for the inconvenience and the delayed response.

I read through the passage you drafted, and I agree it's an accurate description, thanks. One bit I might suggest adding to the part where you showed the overdispersed gene dominating the first PC, would be to include a plot of the gene embeddings (PCv) on pc1 (or a biplot) to highlight relationship between the gene/cell embeddings and to illustrate another approach for diagnosing such an issue.

Regarding sizeFactors as default col.w - I'm a little wary of setting that as the default as any given SCE may not necessarily have the slot filled, and I think it could be confusing to have different behavior from the base method (on a matrix). I'll add a mention to the argument descriptions suggesting the usage of sizeFactors if they are available in the SCE.

@aedin - I believe the output shown above is using standard corral col.w, as there was a bug in the corral_sce function in my last update. 

LTLA commented 4 years ago

Behold.

One bit I might suggest adding to the part where you showed the overdispersed gene dominating the first PC, would be to include a plot of the gene embeddings (PCv) on pc1 (or a biplot) to highlight relationship between the gene/cell embeddings and to illustrate another approach for diagnosing such an issue.

Would this be better in your vignette? As you can see, I didn't make that particular code chunk visible (it's all commented out) because I didn't want to show a lot of simulated data in the book; it's mostly for myself to remember why I said it in the first place. I can use Biocpkg() to link to specific sections in your vignette if you give me some easy-to-remember anchors.

laurenhsu1 commented 4 years ago

It looks great, thanks!

Re what I mentioned before - I missed the echo/eval=F in that code chunk so I didn't realize that the simulated plot is not included. Since it's not, I agree this doesn't belong here. I'll send you the anchor names once I get around to adding it to the vignette.