satijalab / sctransform

R package for modeling single cell UMI expression data using regularized negative binomial regression
GNU General Public License v3.0
211 stars 33 forks source link

Question about Reproducing a Plot in the Paper Introducing the SCTransform (Hafemeister, C., & Satija, R.) #34

Closed sainadfensi closed 4 years ago

sainadfensi commented 5 years ago

Dear Team,

I am learning the single-cell RNA-seq with the Seurat package every day. Thank you for developing the sctransform to provide a novel statistical technique for normalization.

In the paper[1] about the sctransform, multiple statistical methods are applied to show the convincing effectiveness with loads of plots. In particular, I am interested in the figures in the paper which show the relationship between gene expression and the cell sequencing depth (Fig 1C, D and Fig 3A in the paper). However, when I try to repeat the plot by myself, the result turns out to be different. I believe that there must be some misunderstandings in the steps of plotting the figure. Therefore, it would be so kind if you could help me to go into details on the steps of plotting the trends in the data before and after normalization (Methods 4.4 in the paper).

The method described in the paper is: 4.4 Trends in the data before and after normalization [1] We grouped genes into six bins based on log10-transformed mean UMI count, using bins of equal width. To show the overall trends in the data, for every gene we fit the expression (UMI counts, scaled log-normalized expression, scaled Pearson residuals) as a function of log10-transformed mean UMI count using kernel regression (ksmooth function) with normal kernel and large bandwidth (20 times the size suggested by R function bw.SJ). For visualization, we only used the central 90% of cells based on total UMI. For every gene group, we show the expression range after smoothing from the first to third quartile at 200 equidistant cell UMI values

imageimage Figure 1. Figures that show the relationship between gene expression and the cell sequencing depth (Fig 1C, D and Fig 3A in the paper[1]).

Questions:

  1. If I understand it correctly, genes are grouped genes into six bins based on log10-transformed mean UMI count. Do we need to regroup genes after log-normalization or sctransforms?
  2. Is the x values for the kernel regression the log10-transformed total UMI count of each cell, which is, in other words, x = log10(PBMC$nCount_RNA)?
  3. For doing kernel regressing on different types of data, like log-normalized data and Pearson residuals, are they using the same values of x which is the original total UMI count of each cell, or the new total counts based on newly calculated data, for example, the PBMC$nCount_SCT for Pearson residuals?
  4. For plotting, Should I use the geom_smooth from ggplot2 to draw the colored region, or simply use the values of quartiles to mark the boundary of the region and filled it with a chosen color?
  5. Will the shape(/trend) of the curve change after doing the z-score ?
  6. Besides, I also notice that a small number of features are removed after the sctransform and I cannot find explanations about it online. Is there any automatic filtering in the sctransform?

I also put my R commands here:

knitr::opts_chunk$set(echo = TRUE)
library(Seurat)
library(sctransform)
library(dplyr)
library(ggplot2)
library(BiocParallel)

check the data in the paper The dataset used in the main text is ‘33k PBMCs from a Healthy Donor, v1 Chemistry’ from 10x Genomics.

PBMC.data <- Read10X(data.dir =  "/home/rjia0012/projects/single_cell_kidney/raw_gene_bc_matrices/hg19/",unique.features = FALSE)
genenamesPBMC <- read.table("/home/rjia0012/projects/single_cell_kidney/raw_gene_bc_matrices/hg19/genes.tsv")
rownames(PBMC.data)<-genenamesPBMC$V1
PBMC <- CreateSeuratObject(counts = PBMC.data, project = "pbmc", min.cells = 3, min.features = 100)
rm(PBMC.data)

sct

options(future.globals.maxSize= 10*1024^3)
PBMC <- SCTransform(object = PBMC, verbose = FALSE)

visualization

PBMC <- RunPCA(object = PBMC, verbose = FALSE)
PBMC <- RunUMAP(object = PBMC, dims = 1:20, verbose = FALSE)
PBMC <- FindNeighbors(object = PBMC, dims = 1:20, verbose = FALSE)
PBMC <- FindClusters(object = PBMC, verbose = FALSE)
DimPlot(object = PBMC, label = TRUE) + NoLegend() + ggtitle('sctransform') 

grouping genes based on expression abundance.

GroupGenes<-function(dataframe){
  dataframe<-data.frame(dataframe)
  group_table<-data.frame(gene_mean=log10(rowMeans(dataframe)), 
                       group=rep(1,nrow(dataframe)))
  # I think there will be no zero means in the data.
  for (i in seq(-3,1,1)){
    group_table$group[group_table$gene_mean>i]<-group_table$group[group_table$gene_mean>i]+1
    }
  return(group_table)
}

gene_group_pbmc<-GroupGenes(PBMC@assays$RNA@counts) 
rm(GroupGenes)
table(gene_group_pbmc$group) # similar to the values in the paper, their data maybe not trimmed with minimum counts etc.
gene_group_pbmc_sct=gene_group_pbmc[rownames(PBMC@assays$RNA) %in% rownames(PBMC@assays$SCT),]
gene_count<-PBMC@assays$SCT@data[gene_group_pbmc_sct$group==2,] # group 5 in the paper
cell_count<-log10(PBMC$nCount_RNA)

width=20*bw.SJ(cell_count)  ## use x or y??, error plotted when using y. 
fun <- function(v) {ksmooth(y=gene_count[v,],x=cell_count, kernel = "normal",bandwidth = width)}
a<-bplapply(1:nrow(gene_count), fun, BPPARAM=MulticoreParam(8))
# combine ksmooth result
fit_scale<-data.frame()
for (i in c(1:nrow(gene_count))){
  fit_scale<-rbind(fit_scale,cbind(a[[i]]$x,c(a[[i]]$y)))
}
colnames(fit_scale)<-c("x","y")

fit_scale<-fit_scale[!is.na(fit_scale$x),]
fit_scale<-fit_scale[!is.na(fit_scale$y),]
listb=order(fit_scale$x)
a2<-data.frame(x=fit_scale$x[listb],y=fit_scale$y[listb])
#selection
a2<-a2[a2$x>=quantile(a2$x,0.05),]
a2<-a2[a2$x<=quantile(a2$x,0.95),]

list_order=c(1:nrow(a2))
fun2<-function(i){
  list_order[a2$x>i][1]
}
INDEX=sapply(seq(min(a2$x),max(a2$x),length.out=200),fun2)
INDEX[200]=nrow(a2)
#selection
fun_selection<-function(lowx,highx){
  b=a2$y[lowx:highx]
  c(mean(b),quantile(b,0.25),quantile(b,0.75),mean(a2$x[lowx:highx]))
}
a3<-mapply(fun_selection,INDEX[1:199],INDEX[2:200])
a3<-t(a3)
colnames(a3)<-c("mean","down","up","cell_count")
#plot
data.frame(a3) %>% 
  ggplot(aes(x=cell_count,y=mean))+
  geom_line(colour="black",size=1.5)+
  geom_line(aes(y = up),colour="orange",alpha= .5) + 
  geom_line(aes(y = down),colour="orange",alpha= .5) + 
  geom_ribbon(aes(ymin=down,ymax=up),fill="orange",alpha= .5)+
  theme_minimal()+
  labs(x="Total cell UMI count (K)", y="PBMC Pearson residual",title = "group5")

image image

sessionInfo()

R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] splines grid parallel stats4 stats graphics grDevices utils datasets methods
[11] base

other attached packages: [1] phateR_0.4.1 sctransform_0.2.0 gridExtra_2.3
[4] ape_5.3 reshape2_1.4.3 VennDiagram_1.6.20
[7] futile.logger_1.4.3 gplots_3.0.1.1 zinbwave_1.4.2
[10] statmod_1.4.32 clipr_0.7.0 pheatmap_1.0.12
[13] rlang_0.4.0 umapr_0.0.0.9000 cowplot_1.0.0
[16] dplyr_0.8.3 SingleCellExperiment_1.4.1 edgeR_3.24.3
[19] limma_3.38.3 DESeq2_1.22.2 SummarizedExperiment_1.12.0 [22] DelayedArray_0.8.0 BiocParallel_1.16.6 matrixStats_0.54.0
[25] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 IRanges_2.16.0
[28] S4Vectors_0.20.1 Seurat_3.0.2 monocle_2.8.0
[31] DDRTree_0.1.5 irlba_2.3.3 VGAM_1.1-1
[34] ggplot2_3.2.0 Biobase_2.42.0 BiocGenerics_0.28.0
[37] Matrix_1.2-14

loaded via a namespace (and not attached): [1] reticulate_1.12 R.utils_2.9.0 tidyselect_0.2.5 RSQLite_2.1.1
[5] AnnotationDbi_1.44.0 htmlwidgets_1.3 combinat_0.0-8 docopt_0.6.1
[9] Rtsne_0.15 munsell_0.5.0 codetools_0.2-15 ica_1.0-2
[13] future_1.14.0 withr_2.1.2 colorspace_1.4-1 fastICA_1.2-2
[17] knitr_1.23 rstudioapi_0.10 pspline_1.0-18 ROCR_1.0-7
[21] gbRd_0.4-11 listenv_0.7.0 Rdpack_0.11-0 slam_0.1-45
[25] GenomeInfoDbData_1.2.0 bit64_0.9-7 vctrs_0.2.0 lambda.r_1.2.3
[29] xfun_0.8 R6_2.4.0 rsvd_1.0.1 locfit_1.5-9.1
[33] bitops_1.0-6 assertthat_0.2.1 SDMTools_1.1-221.1 scales_1.0.0
[37] nnet_7.3-12 gtable_0.3.0 npsurv_0.4-0 globals_0.12.4
[41] zeallot_0.1.0 genefilter_1.64.0 lazyeval_0.2.2 acepack_1.4.1
[45] checkmate_1.9.4 yaml_2.2.0 backports_1.1.4 Hmisc_4.2-0
[49] tools_3.5.1 RColorBrewer_1.1-2 stabledist_0.7-1 ggridges_0.5.1
[53] Rcpp_1.0.2 plyr_1.8.4 base64enc_0.1-3 zlibbioc_1.28.0
[57] purrr_0.3.2 RCurl_1.95-4.12 densityClust_0.3 rpart_4.1-13
[61] pbapply_1.4-1 viridis_0.5.1 zoo_1.8-6 ggrepel_0.8.1
[65] cluster_2.0.7-1 magrittr_1.5 data.table_1.12.2 futile.options_1.0.1
[69] lmtest_0.9-37 RANN_2.6.1 mvtnorm_1.0-11 fitdistrplus_1.0-14
[73] gsl_2.1-6 lsei_1.2-0 xtable_1.8-4 XML_3.98-1.20
[77] sparsesvd_0.2 HSMMSingleCell_0.114.0 compiler_3.5.1 tibble_2.1.3
[81] KernSmooth_2.23-15 crayon_1.3.4 R.oo_1.22.0 htmltools_0.3.6
[85] pcaPP_1.9-73 Formula_1.2-3 tidyr_0.8.3 geneplotter_1.60.0
[89] DBI_1.0.0 formatR_1.7 MASS_7.3-50 R.methodsS3_1.7.1
[93] gdata_2.18.0 metap_1.1 igraph_1.2.4.1 pkgconfig_2.0.2
[97] numDeriv_2016.8-1.1 foreign_0.8-70 plotly_4.9.0 foreach_1.4.4
[101] annotate_1.60.1 XVector_0.22.0 bibtex_0.4.2 stringr_1.4.0
[105] digest_0.6.20 tsne_0.1-3 copula_0.999-19.1 ADGofTest_0.3
[109] softImpute_1.4 htmlTable_1.13.1 gtools_3.8.1 nlme_3.1-137
[113] jsonlite_1.6 viridisLite_0.3.0 pillar_1.4.2 lattice_0.20-35
[117] httr_1.4.0 survival_2.42-6 glue_1.3.1 qlcMatrix_0.9.7
[121] FNN_1.1.3 iterators_1.0.10 png_0.1-7 glmnet_2.0-18
[125] bit_1.1-14 stringi_1.4.3 blob_1.2.0 latticeExtra_0.6-28
[129] caTools_1.17.1.2 memoise_1.1.0 future.apply_1.3.0

Thank you for your time and consideration in regards to this matter.

ChristophH commented 5 years ago

Hello,

I am sorry that this was giving you a hard time. Below is the code used to smooth the data and visualize the summary statistics. In this example you can replace the count matrix cm with another dataset and put the normalized expression in norm.expression. I hope this answers your questions.

library('Matrix')
library('reshape2')
library('ggplot2')
library('dplyr')
library('scales')
library('sctransform')

# load a UMI count matrix (same as the one used in 
# https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html)
cm <- readRDS(file = "~/Projects/data/pbmc3k_umi_counts.Rds")
# subset to keep only genes detected in at leat 5 cells
cm <- cm[rowSums(cm > 0) > 4, ]

# use the geometric mean of the genes to group them
# (using the arithmetic mean would usually not change things much)
gene.gmean <- sctransform:::row_gmean(cm)
gene.grp.breaks <- seq(from=min(log10(gene.gmean))-10*.Machine$double.eps, 
                       to=max(log10(gene.gmean)), length.out = 7)
gene.grp <- cut(log10(gene.gmean), breaks=gene.grp.breaks, ordered_result = TRUE)
gene.grp <- factor(gene.grp, ordered=TRUE, levels=rev(levels(gene.grp)))
names(gene.grp) <- names(gene.gmean)

# apply log-normalization
norm.expression <- log1p(t(t(cm) / colSums(cm)) * 10000)
# alternatively apply sctransform normalization
#norm.expression <- vst(umi = cm, return_gene_attr = FALSE, return_cell_attr = FALSE)$y

# convert to dense matrix format to speed things up 
# (make sure enough memory is available)
norm.expression <- as.matrix(norm.expression)

# apply the smoothing; use 200 points along the central 90% of the x range to sample 
# the smooth line; fitting is done on entire range
x <- log10(colSums(cm))
x.out <- seq(from=quantile(x, 0.05), to=quantile(x, 0.95), length.out = 200)
bw <- bw.SJ(x) * 20
norm.expression.smooth <- t(apply(norm.expression, 1, function(y) 
  ksmooth(y = scale(y), x = x, x.points = x.out, bandwidth = bw)$y))

# create a data frame for plotting
df <- melt(t(norm.expression.smooth), varnames = c('cell', 'gene'), value.name = 'expr')
# add total UMI per cell to data frame
df$log.umi <- x.out
# add gene group label
df$grp <- gene.grp[df$gene]

# create a data frame for the labels
labdf <- df %>% 
  group_by(grp) %>% 
  summarise(log.umi = median(log.umi), 
            expr = quantile(df$expr, 0.999), 
            lab = paste('Gene group', as.numeric(grp))[1])

# visualize results
# use ggplot to summarize the data and show mean and interquartile range per group

theme_set(theme_classic(base_size=9))

ggplot(df, aes(log.umi, expr)) + 
  stat_summary(aes(fill=grp, group=grp), geom="ribbon", 
               fun.ymin = function(z) { quantile(z,0.25) },
               fun.ymax = function(z) { quantile(z,0.75) }, 
               alpha=1.0) +
  scale_fill_manual(values = rev(brewer_pal(palette='YlOrRd')(7))) +
  stat_summary(fun.y = "mean", size = 1.5, geom = "line", color='black') +
  facet_wrap(~ grp, ncol = 2, scales = 'free_y') +
  coord_cartesian(ylim=quantile(df$expr, c(0.001, 0.999)), expand = FALSE) +
  geom_text(data=labdf, aes(label = lab), vjust = 1, size = 2.2, color='gray25') +
  xlab('Total cell UMI [log10]') + ylab('Scaled normalized expression') +
  theme(strip.text = element_blank()) +
  theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
  scale_x_continuous(labels = function(x) {format(x, digits=2)})
sainadfensi commented 5 years ago

Hi Christoph, thank you so much! I use the code you provide to run on the SCT data and then get the same plot in the paper.