Closed sainadfensi closed 4 years ago
Hello,
I am sorry that this was giving you a hard time. Below is the code used to smooth the data and visualize the summary statistics. In this example you can replace the count matrix cm
with another dataset and put the normalized expression in norm.expression
. I hope this answers your questions.
library('Matrix')
library('reshape2')
library('ggplot2')
library('dplyr')
library('scales')
library('sctransform')
# load a UMI count matrix (same as the one used in
# https://satijalab.org/seurat/v3.0/pbmc3k_tutorial.html)
cm <- readRDS(file = "~/Projects/data/pbmc3k_umi_counts.Rds")
# subset to keep only genes detected in at leat 5 cells
cm <- cm[rowSums(cm > 0) > 4, ]
# use the geometric mean of the genes to group them
# (using the arithmetic mean would usually not change things much)
gene.gmean <- sctransform:::row_gmean(cm)
gene.grp.breaks <- seq(from=min(log10(gene.gmean))-10*.Machine$double.eps,
to=max(log10(gene.gmean)), length.out = 7)
gene.grp <- cut(log10(gene.gmean), breaks=gene.grp.breaks, ordered_result = TRUE)
gene.grp <- factor(gene.grp, ordered=TRUE, levels=rev(levels(gene.grp)))
names(gene.grp) <- names(gene.gmean)
# apply log-normalization
norm.expression <- log1p(t(t(cm) / colSums(cm)) * 10000)
# alternatively apply sctransform normalization
#norm.expression <- vst(umi = cm, return_gene_attr = FALSE, return_cell_attr = FALSE)$y
# convert to dense matrix format to speed things up
# (make sure enough memory is available)
norm.expression <- as.matrix(norm.expression)
# apply the smoothing; use 200 points along the central 90% of the x range to sample
# the smooth line; fitting is done on entire range
x <- log10(colSums(cm))
x.out <- seq(from=quantile(x, 0.05), to=quantile(x, 0.95), length.out = 200)
bw <- bw.SJ(x) * 20
norm.expression.smooth <- t(apply(norm.expression, 1, function(y)
ksmooth(y = scale(y), x = x, x.points = x.out, bandwidth = bw)$y))
# create a data frame for plotting
df <- melt(t(norm.expression.smooth), varnames = c('cell', 'gene'), value.name = 'expr')
# add total UMI per cell to data frame
df$log.umi <- x.out
# add gene group label
df$grp <- gene.grp[df$gene]
# create a data frame for the labels
labdf <- df %>%
group_by(grp) %>%
summarise(log.umi = median(log.umi),
expr = quantile(df$expr, 0.999),
lab = paste('Gene group', as.numeric(grp))[1])
# visualize results
# use ggplot to summarize the data and show mean and interquartile range per group
theme_set(theme_classic(base_size=9))
ggplot(df, aes(log.umi, expr)) +
stat_summary(aes(fill=grp, group=grp), geom="ribbon",
fun.ymin = function(z) { quantile(z,0.25) },
fun.ymax = function(z) { quantile(z,0.75) },
alpha=1.0) +
scale_fill_manual(values = rev(brewer_pal(palette='YlOrRd')(7))) +
stat_summary(fun.y = "mean", size = 1.5, geom = "line", color='black') +
facet_wrap(~ grp, ncol = 2, scales = 'free_y') +
coord_cartesian(ylim=quantile(df$expr, c(0.001, 0.999)), expand = FALSE) +
geom_text(data=labdf, aes(label = lab), vjust = 1, size = 2.2, color='gray25') +
xlab('Total cell UMI [log10]') + ylab('Scaled normalized expression') +
theme(strip.text = element_blank()) +
theme(legend.position="none") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) +
scale_x_continuous(labels = function(x) {format(x, digits=2)})
Hi Christoph, thank you so much! I use the code you provide to run on the SCT data and then get the same plot in the paper.
Dear Team,
I am learning the single-cell RNA-seq with the Seurat package every day. Thank you for developing the sctransform to provide a novel statistical technique for normalization.
In the paper[1] about the sctransform, multiple statistical methods are applied to show the convincing effectiveness with loads of plots. In particular, I am interested in the figures in the paper which show the relationship between gene expression and the cell sequencing depth (Fig 1C, D and Fig 3A in the paper). However, when I try to repeat the plot by myself, the result turns out to be different. I believe that there must be some misunderstandings in the steps of plotting the figure. Therefore, it would be so kind if you could help me to go into details on the steps of plotting the trends in the data before and after normalization (Methods 4.4 in the paper).
The method described in the paper is: 4.4 Trends in the data before and after normalization [1] We grouped genes into six bins based on log10-transformed mean UMI count, using bins of equal width. To show the overall trends in the data, for every gene we fit the expression (UMI counts, scaled log-normalized expression, scaled Pearson residuals) as a function of log10-transformed mean UMI count using kernel regression (ksmooth function) with normal kernel and large bandwidth (20 times the size suggested by R function bw.SJ). For visualization, we only used the central 90% of cells based on total UMI. For every gene group, we show the expression range after smoothing from the first to third quartile at 200 equidistant cell UMI values
Figure 1. Figures that show the relationship between gene expression and the cell sequencing depth (Fig 1C, D and Fig 3A in the paper[1]).
Questions:
I also put my R commands here:
check the data in the paper The dataset used in the main text is ‘33k PBMCs from a Healthy Donor, v1 Chemistry’ from 10x Genomics.
sct
visualization
grouping genes based on expression abundance.
R version 3.5.1 (2018-07-02) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04 LTS
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale: [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages: [1] splines grid parallel stats4 stats graphics grDevices utils datasets methods
[11] base
other attached packages: [1] phateR_0.4.1 sctransform_0.2.0 gridExtra_2.3
[4] ape_5.3 reshape2_1.4.3 VennDiagram_1.6.20
[7] futile.logger_1.4.3 gplots_3.0.1.1 zinbwave_1.4.2
[10] statmod_1.4.32 clipr_0.7.0 pheatmap_1.0.12
[13] rlang_0.4.0 umapr_0.0.0.9000 cowplot_1.0.0
[16] dplyr_0.8.3 SingleCellExperiment_1.4.1 edgeR_3.24.3
[19] limma_3.38.3 DESeq2_1.22.2 SummarizedExperiment_1.12.0 [22] DelayedArray_0.8.0 BiocParallel_1.16.6 matrixStats_0.54.0
[25] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 IRanges_2.16.0
[28] S4Vectors_0.20.1 Seurat_3.0.2 monocle_2.8.0
[31] DDRTree_0.1.5 irlba_2.3.3 VGAM_1.1-1
[34] ggplot2_3.2.0 Biobase_2.42.0 BiocGenerics_0.28.0
[37] Matrix_1.2-14
loaded via a namespace (and not attached): [1] reticulate_1.12 R.utils_2.9.0 tidyselect_0.2.5 RSQLite_2.1.1
[5] AnnotationDbi_1.44.0 htmlwidgets_1.3 combinat_0.0-8 docopt_0.6.1
[9] Rtsne_0.15 munsell_0.5.0 codetools_0.2-15 ica_1.0-2
[13] future_1.14.0 withr_2.1.2 colorspace_1.4-1 fastICA_1.2-2
[17] knitr_1.23 rstudioapi_0.10 pspline_1.0-18 ROCR_1.0-7
[21] gbRd_0.4-11 listenv_0.7.0 Rdpack_0.11-0 slam_0.1-45
[25] GenomeInfoDbData_1.2.0 bit64_0.9-7 vctrs_0.2.0 lambda.r_1.2.3
[29] xfun_0.8 R6_2.4.0 rsvd_1.0.1 locfit_1.5-9.1
[33] bitops_1.0-6 assertthat_0.2.1 SDMTools_1.1-221.1 scales_1.0.0
[37] nnet_7.3-12 gtable_0.3.0 npsurv_0.4-0 globals_0.12.4
[41] zeallot_0.1.0 genefilter_1.64.0 lazyeval_0.2.2 acepack_1.4.1
[45] checkmate_1.9.4 yaml_2.2.0 backports_1.1.4 Hmisc_4.2-0
[49] tools_3.5.1 RColorBrewer_1.1-2 stabledist_0.7-1 ggridges_0.5.1
[53] Rcpp_1.0.2 plyr_1.8.4 base64enc_0.1-3 zlibbioc_1.28.0
[57] purrr_0.3.2 RCurl_1.95-4.12 densityClust_0.3 rpart_4.1-13
[61] pbapply_1.4-1 viridis_0.5.1 zoo_1.8-6 ggrepel_0.8.1
[65] cluster_2.0.7-1 magrittr_1.5 data.table_1.12.2 futile.options_1.0.1
[69] lmtest_0.9-37 RANN_2.6.1 mvtnorm_1.0-11 fitdistrplus_1.0-14
[73] gsl_2.1-6 lsei_1.2-0 xtable_1.8-4 XML_3.98-1.20
[77] sparsesvd_0.2 HSMMSingleCell_0.114.0 compiler_3.5.1 tibble_2.1.3
[81] KernSmooth_2.23-15 crayon_1.3.4 R.oo_1.22.0 htmltools_0.3.6
[85] pcaPP_1.9-73 Formula_1.2-3 tidyr_0.8.3 geneplotter_1.60.0
[89] DBI_1.0.0 formatR_1.7 MASS_7.3-50 R.methodsS3_1.7.1
[93] gdata_2.18.0 metap_1.1 igraph_1.2.4.1 pkgconfig_2.0.2
[97] numDeriv_2016.8-1.1 foreign_0.8-70 plotly_4.9.0 foreach_1.4.4
[101] annotate_1.60.1 XVector_0.22.0 bibtex_0.4.2 stringr_1.4.0
[105] digest_0.6.20 tsne_0.1-3 copula_0.999-19.1 ADGofTest_0.3
[109] softImpute_1.4 htmlTable_1.13.1 gtools_3.8.1 nlme_3.1-137
[113] jsonlite_1.6 viridisLite_0.3.0 pillar_1.4.2 lattice_0.20-35
[117] httr_1.4.0 survival_2.42-6 glue_1.3.1 qlcMatrix_0.9.7
[121] FNN_1.1.3 iterators_1.0.10 png_0.1-7 glmnet_2.0-18
[125] bit_1.1-14 stringi_1.4.3 blob_1.2.0 latticeExtra_0.6-28
[129] caTools_1.17.1.2 memoise_1.1.0 future.apply_1.3.0
Thank you for your time and consideration in regards to this matter.