Fail to reproduce vignette for atac-seq integration

lldelisle commented 5 years ago

Hi, I was very interested in your last Cell paper and I wanted to present it in a Journal Club. I am trying to reproduce the vignette for the atac-seq integration. I downloaded all data following the links and the gtf from ensembl. I ran the following commands:

library(Seurat)
library(ggplot2)
library(hdf5r)

peaks <- Read10X_h5("atac_v1_pbmc_10k_filtered_peak_bc_matrix.h5")
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "Homo_sapiens.GRCh38.82.gtf.gz", 
    seq.levels = c(1:22, "X", "Y"), upstream = 2000, verbose = TRUE)
pbmc.atac <- CreateSeuratObject(counts = peaks, assay = "ATAC", project = "10x_ATAC")
pbmc.atac[["ACTIVITY"]] <- CreateAssayObject(counts = activity.matrix)
meta <- read.table("atac_v1_pbmc_10k_singlecell.csv", sep = ",", header = TRUE, row.names = 1, 
    stringsAsFactors = FALSE)
meta <- meta[colnames(pbmc.atac), ]
pbmc.atac <- AddMetaData(pbmc.atac, metadata = meta)
pbmc.atac <- subset(pbmc.atac, subset = nCount_ATAC > 5000)
pbmc.atac$tech <- "atac"
DefaultAssay(pbmc.atac) <- "ACTIVITY"
# Warning: Feature names cannot have underscores ('_'), replacing with dashes ('-')
pbmc.atac <- FindVariableFeatures(pbmc.atac)
pbmc.atac <- NormalizeData(pbmc.atac)
pbmc.atac <- ScaleData(pbmc.atac)
DefaultAssay(pbmc.atac) <- "ATAC"
VariableFeatures(pbmc.atac) <- names(which(Matrix::rowSums(pbmc.atac) > 100))
pbmc.atac <- RunLSI(pbmc.atac, n = 50, scale.max = NULL)
# Warning messages:
# 1: RunLSI is being moved to Signac. Equivalent functionality can be achieved via the Signac::RunTFIDF and Signac::RunSVD functions; for more information on Signac, please see https://github.com/timoast/Signac 
# 2: RunLSI is being moved to Signac. Equivalent functionality can be achieved via the Signac::RunTFIDF and Signac::RunSVD functions; for more information on Signac, please see https://github.com/timoast/Signac 
# 3: RunLSI is being moved to Signac. Equivalent functionality can be achieved via the Signac::RunTFIDF and Signac::RunSVD functions; for more information on Signac, please see https://github.com/timoast/Signac 

pbmc.atac <- RunUMAP(pbmc.atac, reduction = "lsi", dims = 1:50, umap.method = 'umap-learn', metric = 'correlation')
# I could not run the default RunUMAP with uwa as I got a segmentation fault:
# pbmc.atac <- RunUMAP(pbmc.atac, reduction = "lsi", dims = 1:50)
# Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
# To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
# This message will be shown once per session
# 
#  *** caught segfault ***
# address 0xfffffffffffffff7, cause 'memory not mapped'
# 
# Traceback:
#  1: RcppParallel::setThreadOptions(numThreads = n_threads)
#  2: uwot(X = X, n_neighbors = n_neighbors, n_components = n_components,     metric = metric, n_epochs = n_epochs, alpha = learning_rate,     scale = scale, init = init, init_sdev = init_sdev, spread = spread,     min_dist = min_dist, set_op_mix_ratio = set_op_mix_ratio,     local_connectivity = local_connectivity, bandwidth = bandwidth,     gamma = repulsion_strength, negative_sample_rate = negative_sample_rate,     a = a, b = b, nn_method = nn_method, n_trees = n_trees, search_k = search_k,     method = "umap", approx_pow = approx_pow, n_threads = n_threads,     n_sgd_threads = n_sgd_threads, grain_size = grain_size, y = y,     target_n_neighbors = target_n_neighbors, target_weight = target_weight,     target_metric = target_metric, pca = pca, pca_center = pca_center,     pcg_rand = pcg_rand, fast_sgd = fast_sgd, ret_model = ret_model,     ret_nn = ret_nn, tmpdir = tempdir(), verbose = verbose)
#  3: umap(X = object, n_threads = nbrOfWorkers(), n_neighbors = as.integer(x = n.neighbors),     n_components = as.integer(x = n.components), metric = metric,     n_epochs = n.epochs, learning_rate = learning.rate, min_dist = min.dist,     spread = spread, set_op_mix_ratio = set.op.mix.ratio, local_connectivity = local.connectivity,     repulsion_strength = repulsion.strength, negative_sample_rate = negative.sample.rate,     a = a, b = b, verbose = verbose)
#  4: RunUMAP.default(object = data.use, assay = assay, umap.method = umap.method,     n.neighbors = n.neighbors, n.components = n.components, metric = metric,     n.epochs = n.epochs, learning.rate = learning.rate, min.dist = min.dist,     spread = spread, set.op.mix.ratio = set.op.mix.ratio, local.connectivity = local.connectivity,     repulsion.strength = repulsion.strength, negative.sample.rate = negative.sample.rate,     a = a, b = b, seed.use = seed.use, metric.kwds = metric.kwds,     angular.rp.forest = angular.rp.forest, reduction.key = reduction.key,     verbose = verbose)
#  5: RunUMAP(object = data.use, assay = assay, umap.method = umap.method,     n.neighbors = n.neighbors, n.components = n.components, metric = metric,     n.epochs = n.epochs, learning.rate = learning.rate, min.dist = min.dist,     spread = spread, set.op.mix.ratio = set.op.mix.ratio, local.connectivity = local.connectivity,     repulsion.strength = repulsion.strength, negative.sample.rate = negative.sample.rate,     a = a, b = b, seed.use = seed.use, metric.kwds = metric.kwds,     angular.rp.forest = angular.rp.forest, reduction.key = reduction.key,     verbose = verbose)
#  6: RunUMAP.Seurat(pbmc.atac, reduction = "lsi", dims = 1:50)
#  7: RunUMAP(pbmc.atac, reduction = "lsi", dims = 1:50)
# 

g <- DimPlot(pbmc.atac, reduction = "umap")
ggsave("UMAP_ATAC_allPeaks_fromLSI.pdf", g)

pbmc.rna <- readRDS("pbmc_10k_v3.rds")
pbmc.rna$tech <- "rna"

p1 <- DimPlot(pbmc.atac, reduction = "umap") + NoLegend() + ggtitle("scATAC-seq")
p2 <- DimPlot(pbmc.rna, group.by = "celltype", label = TRUE, repel = TRUE) + NoLegend() + ggtitle("scRNA-seq")
g <- CombinePlots(plots = list(p1, p2))

ggsave("UMAP_both_before_anchor.pdf", g, width = 14)

transfer.anchors <- FindTransferAnchors(reference = pbmc.rna, query = pbmc.atac, features = VariableFeatures(object = pbmc.rna), 
    reference.assay = "RNA", query.assay = "ACTIVITY", reduction = "cca")

celltype.predictions <- TransferData(anchorset = transfer.anchors, refdata = pbmc.rna$celltype, 
    weight.reduction = pbmc.atac[["lsi"]])

pbmc.atac <- AddMetaData(pbmc.atac, metadata = celltype.predictions)
pdf("prediction_score.pdf")
hist(pbmc.atac$prediction.score.max)
abline(v = 0.5, col = "red")
dev.off()

pbmc.atac.filtered <- subset(pbmc.atac, subset = prediction.score.max > 0.5)
pbmc.rna$celltype <- factor(pbmc.rna$celltype)
pbmc.atac.filtered$predicted.id <- factor(pbmc.atac.filtered$predicted.id, levels = levels(pbmc.rna$celltype))  # to make the colors match
p1 <- DimPlot(pbmc.atac.filtered, group.by = "predicted.id", label = TRUE, repel = TRUE) + ggtitle("scATAC-seq cells") + 
    NoLegend() + scale_colour_hue(drop = FALSE)
p2 <- DimPlot(pbmc.rna, group.by = "celltype", label = TRUE, repel = TRUE) + ggtitle("scRNA-seq cells") + 
    NoLegend()
g <- CombinePlots(plots = list(p1, p2))
ggsave("UMAP_both_withAnnot.pdf", g, width = 14)

g <- DimPlot(pbmc.atac.filtered, group.by = "predicted.id")
ggsave("UMAP_with_predicted_id.pdf", g, width = 14)
sessionInfo()
# R version 3.6.0 (2019-04-26)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Red Hat Enterprise Linux Server 7.6 (Maipo)
# 
# Matrix products: default
# BLAS/LAPACK: /ssoft/spack/humagne/v1/opt/spack/linux-rhel7-x86_S6g1_Mellanox/gcc-7.4.0/openblas-0.3.6-rouolqsvwzf5lybkuo6h7as6xoyltdfo/lib/libopenblas_haswellp-r0.3.6.so
# 
# locale:
#  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
# [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
# 
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] hdf5r_1.2.0   ggplot2_3.2.1 Seurat_3.1.0 
# 
# loaded via a namespace (and not attached):
#   [1] nlme_3.1-141                tsne_0.1-3                 
#   [3] matrixStats_0.54.0          bitops_1.0-6               
#   [5] bit64_0.9-7                 RcppAnnoy_0.0.12           
#   [7] RColorBrewer_1.1-2          httr_1.4.1                 
#   [9] GenomeInfoDb_1.20.0         sctransform_0.2.0          
#  [11] tools_3.6.0                 R6_2.4.0                   
#  [13] irlba_2.3.3                 KernSmooth_2.23-15         
#  [15] BiocGenerics_0.30.0         uwot_0.1.3                 
#  [17] lazyeval_0.2.2              colorspace_1.4-1           
#  [19] npsurv_0.4-0                withr_2.1.2                
#  [21] tidyselect_0.2.5            gridExtra_2.3              
#  [23] bit_1.1-14                  compiler_3.6.0             
#  [25] Biobase_2.44.0              DelayedArray_0.10.0        
#  [27] plotly_4.9.0                labeling_0.3               
#  [29] rtracklayer_1.44.3          caTools_1.17.1.2           
#  [31] scales_1.0.0                lmtest_0.9-37              
#  [33] ggridges_0.5.1              pbapply_1.4-2              
#  [35] Rsamtools_2.0.0             stringr_1.4.0              
#  [37] digest_0.6.20               R.utils_2.9.0              
#  [39] XVector_0.24.0              pkgconfig_2.0.2            
#  [41] htmltools_0.3.6             bibtex_0.4.2               
#  [43] htmlwidgets_1.3             rlang_0.4.0                
#  [45] zoo_1.8-6                   jsonlite_1.6               
#  [47] BiocParallel_1.18.1         ica_1.0-2                  
#  [49] gtools_3.8.1                dplyr_0.8.3                
#  [51] R.oo_1.22.0                 RCurl_1.95-4.12            
#  [53] magrittr_1.5                GenomeInfoDbData_1.2.1     
#  [55] Matrix_1.2-17               S4Vectors_0.22.0           
#  [57] Rcpp_1.0.2                  munsell_0.5.0              
#  [59] ape_5.3                     reticulate_1.13            
#  [61] R.methodsS3_1.7.1           stringi_1.4.3              
#  [63] SummarizedExperiment_1.14.1 zlibbioc_1.30.0            
#  [65] gbRd_0.4-11                 MASS_7.3-51.4              
#  [67] gplots_3.0.1.1              Rtsne_0.15                 
#  [69] plyr_1.8.4                  grid_3.6.0                 
#  [71] parallel_3.6.0              gdata_2.18.0               
#  [73] listenv_0.7.0               ggrepel_0.8.1              
#  [75] crayon_1.3.4                lattice_0.20-38            
#  [77] Biostrings_2.52.0           cowplot_1.0.0              
#  [79] splines_3.6.0               SDMTools_1.1-221.1         
#  [81] pillar_1.4.2                GenomicRanges_1.36.0       
#  [83] igraph_1.2.4.1              stats4_3.6.0               
#  [85] future.apply_1.3.0          reshape2_1.4.3             
#  [87] codetools_0.2-16            leiden_0.3.1               
#  [89] XML_3.98-1.20               glue_1.3.1                 
#  [91] lsei_1.2-0                  metap_1.1                  
#  [93] data.table_1.12.2           RcppParallel_4.4.3         
#  [95] png_0.1-7                   Rdpack_0.11-0              
#  [97] gtable_0.3.0                RANN_2.6.1                 
#  [99] purrr_0.3.2                 tidyr_0.8.3                
# [101] future_1.14.0               assertthat_0.2.1           
# [103] rsvd_1.0.2                  survival_2.44-1.1          
# [105] viridisLite_0.3.0           tibble_2.1.3               
# [107] GenomicAlignments_1.20.1    IRanges_2.18.2             
# [109] cluster_2.1.0               globals_0.12.4             
# [111] fitdistrplus_1.0-14         ROCR_1.0-7

Unfortunately, when I ran the umap on ATAC-Seq I got a segmentation fault. So, I used the python version as you suggested. The UMAP is quite different from expected: UMAP_both_before_anchor.pdf When, I looked at the prediction scores, they were much worse than in your vignette: prediction_score.pdf The UMAP with annotation is not really as expected. Especially, there is no pDC... UMAP_both_withAnnot.pdf

I probably forgot something...

Thanks for your help,

Lucille

andrewwbutler commented 5 years ago

Hi Lucille,

That is strange. I would expected that the UMAP plots won't look the same between the R and Python implementations but the anchor scores shouldn't change. Can you try downloading pbmc_10k_v3.rds again? It might be possible you're working with an older version of that file. Also, the vignettes were built using this latest docker image (https://hub.docker.com/r/satijalab/seurat/tags) if you want to see if that works for you.

lldelisle commented 5 years ago

Hi, I downloaded pbmc_10k_v3.rds which indeed had been updated. However, when I rerun the commands with the new rds, the scores were exactly the same. I was wondering if you had the markdown source for the vignette to check and by any chance, the transfer.anchors dataset to check if the problem is to find the anchors or to transfer the data. Many thanks, Lucille

andrewwbutler commented 5 years ago

Here's a link to download the RMD source. You may have to alter the paths but hopefully knitting it from scratch will work.

lldelisle commented 5 years ago

I feel very stupid. I used Homo_sapiens.GRCh38.82.gtf instead of Homo_sapiens.GRCh37.82.gtf (hg38 instead of hg19)... Maybe you could add the link to the gtf (ftp://ftp.ensembl.org/pub/grch37/release-82/gtf/homo_sapiens/Homo_sapiens.GRCh37.82.gtf.gz) in the vignette to avoid this error. Thanks for your time and really sorry.

satijalab / seurat

Fail to reproduce vignette for atac-seq integration #2083