Closed drbecavin closed 5 years ago
sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 10 (buster)
Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so
locale: [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=Cattached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base
other attached packages: [1] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 ggplot2_3.2.0 Seurat_3.0.3.9010 dplyr_0.8.3
loaded via a namespace (and not attached): [1] Rtsne_0.15 colorspace_1.4-1 ggridges_0.5.1 XVector_0.22.0 rstudioapi_0.10 listenv_0.7.0 npsurv_0.4-0 ggrepel_0.8.1
[9] bit64_0.9-7 codetools_0.2-16 splines_3.5.2 R.methodsS3_1.7.1 lsei_1.2-0 knitr_1.23 jsonlite_1.6 Rsamtools_1.34.1
[17] ica_1.0-2 cluster_2.1.0 png_0.1-7 R.oo_1.22.0 sctransform_0.2.0 compiler_3.5.2 httr_1.4.0 assertthat_0.2.1
[25] Matrix_1.2-17 lazyeval_0.2.2 htmltools_0.3.6 tools_3.5.2 rsvd_1.0.1 igraph_1.2.4.1 gtable_0.3.0 glue_1.3.1
[33] GenomeInfoDbData_1.2.1 RANN_2.6.1 reshape2_1.4.3 Rcpp_1.0.1 Biobase_2.42.0 Biostrings_2.50.2 gdata_2.18.0 ape_5.3
[41] nlme_3.1-140 rtracklayer_1.42.2 gbRd_0.4-11 lmtest_0.9-37 xfun_0.8 stringr_1.4.0 globals_0.12.4 irlba_2.3.3
[49] gtools_3.8.1 XML_3.98-1.20 future_1.14.0 MASS_7.3-51.4 zlibbioc_1.28.0 zoo_1.8-6 scales_1.0.0 SummarizedExperiment_1.12.0 [57] RColorBrewer_1.1-2 yaml_2.2.0 reticulate_1.12 pbapply_1.4-0 gridExtra_2.3 stringi_1.4.3 caTools_1.17.1.2 BiocParallel_1.16.6
[65] bibtex_0.4.2 Rdpack_0.11-0 SDMTools_1.1-221.1 rlang_0.4.0 pkgconfig_2.0.2 bitops_1.0-6 matrixStats_0.54.0 lattice_0.20-38
[73] ROCR_1.0-7 purrr_0.3.2 GenomicAlignments_1.18.1 htmlwidgets_1.3 bit_1.1-14 cowplot_1.0.0 tidyselect_0.2.5 RcppAnnoy_0.0.12
[81] plyr_1.8.4 magrittr_1.5 R6_2.4.0 gplots_3.0.1.1 DelayedArray_0.8.0 pillar_1.4.2 withr_2.1.2 fitdistrplus_1.0-14
[89] survival_2.44-1.1 RCurl_1.95-4.12 tibble_2.1.3 future.apply_1.3.0 tsne_0.1-3 crayon_1.3.4 hdf5r_1.2.0 KernSmooth_2.23-15
[97] plotly_4.9.0 grid_3.5.2 data.table_1.12.2 metap_1.1 digest_0.6.20 tidyr_0.8.3 R.utils_2.9.0 munsell_0.5.0
[105] viridisLite_0.3.0
Can you try the suggestions in https://github.com/satijalab/seurat/issues/1918 and see if they fix the problem?
Thanks ! I have tried that:
peaks <- Read10X_h5("atacseq10x_pig_wt/filtered_peak_bc_matrix.h5")
rownames(peaks) <- paste("chr", rownames(peaks), sep="")
chromosomes = c(1:18, "X", "Y")
chromosomes = paste("chr", chromosomes, sep="")
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "Genome/Sscrofa11.1.91.scAtacSeq.v2.gtf", seq.levels = chromosomes, upstream = 2000, verbose = TRUE, include.body = FALSE)
With no success ...
What about the
Extend()
function ? Where does it come from ?
Can you print the first few lines of your gtf file and the output of head(rownames(peaks))
?
Extend
is defined in Seurat but not exported.
First few lines of GTF used by Cellranger-atac:
1 ensembl transcript 3472 18546 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; gene_name "ENSSSCG00000037372"; 1 ensembl transcript 3472 18546 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000042326"; transcript_version "1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; gene_name "ENSSSCG00000037372"; 1 ensembl transcript 5659 18696 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000046027"; transcript_version "1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; gene_name "ENSSSCG00000037372"; 1 ensembl transcript 23368 39931 . + . gene_id "ENSSSCG00000027257"; gene_version "2"; transcript_id "ENSSSCT00000058189"; transcript_version "1"; gene_name "PSMB1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_name "PSMB1-202"; transcript_source "ensembl"; transcript_biotype "protein_coding"; 1 ensembl transcript 23853 40113 . + . gene_id "ENSSSCG00000027257"; gene_version "2"; transcript_id "ENSSSCT00000023500"; transcript_version "2"; gene_name "PSMB1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_name "PSMB1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";
and output of head(rownames(peaks))
[1] "1:4427-5058" "1:21725-22544" "1:23302-24212" "1:43674-44167" "1:54085-54489" "1:67850-70422"
I have tried this to add manually the UCSC format = "chr" in front of chromosomes
peaks_pig_wt <- Read10X_h5("atacseq10x_pig_wt/filtered_peak_bc_matrix.h5")
rownames(peaks_pig_wt) <- paste("chr", rownames(peaks_pig_wt), sep="")
peaks_new <- peaks_pig_wt
head(rownames(peaks_pig_wt))
dimnames(peaks_new)[[1]] <- paste("chr", dimnames(peaks_pig_wt)[[1]], sep="")
chromosomes = c(1:18, "X", "Y")
chromosomes = paste("chr", chromosomes, sep="")
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks_pig_wt, annotation.file = "/home/becavin/Genome/Sscrofa11.1.91.scAtacSeq.v2.gtf", seq.levels = chromosomes, upstream = 2000, verbose = TRUE)
Still not working with error:
Error in
$<-.data.frame
(*tmp*
, "peak", value = ":-") : replacement has 1 row, data has 0
Do you have a link to the gtf file?
My GTF is extracted from this one: ftp://ftp.ensembl.org/pub/release-91/gtf/sus_scrofa/
I have parsed it to make it compliant with cellranger-atac software. The parsing being :
I am running Cellranger-atac again with new annotation with chromosome names having "chr" in their name. Just to test !
Ok, Solved !!!
I have found 3 bugs/blocking instruction
There is 3 thing in your function Seurat::CreateGeneActivityMatrix
which make it crashes with custom annotation.
if the chromosome names are not properly formatted with "chr" it will stop at
gene.distances <- GenomicRanges::distanceToNearest(x = peaks.gr, subject = gtf.body_prom)
. It is due to this line
GenomeInfoDb::seqlevelsStyle(gtf) <- "UCSC"
It can be partially avoided by #1918 as you mentioned. But maybe one can remove UCSC formatting. I am not sure if it useful here.
if some peaks are starting at position 0 in one chromosome it crashes also at
gene.distances <- GenomicRanges::distanceToNearest(x = peaks.gr, subject = gtf.body_prom)
One fix I found is to change all peaks starting at 0 to position 1
peaks <- Read10X_h5("filtered_peak_bc_matrix.h5")
rownames(peaks) <- gsub(rownames(peaks), pattern = ":0-", replacement = ":1-")
dimnames(peaks)[[1]] <- gsub(dimnames(peaks)[[1]], pattern = ":0-", replacement = ":1-")
Finally for peak annotation you are using the GTF feature "gene_name"
peak.ids$gene.name <- gene.ids$gene_name
For some (most) of the custom GTF this feature will not be present for all genes. I fixed that by adding this line in the function definition
peak.ids <- peaks.gr[S4Vectors::queryHits(x = keep.overlaps)]
gene.ids <- gtf.genes[S4Vectors::subjectHits(x = keep.overlaps)]
gene.ids$gene_name[is.na(gene.ids$gene_name)] = gene.ids$gene_id[is.na(gene.ids$gene_name)]
peak.ids$gene.name <- gene.ids$gene_name
(I can submit a pull request tomorrow with these fix if you want. I think it will help many people wanting to try Seurat for scATACSeq with custom reference, before we all switch to Signac.)
Hello, I am using your AtacSeq tutorial to compare RNAseq vs AtacSeq of Pig cells. I have used succesfully your tutorial for human datasets but now it is giving me errors when running it on Pig genome annotation.
Here is what I am running:
I got this Traceback :
Not really explicit has you can see.
I tried to debug it by running step by step the function : CreateGeneActivityMatrix
gtf.body_prom <- Extend(x = gtf.genes, upstream = upstream, downstream = downstream)
GenomeInfoDb::seqlevelsStyle(gtf) <- "UCSC"
andgene.distances <- GenomicRanges::distanceToNearest(x = peaks.gr, subject = gtf.body_prom)
are executed, it seems to give errors.peak.df <- do.call(what = rbind, args = strsplit(x = gsub(peak.df, pattern = ":", replacement = "-"), split = "-"))"
In any case, I tried to fix all that without success. Help ? :)