satijalab / seurat

R toolkit for single cell genomics
http://www.satijalab.org/seurat
Other
2.29k stars 914 forks source link

Problem with peak vs annotation in AtacSeq #2085

Closed drbecavin closed 5 years ago

drbecavin commented 5 years ago

Hello, I am using your AtacSeq tutorial to compare RNAseq vs AtacSeq of Pig cells. I have used succesfully your tutorial for human datasets but now it is giving me errors when running it on Pig genome annotation.

Here is what I am running:

peaks <- Read10X_h5("filtered_peak_bc_matrix.h5")
chromosomes = c(1:18, "X", "Y")
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "/home/becavin/Genome/Sscrofa11.2.gtf", seq.levels = chromosomes, upstream = 2000, verbose = TRUE, include.body = FALSE)

I got this Traceback :

Error in $<-.data.frame(*tmp*, "peak", value = ":-") : replacement has 1 row, data has 0
4. stop(sprintf(ngettext(N, "replacement has %d row, data has %d", "replacement has %d rows, data has %d"), N, nrows), domain = NA)
3. $<-.data.frame`(`*tmp*`, "peak", value = ":-")
2. $<-(*tmp*, "peak", value = ":-")
1. CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "/home/becavin/Genome/Sscrofa11.2.gtf", seq.levels = chromosomes, upstream = 2000, verbose = TRUE, include.body = FALSE)

Not really explicit has you can see.

I tried to debug it by running step by step the function : CreateGeneActivityMatrix

In any case, I tried to fix all that without success. Help ? :)

drbecavin commented 5 years ago

sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Debian GNU/Linux 10 (buster)

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/openblas/libblas.so.3 LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale: [1] LC_CTYPE=fr_FR.UTF-8 LC_NUMERIC=C LC_TIME=fr_FR.UTF-8 LC_COLLATE=fr_FR.UTF-8 LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=fr_FR.UTF-8 LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] GenomicRanges_1.34.0 GenomeInfoDb_1.18.2 IRanges_2.16.0 S4Vectors_0.20.1 BiocGenerics_0.28.0 ggplot2_3.2.0 Seurat_3.0.3.9010 dplyr_0.8.3

loaded via a namespace (and not attached): [1] Rtsne_0.15 colorspace_1.4-1 ggridges_0.5.1 XVector_0.22.0 rstudioapi_0.10 listenv_0.7.0 npsurv_0.4-0 ggrepel_0.8.1
[9] bit64_0.9-7 codetools_0.2-16 splines_3.5.2 R.methodsS3_1.7.1 lsei_1.2-0 knitr_1.23 jsonlite_1.6 Rsamtools_1.34.1
[17] ica_1.0-2 cluster_2.1.0 png_0.1-7 R.oo_1.22.0 sctransform_0.2.0 compiler_3.5.2 httr_1.4.0 assertthat_0.2.1
[25] Matrix_1.2-17 lazyeval_0.2.2 htmltools_0.3.6 tools_3.5.2 rsvd_1.0.1 igraph_1.2.4.1 gtable_0.3.0 glue_1.3.1
[33] GenomeInfoDbData_1.2.1 RANN_2.6.1 reshape2_1.4.3 Rcpp_1.0.1 Biobase_2.42.0 Biostrings_2.50.2 gdata_2.18.0 ape_5.3
[41] nlme_3.1-140 rtracklayer_1.42.2 gbRd_0.4-11 lmtest_0.9-37 xfun_0.8 stringr_1.4.0 globals_0.12.4 irlba_2.3.3
[49] gtools_3.8.1 XML_3.98-1.20 future_1.14.0 MASS_7.3-51.4 zlibbioc_1.28.0 zoo_1.8-6 scales_1.0.0 SummarizedExperiment_1.12.0 [57] RColorBrewer_1.1-2 yaml_2.2.0 reticulate_1.12 pbapply_1.4-0 gridExtra_2.3 stringi_1.4.3 caTools_1.17.1.2 BiocParallel_1.16.6
[65] bibtex_0.4.2 Rdpack_0.11-0 SDMTools_1.1-221.1 rlang_0.4.0 pkgconfig_2.0.2 bitops_1.0-6 matrixStats_0.54.0 lattice_0.20-38
[73] ROCR_1.0-7 purrr_0.3.2 GenomicAlignments_1.18.1 htmlwidgets_1.3 bit_1.1-14 cowplot_1.0.0 tidyselect_0.2.5 RcppAnnoy_0.0.12
[81] plyr_1.8.4 magrittr_1.5 R6_2.4.0 gplots_3.0.1.1 DelayedArray_0.8.0 pillar_1.4.2 withr_2.1.2 fitdistrplus_1.0-14
[89] survival_2.44-1.1 RCurl_1.95-4.12 tibble_2.1.3 future.apply_1.3.0 tsne_0.1-3 crayon_1.3.4 hdf5r_1.2.0 KernSmooth_2.23-15
[97] plotly_4.9.0 grid_3.5.2 data.table_1.12.2 metap_1.1 digest_0.6.20 tidyr_0.8.3 R.utils_2.9.0 munsell_0.5.0
[105] viridisLite_0.3.0

timoast commented 5 years ago

Can you try the suggestions in https://github.com/satijalab/seurat/issues/1918 and see if they fix the problem?

drbecavin commented 5 years ago

Thanks ! I have tried that:

peaks <- Read10X_h5("atacseq10x_pig_wt/filtered_peak_bc_matrix.h5")
rownames(peaks) <- paste("chr", rownames(peaks), sep="")
chromosomes = c(1:18, "X", "Y")
chromosomes = paste("chr", chromosomes, sep="")
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks, annotation.file = "Genome/Sscrofa11.1.91.scAtacSeq.v2.gtf", seq.levels = chromosomes, upstream = 2000, verbose = TRUE, include.body = FALSE)

With no success ...

drbecavin commented 5 years ago

What about the

Extend()

function ? Where does it come from ?

timoast commented 5 years ago

Can you print the first few lines of your gtf file and the output of head(rownames(peaks))?

Extend is defined in Seurat but not exported.

drbecavin commented 5 years ago

First few lines of GTF used by Cellranger-atac:

1 ensembl transcript 3472 18546 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000065539"; transcript_version "1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; gene_name "ENSSSCG00000037372"; 1 ensembl transcript 3472 18546 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000042326"; transcript_version "1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; gene_name "ENSSSCG00000037372"; 1 ensembl transcript 5659 18696 . - . gene_id "ENSSSCG00000037372"; gene_version "1"; transcript_id "ENSSSCT00000046027"; transcript_version "1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_source "ensembl"; transcript_biotype "protein_coding"; gene_name "ENSSSCG00000037372"; 1 ensembl transcript 23368 39931 . + . gene_id "ENSSSCG00000027257"; gene_version "2"; transcript_id "ENSSSCT00000058189"; transcript_version "1"; gene_name "PSMB1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_name "PSMB1-202"; transcript_source "ensembl"; transcript_biotype "protein_coding"; 1 ensembl transcript 23853 40113 . + . gene_id "ENSSSCG00000027257"; gene_version "2"; transcript_id "ENSSSCT00000023500"; transcript_version "2"; gene_name "PSMB1"; gene_source "ensembl"; gene_type "protein_coding"; transcript_name "PSMB1-201"; transcript_source "ensembl"; transcript_biotype "protein_coding";

and output of head(rownames(peaks))

[1] "1:4427-5058" "1:21725-22544" "1:23302-24212" "1:43674-44167" "1:54085-54489" "1:67850-70422"

I have tried this to add manually the UCSC format = "chr" in front of chromosomes

peaks_pig_wt <- Read10X_h5("atacseq10x_pig_wt/filtered_peak_bc_matrix.h5")
rownames(peaks_pig_wt) <- paste("chr", rownames(peaks_pig_wt), sep="")
peaks_new <- peaks_pig_wt
head(rownames(peaks_pig_wt))
dimnames(peaks_new)[[1]] <- paste("chr", dimnames(peaks_pig_wt)[[1]], sep="")

chromosomes = c(1:18, "X", "Y")
chromosomes = paste("chr", chromosomes, sep="")
activity.matrix <- CreateGeneActivityMatrix(peak.matrix = peaks_pig_wt, annotation.file = "/home/becavin/Genome/Sscrofa11.1.91.scAtacSeq.v2.gtf", seq.levels = chromosomes, upstream = 2000, verbose = TRUE)

Still not working with error:

Error in $<-.data.frame(*tmp*, "peak", value = ":-") : replacement has 1 row, data has 0

timoast commented 5 years ago

Do you have a link to the gtf file?

drbecavin commented 5 years ago

My GTF is extracted from this one: ftp://ftp.ensembl.org/pub/release-91/gtf/sus_scrofa/

I have parsed it to make it compliant with cellranger-atac software. The parsing being :

I am running Cellranger-atac again with new annotation with chromosome names having "chr" in their name. Just to test !

drbecavin commented 5 years ago

Ok, Solved !!! I have found 3 bugs/blocking instruction There is 3 thing in your function Seurat::CreateGeneActivityMatrix which make it crashes with custom annotation.

Added for fixing rows without "gene_name" attribute in GTF file

gene.ids$gene_name[is.na(gene.ids$gene_name)] = gene.ids$gene_id[is.na(gene.ids$gene_name)]

peak.ids$gene.name <- gene.ids$gene_name



(I can submit a pull request tomorrow with these fix if you want. I think it will help many people wanting to try Seurat for scATACSeq with custom reference, before we all switch to Signac.)