Closed bio-la closed 2 years ago
the fragment file is indexed using a hash of the classic 10x barcode, so have to play around with the metadata file
Can you clarify what you mean exactly?
I downloaded this data from GEO and was not able to reproduce this issue, my matrix contained many non-zero counts. However, the metadata file on GEO is called GSM5123951_X066-MP0C1W3_leukopak_perm-cells_tea_200M_atac_filtered_metadata.csv.gz
, did you modify the original file? If so please include all steps that you ran.
FYI, the function getPeaks
here can be replaced with StringToGRanges()
in Signac.
thanks for your quick reply! this is a snapshot of the fragment file
chr1 10067 10197 69c31d5e3ef411ebb8b442010a19c80f 6
chr1 10067 10407 69c783b23ef411ebb8b442010a19c80f 4
chr1 10073 10203 69c5efa23ef411ebb8b442010a19c80f 12
chr1 10078 10321 69c27dfe3ef411ebb8b442010a19c80f 4
chr1 10084 10192 69c1a5003ef411ebb8b442010a19c80f 2
chr1 10085 10221 69c093183ef411ebb8b442010a19c80f 1
chr1 10085 10279 69c05e023ef411ebb8b442010a19c80f 2
chr1 10085 10309 69c3e4783ef411ebb8b442010a19c80f 2
chr1 10090 10272 69c9f9b23ef411ebb8b442010a19c80f 2
chr1 10091 10278 69c198ee3ef411ebb8b442010a19c80f 2
I only changed the name of the file. I am puzzled, did you run the same code i posted? i can't produce a count matrix for any of the downloaded samples (i didn't modify anything of the other two ) can you share the code you used that worked for you? Thanks!
Here is a simplified example that produces a count matrix using the data from GEO:
library(Seurat)
library(Signac)
library(GenomicRanges)
setwd("~/issue_1228/")
s <- Read10X_h5("GSM5123952_X066-MP0C1W4_leukopak_perm-cells_tea_200M_cellranger-arc_filtered_feature_bc_matrix.h5")
s.meta <- read.table("GSM5123951_X066-MP0C1W3_leukopak_perm-cells_tea_200M_atac_filtered_metadata.csv.gz", sep=",", header=T)
gr <- StringToGRanges(rownames(s$Peaks), sep = c(":", "-"))
frags.s <- CreateFragmentObject(
path = "GSM5123951_X066-MP0C1W3_leukopak_perm-cells_tea_200M_atac_filtered_fragments.tsv.gz",
cells = s.meta$barcodes
)
s.counts <- FeatureMatrix(
fragments = frags.s,
features = head(gr, 1000),
cells = s.meta$barcodes
)
range(rowSums(s.counts))
# 0 22934
I'm still unclear on what you mean by this:
the fragment file is indexed using a hash of the classic 10x barcode, so have to play around with the metadata file
thanks, I have tried running this code with the original metadata file but i still get 0 counts. what version of Signac/Seurat/Granges are you running? (also, there is a typo and you are reading Read10X_h5("GSM5123952") but metadata and fragments from GSM5123951?)
Sorry for the poor choice of words. What I meant is that column 4 of the fragment file usually has the cellbarcode in the standard 10x format (or at least that's what i would expect), here the name of the cell resembles a hashed string.
the cell 69c31d5e3ef411ebb8b442010a19c80f
is CTAGGCGGTAGCTGCG-1
chr1 10067 10197 69c31d5e3ef411ebb8b442010a19c80f 6
by play around i mean use s.meta$barcodes
instead of calling the colnames on the peak matrix
(also, there is a typo and you are reading Read10X_h5("GSM5123952") but metadata and fragments from GSM5123951?)
That was a typo yes, but the count matrix is only use here to define the genomic ranges being quantified so it doesn't matter.
Here's my session info:
R version 4.1.3 (2022-03-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] stxKidney.SeuratData_0.1.0 pbmcsca.SeuratData_3.0.0 pbmcref.SeuratData_1.0.0
[4] pbmcMultiome.SeuratData_0.1.2 pbmc3k.SeuratData_3.1.4 panc8.SeuratData_3.0.2
[7] ifnb.SeuratData_3.1.0 hcabm40k.SeuratData_3.0.0 bmcite.SeuratData_0.3.0
[10] SeuratData_0.2.1 ggplot2_3.3.6 BSgenome.Hsapiens.UCSC.hg38_1.4.4
[13] BSgenome_1.62.0 rtracklayer_1.54.0 Biostrings_2.62.0
[16] XVector_0.34.0 EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.18.2
[19] AnnotationFilter_1.18.0 GenomicFeatures_1.46.1 AnnotationDbi_1.56.2
[22] Biobase_2.54.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[25] IRanges_2.28.0 S4Vectors_0.32.4 BiocGenerics_0.40.0
[28] sp_1.5-0 SeuratObject_4.1.2 Seurat_4.1.1.9006
[31] Signac_1.8.0 testthat_3.1.4
loaded via a namespace (and not attached):
[1] utf8_1.2.2 reticulate_1.26 tidyselect_1.1.2 RSQLite_2.2.16
[5] htmlwidgets_1.5.4 grid_4.1.3 BiocParallel_1.28.3 Rtsne_0.16
[9] munsell_0.5.0 codetools_0.2-18 ica_1.0-3 future_1.28.0
[13] miniUI_0.1.1.1 withr_2.5.0 spatstat.random_2.2-0 colorspace_2.0-3
[17] progressr_0.11.0 filelock_1.0.2 knitr_1.40 rstudioapi_0.14
[21] ROCR_1.0-11 tensor_1.5 listenv_0.8.0 MatrixGenerics_1.6.0
[25] GenomeInfoDbData_1.2.7 polyclip_1.10-0 bit64_4.0.5 farver_2.1.1
[29] parallelly_1.32.1 vctrs_0.4.1 generics_0.1.3 xfun_0.33
[33] BiocFileCache_2.2.0 R6_2.5.1 hdf5r_1.3.5 bitops_1.0-7
[37] spatstat.utils_2.3-1 cachem_1.0.6 DelayedArray_0.20.0 assertthat_0.2.1
[41] promises_1.2.0.1 BiocIO_1.4.0 scales_1.2.1 rgeos_0.5-9
[45] gtable_0.3.1 globals_0.16.1 goftest_1.2-3 rlang_1.0.5
[49] RcppRoll_0.3.0 splines_4.1.3 lazyeval_0.2.2 spatstat.geom_2.4-0
[53] yaml_2.3.5 reshape2_1.4.4 abind_1.4-5 httpuv_1.6.6
[57] tools_4.1.3 ellipsis_0.3.2 spatstat.core_2.4-4 RColorBrewer_1.1-3
[61] ggridges_0.5.3 Rcpp_1.0.9 plyr_1.8.7 progress_1.2.2
[65] zlibbioc_1.40.0 purrr_0.3.4 RCurl_1.98-1.8 prettyunits_1.1.1
[69] rpart_4.1.16 deldir_1.0-6 pbapply_1.5-0 cowplot_1.1.1
[73] zoo_1.8-11 SummarizedExperiment_1.24.0 ggrepel_0.9.1 cluster_2.1.3
[77] magrittr_2.0.3 data.table_1.14.2 scattermore_0.8 lmtest_0.9-40
[81] RANN_2.6.1 ProtGenerics_1.26.0 fitdistrplus_1.1-8 matrixStats_0.62.0
[85] pkgload_1.3.0 evaluate_0.16 hms_1.1.2 patchwork_1.1.2
[89] mime_0.12 xtable_1.8-4 XML_3.99-0.10 gridExtra_2.3
[93] compiler_4.1.3 biomaRt_2.50.0 tibble_3.1.8 KernSmooth_2.23-20
[97] crayon_1.5.1 htmltools_0.5.3 mgcv_1.8-40 later_1.3.0
[101] tidyr_1.2.0 DBI_1.1.3 tweenr_2.0.1 dbplyr_2.2.1
[105] MASS_7.3-56 rappdirs_0.3.3 Matrix_1.5-1 brio_1.1.2
[109] cli_3.3.0 parallel_4.1.3 igraph_1.3.4 pkgconfig_2.0.3
[113] GenomicAlignments_1.30.0 plotly_4.10.0 spatstat.sparse_2.1-1 xml2_1.3.3
[117] stringr_1.4.1 digest_0.6.29 sctransform_0.3.4 RcppAnnoy_0.0.19
[121] spatstat.data_2.2-0 rmarkdown_2.15 leiden_0.4.3 fastmatch_1.1-3
[125] uwot_0.1.14 restfulr_0.0.15 curl_4.3.2 shiny_1.7.2
[129] Rsamtools_2.10.0 rjson_0.2.21 lifecycle_1.0.1 nlme_3.1-159
[133] jsonlite_1.8.0 viridisLite_0.4.1 fansi_1.0.3 pillar_1.8.1
[137] lattice_0.20-45 KEGGREST_1.34.0 fastmap_1.1.0 httr_1.4.4
[141] survival_3.4-0 glue_1.6.2 png_0.1-7 bit_4.0.4
[145] ggforce_0.3.4 stringi_1.7.8 blob_1.2.2 memoise_2.0.1
[149] dplyr_1.0.10 irlba_2.3.5 future.apply_1.9.1
apart from your Seurat_4.1.1.9006 i don't see discrepancies in the tool versions. I am running R 4.1.2 vs your 4.1.3 but i don't think this is a major issue. I am at a loss for an explanation here. I will try to create a minimal conda env with these precise packages versions (Signac, Granges, Seurat) and R version and check again.
One last thing, can you post the command you used to index the fragment file? I don't get any error with my tbi indices, just to leave no stone unturned...
thank you very much for you patience and support with this.
tabix -p bed <filename>
You could try re-downloading the files from GEO
downloaded files anew and recreated the tbi indices. No issue this time! I must have corrupted some of those moving from one directory to another.
thank you very much again for your help, I will close the issue now.
hi, thanks for the great tool! similarly to what was reported in #205 #224 i found that FeatureMatrix returns an empty count matrix after merging the peaks from 3 atac runs.
I am using the Teaseq dataset which provides, for each sample atac modality, an h5 with Peaks and GEX, a metadata file, and a compressed fragment file, with no index.
I generated the index using tabix, and i follow the merging vignette
I went down the rabbit hole until
GetCellsInRegion
which callsreads <- scanTabix(file = tabix, param = region)
for chunks of 1900 regions at a time, finding thatreads
returnscharacter(0)
at every assayed region. I am not sure how to interpret or to fix this. The cell barcodes in the fragment file are definitely the ones assayed in both Peaks and GEX (only, the fragment file is indexed using a hash of the classic 10x barcode, so have to play around with the metadata file )here is the code i am using
Changing the
sep
param tosep=c(":","-")
returnsError in intI(i, n = x@Dim[1], dn[[1]], give.dn = FALSE) : invalid character indexing
Omitting the cells params still returns a 0 count matrix.
Thanks for your help!