stuart-lab / signac

R toolkit for the analysis of single-cell chromatin data
https://stuartlab.org/signac/
Other
327 stars 88 forks source link

Error in CreateFragmentObject(path = "", : Not all cells requested could be found in the fragment file. #848

Closed hchintalapudi closed 3 years ago

hchintalapudi commented 3 years ago

Hello, I was following the merge vignette and I get an error when I try to create a fragment object.

Here is my code:

plan("multiprocess", workers = 4)
options(future.globals.maxSize = 50000 * 1024^2) # for 50 Gb RAM

# read in peak sets
peaks_zmel_t.1 <- read.table(
  file = "/Users/hchintalapudi/Desktop/work/RO1/ZMEL_transplant/atac_peaks.bed",
  col.names = c("chr", "start", "end")
)
peaks_zmel_c.1 <- read.table(
  file = "/Users/hchintalapudi/Desktop/work/RO1/ZMEL_cultured/atac_peaks.bed",
  col.names = c("chr", "start", "end")
)
# convert to genomic ranges
gr.zmel_t <- makeGRangesFromDataFrame(peaks_zmel_t.1)
gr.zmel_c <- makeGRangesFromDataFrame(peaks_zmel_c.1)

# Create a unified set of peaks to quantify in each dataset
combined.peaks <- reduce(x = c(gr.zmel_c, gr.zmel_t))

# Filter out bad peaks based on length
peakwidths <- width(combined.peaks)
combined.peaks <- combined.peaks[peakwidths  < 10000 & peakwidths > 20]
combined.peaks

# load metadata
md.zmel_t <- read.table(
  file = "/Users/hchintalapudi/Desktop/work/RO1/ZMEL_transplant/per_barcode_metrics.csv",
  stringsAsFactors = FALSE,
  sep = ",",
  header = TRUE,
  row.names = 1
)[-1, ] # remove the first row

md.zmel_c <- read.table(
  file = "/Users/hchintalapudi/Desktop/work/RO1/ZMEL_cultured/per_barcode_metrics.csv",
  stringsAsFactors = FALSE,
  sep = ",",
  header = TRUE,
  row.names = 1
)[-1, ] # remove the first row

# perform an initial filtering of low count cells
# ** skipping this filtering step as the file is different for multiome than for tradional scATACseq file
#md.zmel_t <- md.zmel_t[md.zmel_t$passed_filters > 500, ]
#md.1k <- md.1k[md.1k$passed_filters > 500, ]

# create fragment objects
frags.zmel_t <- CreateFragmentObject(
  path = "/Users/hchintalapudi/Desktop/work/RO1/ZMEL_transplant/atac_fragments.tsv.gz",
  cells = rownames(md.zmel_t)
)
frags.zmel_c <- CreateFragmentObject(
  path = "/Users/hchintalapudi/Desktop/work/RO1/ZMEL_cultured/atac_fragments.tsv.gz",
  cells = rownames(md.zmel_c)
)
Computing hash
Checking for 724300 cell barcodes
Error in CreateFragmentObject(path = "/Users/hchintalapudi/Desktop/work/RO1/ZMEL_cultured/atac_fragments.tsv.gz",  : 
  Not all cells requested could be found in the fragment file.

My unzipped fragment file:


gzip -d atac_fragments.tsv.gz
 head -n 50 atac_fragments.tsv
# id=ZMEL-cultured
# description=ZMEL-cultured
#
# pipeline_name=cellranger-arc
# pipeline_version=cellranger-arc-2.0.0
#
# reference_path=/dartfs-hpc/rc/lab/G/GSR_Active/scripts/refs/GRCz11/GRCz11-ARC
# reference_fasta_hash=2e0a72b12f511f0282b8460f7435f6851e425136
# reference_gtf_hash=0fe4db37fa5a757053577211f53df51afe60bfdf
# reference_version=
# mkref_version=cellranger-arc-2.0.0
#
# primary_contig=1
# primary_contig=10
# primary_contig=11
# primary_contig=12
# primary_contig=13
# primary_contig=14
# primary_contig=15
# primary_contig=16
# primary_contig=17
# primary_contig=18
# primary_contig=19
# primary_contig=2
# primary_contig=20
# primary_contig=21
# primary_contig=22
# primary_contig=23
# primary_contig=24
# primary_contig=25
# primary_contig=3
# primary_contig=4
# primary_contig=5
# primary_contig=6
# primary_contig=7
# primary_contig=8
# primary_contig=9
1   5   139 TAGCGCGGTATTGCAG-1  1
1   41  397 ATGTGAGAGGGTCTAT-1  1
1   89  153 TGAGCTTAGGTGCGGA-1  2
1   91  222 AACATTGTCACAGACT-1  1
1   91  445 AACATTGTCACAGACT-1  1
1   126 249 AGCTTAATCTAATCAG-1  2
1   126 523 ACTAACGGTTGCACAA-1  1
1   134 264 TGCTCAACAAGCTTAT-1  3
1   142 272 GGACCTCAGCCTCTGT-1  2
1   142 300 AGGCCCAGTAACTACG-1  1
1   142 514 AGGAACCAGTTTGGGT-1  1
1   153 338 TGTATCGCATTATGCG-1  1
1   155 322 CCCGCAACAACCGCCA-1  1

I see in #748 that you mention 5 columns being present in the fragment file and it looks like I have 5 columns. What could've gone wrong in my case?

Thank you.

timoast commented 3 years ago

Here you're checking for every possible barcode, since you did not do any filtering of the metadata table. In the merge vignette, we apply an initial count-based cutoff to retain a set of cell barcodes for later analysis. Since you include all barcodes here, many will be present at extremely low frequency in the fragment file, and so most are not able to be detected in the fragment file checks that we perform.

hchintalapudi commented 3 years ago

Right, I skipped that step as I intuitively did not know which field to use for filtering, as 'passed_filters' is not present in my metadata file. That's because I have multiome data and the format/fields are little different.

From my old scATAC data:

colnames(singlecell)
 [1] "duplicate"                        "chimeric"                         "unmapped"                        
 [4] "lowmapq"                          "mitochondrial"                    "passed_filters"                  
 [7] "cell_id"                          "is__cell_barcode"                 "TSS_fragments"                   
[10] "DNase_sensitive_region_fragments" "enhancer_region_fragments"        "promoter_region_fragments"       
[13] "on_target_fragments"              "blacklist_region_fragments"       "peak_region_fragments"           
[16] "peak_region_cutsites"   

Current MULTIome data:

colnames(md.zmel_c)
 [1] "gex_barcode"                       "atac_barcode"                      "is_cell"                          
 [4] "excluded_reason"                   "gex_raw_reads"                     "gex_mapped_reads"                 
 [7] "gex_conf_intergenic_reads"         "gex_conf_exonic_reads"             "gex_conf_intronic_reads"          
[10] "gex_conf_exonic_unique_reads"      "gex_conf_exonic_antisense_reads"   "gex_conf_exonic_dup_reads"        
[13] "gex_exonic_umis"                   "gex_conf_intronic_unique_reads"    "gex_conf_intronic_antisense_reads"
[16] "gex_conf_intronic_dup_reads"       "gex_intronic_umis"                 "gex_conf_txomic_unique_reads"     
[19] "gex_umis_count"                    "gex_genes_count"                   "atac_raw_reads"                   
[22] "atac_unmapped_reads"               "atac_lowmapq"                      "atac_dup_reads"                   
[25] "atac_chimeric_reads"               "atac_mitochondrial_reads"          "atac_fragments"                   
[28] "atac_TSS_fragments"                "atac_peak_region_fragments"        "atac_peak_region_cutsites"  

What's the best way to filter based on the metadata I have? Probably with "excluded_reason"? from CellRanger Thanks for your help.

timoast commented 3 years ago

Explanation of the cellranger ARC metadata outputs is here: https://support.10xgenomics.com/single-cell-multiome-atac-gex/software/pipelines/latest/output/per_barcode_metrics

"atac_fragments" would be the total counts per barcode

hchintalapudi commented 3 years ago

Aah! Thank you.