linesError: fragment file lines too long

ZivTQ commented 11 months ago

Hi，I have some probloms in Signac when running: NucleosomeSignal(object = obj)

I got Error: Done Processing 145 million linesError: fragment file lines too long Error in[.data.frame(counts, cells.keep > 0, c("mononucleosomal", "nucleosome_free")) : undefined columns selected

I try to change n in the function NucleosomeSignal, However, It is still not work. Does anyone know how to fix it? Thanks very much!

timoast commented 10 months ago

Hi, it looks like your fragment file may be incorrectly formatted. Can you show the first few lines of your fragment file?

ZivTQ commented 10 months ago

Hi, thanks for your reply! Here is the fragment file:

head(Fragments(sce_atac)[[5]]) chrom start end barcode readCount 1 1 10072 10326 TACTGAGGTCAAGTAT-1 1 2 1 10073 10215 GGTTATATCATTTGTC-1 1 3 1 10073 10321 TAGTCAATCATAGACC-1 1 4 1 10078 10309 AGCTAAACATTATGCG-1 2 5 1 10078 10309 GGATATTGTTGTTGTC-1 3 6 1 10078 10327 TTTAACGAGGCCGGAA-1 3

I wondered if it was because I had too many samples (eight samples merged), because when I did this step, I combined all the samples. So I run the process alone, and then there were no problems. As follows:

Annotation(A) <- annotations A<- NucleosomeSignal(object = A) A<- TSSEnrichment(object = A, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(B) <- annotations B<- NucleosomeSignal(object = B) B<- TSSEnrichment(object = B, fast = FALSE, verbose = T, assay = 'peaks') ............

However, When I run CallPeaks(), the sample error occured.

peaks <- CallPeaks(object = sce_atac, group.by = "AnnotatedcellType",outdir = "./")

Error

Processing file /home/ATAC_data/A/atac_fragments.tsv.gz Keeping 10081 cell barcodes Splitting into 5 files Done Processing 153 million lines

Processing file /home/ATAC_data/B/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 202 million lines

Processing file /home/ATAC_data/C/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 145 million lines

Processing file /home/ATAC_data/D/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 94 million lines

Processing file /home/ATAC_data/E/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 140 million linesError: fragment file lines too long Error in SplitFragments(object = object, assay = assay, group.by = group.by, : Error: cannot open requested file

I am puzzled at the question. Hope to get your help. Thank you very much!

sessionInfo() R version 4.2.2 Patched (2022-11-10 r83330) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 20.04.5 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale: [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] stats4 grid stats graphics grDevices utils datasets methods
[9] base

other attached packages: [1] EnsDb.Hsapiens.v86_2.99.0 ensembldb_2.22.0 AnnotationFilter_1.22.0
[4] GenomicFeatures_1.50.4 AnnotationDbi_1.60.2 Biobase_2.58.0
[7] GenomicRanges_1.50.2 GenomeInfoDb_1.34.9 IRanges_2.32.0
[10] S4Vectors_0.36.2 BiocGenerics_0.44.0 dplyr_1.1.4
[13] ComplexHeatmap_2.14.0 ggpubr_0.6.0 qs_0.25.7
[16] clustree_0.5.1 ggraph_2.1.0 ggplot2_3.4.4
[19] Seurat_5.0.1 harmony_1.2.0 Rcpp_1.0.11
[22] Signac_1.12.9003 SeuratObject_5.0.1 sp_2.1-2

timoast commented 10 months ago

I'm not understanding...is the original issue with NucleosomeSignal now fixed? If there is a new error please include the full code you're running that results in an error.

ZivTQ commented 10 months ago

peaks <- CallPeaks(object = sce_atac, group.by = "AnnotatedcellType",outdir = "./") ####Error Processing file /home/ATAC_data/A/atac_fragments.tsv.gz Keeping 10081 cell barcodes Splitting into 5 files Done Processing 153 million lines

Processing file /home/ATAC_data/B/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 202 million lines

Processing file /home/ATAC_data/C/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 145 million lines

Processing file /home/ATAC_data/D/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 94 million lines

Processing file /home/ATAC_data/E/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 140 million linesError: fragment file lines too long Error in SplitFragments(object = object, assay = assay, group.by = group.by, : Error: cannot open requested file

Hi, Thanks you. The error occured when running CallPeaks

peaks <- CallPeaks(object = sce_atac, group.by = "AnnotatedcellType",outdir = "./")

Error

Processing file /home/ATAC_data/A/atac_fragments.tsv.gz Keeping 10081 cell barcodes Splitting into 5 files Done Processing 153 million lines

Processing file /home/ATAC_data/B/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 202 million lines

Processing file /home/ATAC_data/C/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 145 million lines

Processing file /home/ATAC_data/D/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 94 million lines

Processing file /home/ATAC_data/E/atac_fragments.tsv.gz Keeping 0 cell barcodes Splitting into 5 files Done Processing 140 million linesError: fragment file lines too long Error in SplitFragments(object = object, assay = assay, group.by = group.by, : Error: cannot open requested file

timoast commented 10 months ago

please include the full code you're running that results in an error

ZivTQ commented 10 months ago

please include the full code you're running that results in an error

Sorry, All of the codes are here:

remotes::install_github("stuart-lab/signac", ref = "develop")
library(Signac)
library(Seurat)
library(GenomeInfoDb)
library(EnsDb.Hsapiens.v86)
library(BSgenome.Hsapiens.UCSC.hg38)
library(harmony)
library(tidyverse)
library(ggplot2)
library(ggpubr)
library(ggalluvial)

#1、load data ====================================
# load data A1
A1_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/A1/filtered_feature_bc_matrix.h5")
A1_counts <- A1_counts[[2]]
A1_assay <- CreateChromatinAssay(counts = A1_counts,
                                  sep = c(":", "-"),
                                  genome = 'hg38',
                                  fragments = '/home/ks_ts/ATAC/A1/atac_fragments.tsv.gz',
                                  min.cells = 10,
                                  min.features = 200)
A1 <- CreateSeuratObject(A1_assay, assay = 'peaks',project = 'A1')
rm(A1_assay, A1_counts)

# load data A2
A2_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/A2/filtered_feature_bc_matrix.h5")
A2_counts <- A2_counts[[2]]
A2_assay <- CreateChromatinAssay(counts = A2_counts,
                                  sep = c(":", "-"),
                                  genome = 'hg38',
                                  fragments = '/home/ks_ts/ATAC/A2/atac_fragments.tsv.gz',
                                  min.cells = 10,
                                  min.features = 200)
A2 <- CreateSeuratObject(A2_assay, assay = 'peaks',project = 'A2')
rm(A2_assay, A2_counts)

# load data A3
A3_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/A3/filtered_feature_bc_matrix.h5")
A3_counts <- A3_counts[[2]]
A3_assay <- CreateChromatinAssay(counts = A3_counts,
                                  sep = c(":", "-"),
                                  genome = 'hg38',
                                  fragments = '/home/ks_ts/ATAC/A3/atac_fragments.tsv.gz',
                                  min.cells = 10,
                                  min.features = 200)
A3 <- CreateSeuratObject(A3_assay, assay = 'peaks',project = 'A3')
rm(A3_assay, A3_counts)

# load data A4
A4_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/A4/filtered_feature_bc_matrix.h5")
A4_counts <- A4_counts[[2]]
A4_assay <- CreateChromatinAssay(counts = A4_counts,
                                  sep = c(":", "-"),
                                  genome = 'hg38',
                                  fragments = '/home/ks_ts/ATAC/A4/atac_fragments.tsv.gz',
                                  min.cells = 10,
                                  min.features = 200)
A4 <- CreateSeuratObject(A4_assay, assay = 'peaks',project = 'A4')
rm(A4_assay, A4_counts)

# load data B1
B1_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/TA1/filtered_feature_bc_matrix.h5")
B1_counts <- B1_counts[[2]]
B1_assay <- CreateChromatinAssay(counts = B1_counts,
                                   sep = c(":", "-"),
                                   genome = 'hg38',
                                   fragments = '/home/ks_ts/ATAC/B1/atac_fragments.tsv.gz',
                                   min.cells = 10,
                                   min.features = 200)
B1 <- CreateSeuratObject(B1_assay, assay = 'peaks',project = 'B1')
rm(B1_assay, B1_counts)

# load data B2
B2_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/TA2/filtered_feature_bc_matrix.h5")
B2_counts <- B2_counts[[2]]
B2_assay <- CreateChromatinAssay(counts = B2_counts,
                                   sep = c(":", "-"),
                                   genome = 'hg38',
                                   fragments = '/home/ks_ts/ATAC/B2/atac_fragments.tsv.gz',
                                   min.cells = 10,
                                   min.features = 200)
B2 <- CreateSeuratObject(B2_assay, assay = 'peaks',project = 'B2')
rm(B2_assay, B2_counts)

# load data B3
B3_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/TA3/filtered_feature_bc_matrix.h5")
B3_counts <- B3_counts[[2]]
B3_assay <- CreateChromatinAssay(counts = B3_counts,
                                   sep = c(":", "-"),
                                   genome = 'hg38',
                                   fragments = '/home/ks_ts/ATAC/B3/atac_fragments.tsv.gz',
                                   min.cells = 10,
                                   min.features = 200)
B3 <- CreateSeuratObject(B3_assay, assay = 'peaks',project = 'B3')
rm(B3_assay, B3_counts)

# load data B4
B4_counts <- Read10X_h5(filename = "/home/ks_ts/scATAC/TA4/filtered_feature_bc_matrix.h5")
B4_counts <- B4_counts[[2]]
B4_assay <- CreateChromatinAssay(counts = B4_counts,
                                   sep = c(":", "-"),
                                   genome = 'hg38',
                                   fragments = '/home/ks_ts/ATAC/B4/atac_fragments.tsv.gz',
                                   min.cells = 10,
                                   min.features = 200)
B4 <- CreateSeuratObject(B4_assay, assay = 'peaks',project = 'B4')
rm(B4_assay, B4_counts)

#2、preprocess of data ====================================

annotations <- GetGRangesFromEnsDb(ensdb = EnsDb.Hsapiens.v86)
# seqlevelsStyle(annotations) <- 'UCSC'
genome(annotations) <- "hg38"
Annotation(A1) <- annotations
A1 <- NucleosomeSignal(object = A1)
A1 <- TSSEnrichment(object = A1, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(A2) <- annotations
A2 <- NucleosomeSignal(object = A2)
A2 <- TSSEnrichment(object = A2, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(A3) <- annotations
A3 <- NucleosomeSignal(object = A3)
A3 <- TSSEnrichment(object = A3, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(A4) <- annotations
A4 <- NucleosomeSignal(object = A4)
A4 <- TSSEnrichment(object = A4, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(B1) <- annotations
B1 <- NucleosomeSignal(object = B1)
B1 <- TSSEnrichment(object = B1, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(B2) <- annotations
B2 <- NucleosomeSignal(object = B2)
B2 <- TSSEnrichment(object = B2, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(B3) <- annotations
B3 <- NucleosomeSignal(object = B3)
B3 <- TSSEnrichment(object = B3, fast = FALSE, verbose = T, assay = 'peaks')

Annotation(B4) <- annotations
B4 <- NucleosomeSignal(object = B4)
B4 <- TSSEnrichment(object = B4, fast = FALSE, verbose = T, assay = 'peaks')

# merge all datasets
combined <- merge(x = A1,
                  y = list(A2, A3, A4, B1, B2, B3, B4),
                  add.cell.ids = c("A1", "A2", "A3", "A4", "B1", "B2", "B3", "B4"))
save(combined, file = "combined.RData")
rm(A1,A2, A3, A4, B1, B2, B3, B4)

combined$blacklist_fraction <- FractionCountsInRegion(object = combined, 
                                                      assay = 'peaks',
                                                      regions = blacklist_hg38)

# combined$pct_reads_in_peaks <- combined$peak_region_fragments / combined$passed_filters * 100
# combined$blacklist_ratio <- combined$blacklist_region_fragments / combined$peak_region_fragments
combined$high.tss <- ifelse(combined$TSS.enrichment > 2, 'High', 'Low')
combined$nucleosome_group <- ifelse(combined$nucleosome_signal > 4, 'NS > 4', 'NS < 4')

TSSPlot(combined, group.by = 'high.tss', assay = 'peaks') + NoLegend()
FragmentHistogram(object = combined, group.by = 'nucleosome_group')

#3、cluster ===================================
combined <- RunTFIDF(combined)
combined <- FindTopFeatures(combined, min.cutoff = 'q0')
combined <- RunSVD(combined)
library(harmony)
combined <- RunHarmony(object = combined,
                       group.by.vars = 'orig.ident',
                       reduction.use = 'lsi',
                       assay.use = 'peaks',
                       project.dim = FALSE)

combined <- RunUMAP(combined, dims = 2:30, reduction = "harmony")
combined<- FindNeighbors(object = combined,reduction = "harmony",dims = 2:30)
combined<- FindClusters(object = combined, algorithm = 3,resolution = seq(0.2, 1.2, by = 0.1),verbose = FALSE)

library(clustree)
clustree(combined)

DimPlot(object = combined, reduction = 'umap',label = TRUE) + NoLegend()
DimPlot(object = combined, reduction = 'umap',label = TRUE, split.by = 'orig.ident', ncol = 4) + NoLegend()
# save(combined, file = "combined_cluster.RData")

#add gene activatied matrix
gene.activities <- GeneActivity(combined)
# add the gene activity matrix to the Seurat object as a new assay and normalize it
combined[['ACTIVITY']] <- CreateAssayObject(counts = gene.activities)
combined <- NormalizeData(
  object = combined,
  assay = 'ACTIVITY',
  normalization.method = 'LogNormalize',
  scale.factor = median(combined$nCount_ACTIVITY)
)
# save(combined, file = "combined_cluster.RData")

#4、integrate.scATAC.scRNA ====================================
#load scRNA data
library(qs)
sce_RNA <- qs::qread("./sce_anno.qs")

#ATAC data
combined$seurat_clusters <- combined$peaks_snn_res.1.2
Idents(combined) <- combined$seurat_clusters

DefaultAssay(combined) <- "ACTIVITY"
sce_RNA <- FindVariableFeatures(object = sce_RNA,nfeatures = 5000)
transfer.anchors <- FindTransferAnchors(reference = sce_RNA ,
                                        query = combined,reduction = 'cca',dims = 1:30)

predicted.labels <- TransferData(anchorset = transfer.anchors,
                                 refdata = sce_RNA$celltype,
                                 weight.reduction = combined[['lsi']],
                                 dims = 2:30)
combined <- AddMetaData(object = combined, metadata = predicted.labels)
# save(combined, file = "combined_cluster.RData")
colnames(combined@meta.data)[25] <- 'celltype'

#=5、recluster====================================
#### recluster
DefaultAssay(sce_ATAC) <- "peaks"
sce_ATAC <- RunTFIDF(sce_ATAC)
sce_ATAC <- FindTopFeatures(sce_ATAC)
sce_ATAC <- RunSVD(sce_ATAC)
library(harmony)
sce_ATAC <- RunHarmony(object = sce_ATAC,
                       group.by.vars = 'orig.ident',
                       reduction.use = 'lsi',
                       assay.use = 'peaks',
                       project.dim = FALSE)

sce_ATAC <- RunUMAP(sce_ATAC, dims = 2:15, reduction = "harmony")
sce_ATAC<- FindNeighbors(object = sce_ATAC,reduction = "harmony",dims = 2:15)
sce_ATAC<- FindClusters(object = sce_ATAC, algorithm = 3,resolution = seq(0.2, 1.2, by = 0.1),verbose = FALSE)

clustree(sce_ATAC)
sce_ATAC$seurat_clusters <- sce_ATAC$peaks_snn_res.0.3
DimPlot(sce_ATAC, label = T, repel = T, group.by = 'peaks_snn_res.0.4')
DimPlot(sce_ATAC, label = T, repel = T, group.by = 'celltype')

save(sce_ATAC, file = 'sce_ATAC.RData')

#6、peaks analysis====================================
DefaultAssay(sce_ATAC) <- "peaks"
Idents(sce_ATAC) <- 'celltype'

peaks <- CallPeaks(
  object = sce_ATAC,
  group.by = "celltype",
  outdir = "./",
  macs2.path = "/home/ks_ts/miniconda3/bin/macs2"
)`

In the last line, CallPeaks backed error:

`Processing file /home/ATAC_data/A1/atac_fragments.tsv.gz
Keeping 10081 cell barcodes
Splitting into 5 files
Done Processing 153 million lines

Processing file /home/ATAC_data/A2/atac_fragments.tsv.gz
Keeping 0 cell barcodes
Splitting into 5 files
Done Processing 202 million lines

Processing file /home/ATAC_data/A3/atac_fragments.tsv.gz
Keeping 0 cell barcodes
Splitting into 5 files
Done Processing 145 million lines

Processing file /home/ATAC_data/A4/atac_fragments.tsv.gz
Keeping 0 cell barcodes
Splitting into 5 files
Done Processing 94 million lines

Processing file /home/ATAC_data/A5/atac_fragments.tsv.gz
Keeping 0 cell barcodes
Splitting into 5 files
Done Processing 140 million linesError: fragment file lines too long
Error in SplitFragments(object = object, assay = assay, group.by = group.by, :
                          Error: cannot open requested file
`

My packages version:
`R version 4.2.2 Patched (2022-11-10 r83330)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

Random number generation:
 RNG:     L'Ecuyer-CMRG 
 Normal:  Inversion 
 Sample:  Rejection 

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] grid      stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] rhdf5_2.42.1                      Matrix_1.6-4                     
 [3] data.table_1.14.10                plyr_1.8.9                       
 [5] magrittr_2.0.3                    gtable_0.3.4                     
 [7] gtools_3.9.5                      gridExtra_2.3                    
 [9] ArchR_1.0.2                       ggalluvial_0.12.5                
[11] ggpubr_0.6.0                      lubridate_1.9.3                  
[13] forcats_1.0.0                     stringr_1.5.1                    
[15] dplyr_1.1.4                       purrr_1.0.2                      
[17] readr_2.1.4                       tidyr_1.3.0                      
[19] tibble_3.2.1                      tidyverse_2.0.0                  
[21] harmony_1.2.0                     Rcpp_1.0.12                      
[23] BSgenome.Hsapiens.UCSC.hg38_1.4.5 BSgenome_1.66.3                  
[25] rtracklayer_1.58.0                Biostrings_2.66.0                
[27] XVector_0.38.0                    EnsDb.Hsapiens.v86_2.99.0        
[29] ensembldb_2.22.0                  AnnotationFilter_1.22.0          
[31] GenomicFeatures_1.50.4            AnnotationDbi_1.60.2             
[33] Signac_1.12.9003                  SingleR_2.0.0                    
[35] SummarizedExperiment_1.28.0       Biobase_2.58.0                   
[37] GenomicRanges_1.50.2              GenomeInfoDb_1.34.9              
[39] IRanges_2.32.0                    S4Vectors_0.36.2                 
[41] BiocGenerics_0.44.0               MatrixGenerics_1.10.0            
[43] ggplot2_3.4.4                     Seurat_5.0.1                     
[45] matrixStats_1.1.0                 SeuratObject_5.0.1               
[47] sp_2.1-2

Thank you

timoast commented 10 months ago

Are you sure you are changing the correct column name when running colnames(combined@meta.data)[25] <- 'celltype'?

A safer way to do this is combined$celltype <- combined$predicted.id

It's weird that only the first fragment file seems to contain any cells for the 5 cell types.

I think it's possible that ~140 million lines into the 5th fragment file there is an issue with one of the lines

ZivTQ commented 10 months ago

Are you sure you are changing the correct column name when running colnames(combined@meta.data)[25] <- 'celltype'?

A safer way to do this is combined$celltype <- combined$predicted.id

It's weird that only the first fragment file seems to contain any cells for the 5 cell types.

I think it's possible that ~140 million lines into the 5th fragment file there is an issue with one of the lines

@timoast

Thank you very much! We re-ran cellranger atac, and obtain the corrected fargment.tsv.gz files. This problems resolved!

Best wishes!

stuart-lab / signac