sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0

Encountering unpaired read upon analyzing SmartSeq2 dataset in NCBI SRA #313

Closed kbattenb closed 2 years ago

kbattenb commented 2 years ago

Hi zUMIs folks,

I am trying to troubleshoot a situation that I have run into and would like some help. I apologize in advance for the long entry.

Input conditions (here is what I have at hand)

zUMIs version:

zUMIs command to execute:

/home/zUMIs/zUMIs.sh -c -y /home/workbench/config_files/config_for_SmartSeq2.yaml

Input data:

Reference:

zUMIs configuration file:

Output (Here is what I get for standard output)

# /home/zUMIs/zUMIs.sh -c -y /home/workbench/config_files/config_for_SmartSeq2.yaml
Using miniconda environment for zUMIs!
note: internal executables will be used instead of those specified in the YAML file!

You provided these parameters:
YAML file: /home/workbench/config_files/config_for_SmartSeq2.yaml
zUMIs directory: /home/zUMIs
STAR executable STAR
samtools executable samtools
pigz executable pigz
Rscript executable Rscript
RAM limit: null
zUMIs version 2.9.7

Sat Apr 9 22:26:51 JST 2022 WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.4a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a. Filtering... Sun Apr 10 02:14:14 JST 2022 [1] "3752 barcodes detected." [1] "5699177 reads were assigned to barcodes that do not correspond to intact cells." Mapping... [1] "2022-04-10 02:14:35 JST" STAR --readFilesCommand samtools view -@ 1 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 1 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ae.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2af.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.3. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 10 02:14:40 ..... started STAR run Apr 10 02:14:41 ..... 
loading genome STAR --readFilesCommand samtools view -@ 1 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 1 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2aa.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ab.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.1. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 10 02:14:40 ..... started STAR run Apr 10 02:14:41 ..... loading genome STAR --readFilesCommand samtools view -@ 1 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 1 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ac.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ad.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.2. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 10 02:14:40 ..... started STAR run Apr 10 02:14:41 ..... 
loading genome STAR --readFilesCommand samtools view -@ 1 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 1 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ag.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ah.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.4. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 10 02:14:40 ..... started STAR run Apr 10 02:14:42 ..... loading genome Apr 10 02:25:21 ..... processing annotations GTF Apr 10 02:28:09 ..... inserting junctions into the genome indices Apr 10 02:39:54 ..... started 1st pass mapping

ReadAlignChunk_processChunks.cpp:55:processChunks EXITING because of FATAL ERROR in input BAM file: the consecutive lines in paired-end BAM have different read IDs: SRR9621775.283074845 vs

SOLUTION: fix BAM file formatting. Paired-end reads should be always consecutive lines, with exactly 2 lines per paired-end read
Apr 10 02:39:55 ...... FATAL ERROR, exiting
[main_cat] ERROR: input is not BAM or CRAM
[main_cat] ERROR: input is not BAM or CRAM
Sun Apr 10 02:43:50 JST 2022
Counting...
[1] "2022-04-10 02:44:01 JST"
$project [1] "MouseSmartSeq2"

$sequence_files $sequence_files$file1 $sequence_files$file1$name [1] "/home/workbench/Raw_data/SmartSeq2_S1_L001_R1_001.fastq.gz"

$sequence_files$file1$base_definition [1] "cDNA(1-76)"

$sequence_files$file2 $sequence_files$file2$name [1] "/home/workbench/Raw_data/SmartSeq2_S1_L001_R2_001.fastq.gz"

$sequence_files$file2$base_definition [1] "cDNA(1-76)"

$sequence_files$file3 $sequence_files$file3$name [1] "/home/workbench/Raw_data/SmartSeq2_S1_L001_I1_001.fastq.gz"

$sequence_files$file3$base_definition [1] "BC(1-8)"

$sequence_files$file4 $sequence_files$file4$name [1] "/home/workbench/Raw_data/SmartSeq2_S1_L001_I2_001.fastq.gz"

$sequence_files$file4$base_definition [1] "BC(1-8)"

$reference $reference$STAR_index [1] "/home/workbench/Reference/Mouse"

$reference$GTF_file [1] "/home/workbench/Reference/Mus_musculus.GRCm39.105.gtf"

$reference$exon_extension [1] FALSE

$reference$extension_length [1] 0

$reference$scaffold_length_min [1] 0

$out_dir [1] "/home/workbench/OUTPUT"

$num_threads [1] 7

$mem_limit [1] 100

$filter_cutoffs $filter_cutoffs$BC_filter $filter_cutoffs$BC_filter$num_bases [1] 1

$filter_cutoffs$BC_filter$phred [1] 20

$filter_cutoffs$UMI_filter $filter_cutoffs$UMI_filter$num_bases [1] 1

$filter_cutoffs$UMI_filter$phred [1] 20

$barcodes $barcodes$barcode_num NULL

$barcodes$automatic [1] FALSE

$barcodes$BarcodeBinning [1] 0

$barcodes$nReadsperCell [1] 1

$barcodes$demultiplex [1] FALSE

$counting_opts $counting_opts$introns [1] TRUE

$counting_opts$downsampling [1] "0"

$counting_opts$strand [1] 0

$counting_opts$Ham_Dist [1] 0

$counting_opts$velocyto [1] FALSE

$counting_opts$primaryHit [1] TRUE

$counting_opts$twoPass [1] TRUE

$counting_opts$write_ham [1] FALSE

$counting_opts$multi_overlap [1] FALSE

$counting_opts$intronProb [1] FALSE

$make_stats [1] TRUE

$which_Stage [1] "Filtering"

$read_layout [1] "PE"

$zUMIs_directory [1] "/home/zUMIs"

$samtools_exec [1] "samtools"

$pigz_exec [1] "pigz"

$STAR_exec [1] "STAR"

$Rscript_exec [1] "Rscript"

[1] "4.5e+08 Reads per chunk"
[1] "Loading reference annotation from:"
[1] "/home/workbench/OUTPUT/MouseSmartSeq2.final_annot.gtf"
[E::hts_open_format] Failed to open file /home/workbench/OUTPUT/MouseSmartSeq2.filtered.tagged.Aligned.out.bam
samtools view: failed to open "/home/workbench/OUTPUT/MouseSmartSeq2.filtered.tagged.Aligned.out.bam" for reading: No such file or directory
[E::hts_open_format] Failed to open file /home/workbench/OUTPUT/MouseSmartSeq2.filtered.tagged.Aligned.out.bam
samtools view: failed to open "/home/workbench/OUTPUT/MouseSmartSeq2.filtered.tagged.Aligned.out.bam" for reading: No such file or directory
Error in gsub("SN:", "", chr) : object 'chr' not found
Calls: .makeSAF ... .chromLengthFilter -> [ -> [.data.table -> eval -> eval -> gsub
In addition: Warning message:
In data.table::fread(bread, col.names = c("chr", "len"), header = F) : File '/tmp/RtmpfjPrwR/filed177cb9fbbe' has size 0. Returning a NULL data.table.
Execution halted
Sun Apr 10 02:44:17 JST 2022
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") : cannot open compressed file '/home/workbench/OUTPUT/zUMIs_output/expression/MouseSmartSeq2.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Sun Apr 10 02:44:19 JST 2022
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2022-04-10 02:44:19 JST"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") : cannot open compressed file '/home/workbench/OUTPUT/zUMIs_output/expression/MouseSmartSeq2.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Sun Apr 10 02:44:23 JST 2022

Bug description

Apparently, a specific read (SRR9621775.283074845) is not paired, and as a result the "expression" folder in the output is entirely empty. I have tried this three times and the failure is reproducible. I checked the input FASTQ files to see whether something is wrong with them, but this does not appear to be the case:

# zcat SmartSeq2_S1_L001_R1_001.fastq.gz | wc -l
1981700232
(The same for the other 3 files)

# zcat SmartSeq2_S1_L001_R1_001.fastq.gz | head -n 1132299380 | tail -n 4
@SRR9621775.283074845 D00224L:270:CCU0CANXX:7:1305:13112:64065 length=76
TGCTAAGATTTTGCGTAGCTGGGTTTGGTTTAATCCACCTCAACTGCCTGCTATGATGGATAAGATTGAGAGAGTG
+SRR9621775.283074845 D00224L:270:CCU0CANXX:7:1305:13112:64065 length=76
0<BBBBF0FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFF
(Matching for the other 3 files)

# zcat SmartSeq2_S1_L001_R1_001.fastq.gz | tail -n 4
@SRR9621775.495425058 D00224L:270:CCU0CANXX:8:2316:21278:101425 length=76
GTGGTATCAACGCAGAGTACGGGAAGCAGTGGTATCAACGCAGAGTACGGGAAGCAGTGGTATCAACGCAGAGTAC
+SRR9621775.495425058 D00224L:270:CCU0CANXX:8:2316:21278:101425 length=76
<<BB<<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FBFF0BFFBFBFFFFFB0
(Matching for the other 3 files)
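The spot checks above compare the line count, one record near the failure, and the last record. A minimal sketch of an exhaustive check follows, assuming plain 4-line FASTQ records and that mates must share the ID that precedes the first whitespace. `read_ids` and `first_mismatch` are hypothetical helper names, not part of zUMIs.

```python
import gzip
from itertools import zip_longest


def read_ids(path):
    """Yield the read ID (text between '@' and the first whitespace) of each record.

    Assumes every record spans exactly 4 lines (no wrapped sequences).
    """
    with gzip.open(path, "rt") as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:  # header line of each 4-line record
                fields = line[1:].split()
                yield fields[0] if fields else ""


def first_mismatch(paths):
    """Return (record_number, ids) for the first record whose IDs differ, else None.

    record_number is 1-based. A None inside ids means that file ended early.
    """
    streams = [read_ids(p) for p in paths]
    for record_no, ids in enumerate(zip_longest(*streams), start=1):
        if len(set(ids)) != 1:
            return record_no, ids
    return None


# Example call (file names taken from the configuration in this thread):
# first_mismatch([
#     "SmartSeq2_S1_L001_R1_001.fastq.gz",
#     "SmartSeq2_S1_L001_R2_001.fastq.gz",
#     "SmartSeq2_S1_L001_I1_001.fastq.gz",
#     "SmartSeq2_S1_L001_I2_001.fastq.gz",
# ])
```

The files are streamed in lockstep, so memory use stays flat even for inputs of this size, although a full pass over roughly 495 million records will take a while.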


I was not sure if the issue was with zUMIs or with the specific file in SRA, but I could not identify any obvious problems. Any suggestion would be very much appreciated.

All the best,

Kai Battenberg

cziegenhain commented 2 years ago

Hey Kai,

Thanks for having such a nice and complete error description, much appreciated!

My initial gut feeling for this issue: the SRA read IDs have been problematic before. Could you try running fastq-dump as described here: https://github.com/sdparekh/zUMIs/wiki/Reprocessing-of-public-data
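For illustration only: the wiki page linked above is the authoritative recipe, and the sketch below is not taken from it. Assuming the trouble comes from the descriptive text that SRA appends after the read ID, the headers of already-downloaded files could be reduced to the bare ID along these lines. `simplify_record` and `simplify_fastq` are hypothetical names, and 4-line FASTQ records are assumed.

```python
import gzip


def simplify_record(header, sequence, plus, quality):
    """Reduce a FASTQ record to a bare read ID and a bare '+' separator."""
    fields = header[1:].split() if header.startswith("@") else []
    if not fields:
        raise ValueError("not a usable FASTQ header: %r" % header)
    return "@" + fields[0], sequence, "+", quality


def simplify_fastq(in_path, out_path):
    """Rewrite a gzipped FASTQ file with simplified headers; return the record count."""
    count = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        while True:
            record = [src.readline().rstrip("\n") for _ in range(4)]
            if not record[0]:
                break  # end of file reached
            dst.write("\n".join(simplify_record(*record)) + "\n")
            count += 1
    return count
```

Re-downloading with the documented fastq-dump options remains the safer route, since it avoids a second pass over files of this size.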

Best, Christoph

kbattenb commented 2 years ago

Hi Christoph,

Thank you for your suggestion. Apparently it's not the first time SRA has caused issues and I should have looked into that.

I will re-download these files as per your suggestion and let you know if the situation improves.

Thank you again.

All the best,

Kai Battenberg

kbattenb commented 2 years ago

Hi Christoph,

I tried your suggestion and I believe it got further in the process, but it still did not complete.

Changes

Input data:

The options of the command used to download the data were changed

As a result (as suggested), each read in the FASTQ file was changed

Output (Here is what I get for standard output)

# /home/zUMIs/zUMIs.sh -c -y /home/workbench/config_files/config_for_SmartSeq2.yaml
Using miniconda environment for zUMIs!
note: internal executables will be used instead of those specified in the YAML file!

You provided these parameters:
YAML file: /home/workbench/config_files/config_for_SmartSeq2.yaml
zUMIs directory: /home/zUMIs
STAR executable STAR
samtools executable samtools
pigz executable pigz
Rscript executable Rscript
RAM limit: null
zUMIs version 2.9.7

Fri Apr 15 08:08:56 JST 2022 WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.4a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a. Filtering... Fri Apr 15 11:29:58 JST 2022 [1] "3752 barcodes detected." [1] "5699177 reads were assigned to barcodes that do not correspond to intact cells." Mapping... [1] "2022-04-15 11:30:32 JST" STAR --readFilesCommand samtools view -@ 2 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 2 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2aj.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ak.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.4. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 15 11:30:51 ..... started STAR run Apr 15 11:30:51 ..... 
loading genome STAR --readFilesCommand samtools view -@ 2 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 2 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2aa.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ab.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ac.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.1. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 15 11:30:51 ..... started STAR run Apr 15 11:30:51 ..... loading genome STAR --readFilesCommand samtools view -@ 2 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 2 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ad.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ae.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2af.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.2. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 15 11:30:51 ..... started STAR run Apr 15 11:30:51 ..... 
loading genome STAR --readFilesCommand samtools view -@ 2 --outSAMmultNmax 1 --outFilterMultimapNmax 50 --outSAMunmapped Within --outSAMtype BAM Unsorted --quantMode TranscriptomeSAM --genomeDir /home/workbench/Reference/Mouse --sjdbGTFfile /home/workbench/Reference/Mus_musculus.GRCm39.105.gtf --runThreadN 2 --readFilesType SAM PE --genomeSAindexNbases 11 --limitOutSJcollapsed 5000000 --twopassMode Basic --readFilesIn /home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ag.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ah.filtered.tagged.bam,/home/workbench/OUTPUT/zUMIs_output/.tmpMerge//MouseSmartSeq2.MouseSmartSeq2ai.filtered.tagged.bam --outFileNamePrefix /home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.3. STAR version: 2.7.9a compiled: 2021-07-01T11:54:56+09:00 a524ed1d99de:/home/STAR-2.7.9a/source Apr 15 11:30:51 ..... started STAR run Apr 15 11:30:51 ..... loading genome Apr 15 12:16:48 ..... processing annotations GTF Apr 15 12:16:48 ..... processing annotations GTF Apr 15 12:29:14 ..... inserting junctions into the genome indices Apr 15 12:29:14 ..... inserting junctions into the genome indices Apr 15 13:21:33 ..... started 1st pass mapping Apr 15 13:21:33 ..... started 1st pass mapping Apr 16 02:38:03 ..... finished 1st pass mapping Apr 16 02:38:09 ..... inserting junctions into the genome indices Apr 16 09:38:51 ..... started mapping Apr 16 12:52:00 ..... finished 1st pass mapping Apr 16 12:56:46 ..... inserting junctions into the genome indices Apr 16 16:05:16 ..... started mapping Apr 17 01:38:28 ..... finished mapping Apr 17 01:38:32 ..... finished successfully Apr 17 07:48:19 ..... finished mapping Apr 17 07:48:23 ..... finished successfully [W::bam_hdr_read] bgzf_check_EOF: Invalid argument [E::bam_hdr_read] Invalid BAM binary header [bam_cat] ERROR: couldn't read header for '/home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.2.Aligned.out.bam'. 
[W::bam_hdr_read] bgzf_check_EOF: Invalid argument [E::bam_hdr_read] Invalid BAM binary header [bam_cat] ERROR: couldn't read header for '/home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.2.Aligned.toTranscriptome.out.bam'. Sun Apr 17 07:52:39 JST 2022 Counting... [1] "2022-04-17 07:53:02 JST" $project [1] "MouseSmartSeq2"

$sequence_files $sequence_files$file1 $sequence_files$file1$name [1] "/home/workbench/fastq/SmartSeq2_S1_L001_R1_001.fastq.gz"

$sequence_files$file1$base_definition [1] "cDNA(1-76)"

$sequence_files$file2 $sequence_files$file2$name [1] "/home/workbench/fastq/SmartSeq2_S1_L001_R2_001.fastq.gz"

$sequence_files$file2$base_definition [1] "cDNA(1-76)"

$sequence_files$file3 $sequence_files$file3$name [1] "/home/workbench/fastq/SmartSeq2_S1_L001_I1_001.fastq.gz"

$sequence_files$file3$base_definition [1] "BC(1-8)"

$sequence_files$file4 $sequence_files$file4$name [1] "/home/workbench/fastq/SmartSeq2_S1_L001_I2_001.fastq.gz"

$sequence_files$file4$base_definition [1] "BC(1-8)"

$reference $reference$STAR_index [1] "/home/workbench/Reference/Mouse"

$reference$GTF_file [1] "/home/workbench/Reference/Mus_musculus.GRCm39.105.gtf"

$reference$exon_extension [1] FALSE

$reference$extension_length [1] 0

$reference$scaffold_length_min [1] 0

$out_dir [1] "/home/workbench/OUTPUT"

$num_threads [1] 10

$mem_limit [1] 100

$filter_cutoffs $filter_cutoffs$BC_filter $filter_cutoffs$BC_filter$num_bases [1] 1

$filter_cutoffs$BC_filter$phred [1] 20

$filter_cutoffs$UMI_filter $filter_cutoffs$UMI_filter$num_bases [1] 1

$filter_cutoffs$UMI_filter$phred [1] 20

$barcodes $barcodes$barcode_num NULL

$barcodes$automatic [1] FALSE

$barcodes$BarcodeBinning [1] 0

$barcodes$nReadsperCell [1] 1

$barcodes$demultiplex [1] FALSE

$counting_opts $counting_opts$introns [1] TRUE

$counting_opts$downsampling [1] "0"

$counting_opts$strand [1] 0

$counting_opts$Ham_Dist [1] 0

$counting_opts$velocyto [1] FALSE

$counting_opts$primaryHit [1] TRUE

$counting_opts$twoPass [1] TRUE

$counting_opts$write_ham [1] FALSE

$counting_opts$multi_overlap [1] FALSE

$counting_opts$intronProb [1] FALSE

$make_stats [1] TRUE

$which_Stage [1] "Filtering"

$read_layout [1] "PE"

$zUMIs_directory [1] "/home/zUMIs"

$samtools_exec [1] "samtools"

$pigz_exec [1] "pigz"

$STAR_exec [1] "STAR"

$Rscript_exec [1] "Rscript"

[1] "4.5e+08 Reads per chunk" [1] "Loading reference annotation from:" [1] "/home/workbench/OUTPUT/MouseSmartSeq2.final_annot.gtf" [1] "Annotation loaded!" Warning message: as_quosure() requires an explicit environment as of rlang 0.3.0. Please supply env. This warning is displayed once per session. [1] "Assigning reads to features (ex)"

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   Rsubread 1.32.4
//========================== featureCounts setting ===========================\
Input files : 1 BAM file
P MouseSmartSeq2.filtered.tagged.Aligned.out ...
Annotation : R data.frame
Assignment details : .featureCounts.bam
(Note that files are saved to the output directory)
Dir for temp files : .
Threads : 10
Level : meta-feature level
Paired-end : yes
Multimapping reads : counted
Multiple alignments : primary alignment only
Multi-overlapping reads : not counted
Min overlapping bases : 1
Chimeric reads : not counted
Both ends mapped : not required

\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\
Load annotation file .Rsubread_UserProvidedAnnotation_pid2584 ...
Features : 291510
Meta-features : 55414
Chromosomes/contigs : 39
Process BAM file MouseSmartSeq2.filtered.tagged.Aligned.out.bam...
Paired-end reads are included.
Assign alignments (paired-end) to features...
Total alignments : 74607609
Successfully assigned alignments : 17715334 (23.7%)
Running time : 3.44 minutes

\===================== http://subread.sourceforge.net/ ======================//

[1] "Assigning reads to features (in)"

    ==========     _____ _    _ ____  _____  ______          _____  
    =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
      =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
        ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
          ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
    ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
   Rsubread 1.32.4
//========================== featureCounts setting ===========================\
Input files : 1 BAM file
P MouseSmartSeq2.filtered.tagged.Aligned.out ...
Annotation : R data.frame
Assignment details : .featureCounts.bam
(Note that files are saved to the output directory)
Dir for temp files : .
Threads : 10
Level : meta-feature level
Paired-end : yes
Multimapping reads : counted
Multiple alignments : primary alignment only
Multi-overlapping reads : not counted
Min overlapping bases : 1
Chimeric reads : not counted
Both ends mapped : not required

\===================== http://subread.sourceforge.net/ ======================//

//================================= Running ==================================\
Load annotation file .Rsubread_UserProvidedAnnotation_pid2584 ...
Features : 220154
Meta-features : 28763
Chromosomes/contigs : 32
Process BAM file MouseSmartSeq2.filtered.tagged.Aligned.out.bam.ex.fea ...
Paired-end reads are included.
Assign alignments (paired-end) to features...
Total alignments : 74607609
Successfully assigned alignments : 1688384 (2.3%)
Running time : 3.37 minutes

\===================== http://subread.sourceforge.net/ ======================//

[1] "2022-04-17 08:01:55 JST"
[1] "Coordinate sorting final bam file..."
samtools sort: couldn't allocate memory for bam_mem
[E::hts_open_format] Failed to open file /home/workbench/OUTPUT/MouseSmartSeq2.filtered.Aligned.GeneTagged.sorted.bam
samtools index: failed to open "/home/workbench/OUTPUT/MouseSmartSeq2.filtered.Aligned.GeneTagged.sorted.bam": No such file or directory
[1] "2022-04-17 08:01:57 JST"
[1] "Here are the detected subsampling options:"
[1] "Automatic downsampling"
[1] "Working on barcode chunk 1 out of 1"
[1] "Processing 3752 barcodes in this chunk..."
[1] "/home/workbench/OUTPUT/MouseSmartSeq2.filtered.Aligned.GeneTagged.sorted.bam"
Error in value[3L] : failed to open BamFile: file(s) do not exist: '/home/workbench/OUTPUT/MouseSmartSeq2.filtered.Aligned.GeneTagged.sorted.bam'
Calls: reads2genes_new ... tryCatch -> tryCatchList -> tryCatchOne ->
Execution halted
Sun Apr 17 08:01:58 JST 2022
Loading required package: yaml
Loading required package: Matrix
[1] "loomR found"
Error in gzfile(file, "rb") : cannot open the connection
Calls: rds_to_loom -> readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") : cannot open compressed file '/home/workbench/OUTPUT/zUMIs_output/expression/MouseSmartSeq2.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Sun Apr 17 08:02:04 JST 2022
Descriptive statistics...
[1] "I am loading useful packages for plotting..."
[1] "2022-04-17 08:02:04 JST"
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") : cannot open compressed file '/home/workbench/OUTPUT/zUMIs_output/expression/MouseSmartSeq2.dgecounts.rds', probable reason 'No such file or directory'
Execution halted
Sun Apr 17 08:02:11 JST 2022

Bug description

This still results in an empty "expression" folder.

When I looked up the following error message:

[W::bam_hdr_read] bgzf_check_EOF: Invalid argument
[E::bam_hdr_read] Invalid BAM binary header

I found a thread suggesting that this may be caused by running out of memory (https://github.com/alexdobin/STAR/issues/997), but my output did not indicate a segmentation fault.

Should I set a fixed value for "mem_limit" instead of the current "null"? Please let me know what I can try.
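One way to test whether a temporary BAM file was cut short (for example by a process that was killed for lack of memory) is to look for the 28-byte empty BGZF block that the SAM/BAM specification places at the end of a complete file, which is what `bgzf_check_EOF` looks for. A rough sketch follows; `bam_integrity` is a hypothetical helper, and passing these checks does not prove that the file is valid in every other respect.

```python
import gzip
import zlib

# Empty BGZF block that terminates a complete BAM file (SAM/BAM specification).
BGZF_EOF = bytes.fromhex(
    "1f 8b 08 04 00 00 00 00 00 ff 06 00 42 43 02 00"
    " 1b 00 03 00 00 00 00 00 00 00 00 00"
)


def bam_integrity(path):
    """Run three cheap checks on a BAM file and return them as a dict of booleans."""
    with open(path, "rb") as fh:
        gzip_magic = fh.read(2) == b"\x1f\x8b"      # file starts like gzip/BGZF
        fh.seek(0, 2)                               # jump to the end of the file
        size = fh.tell()
        fh.seek(max(size - len(BGZF_EOF), 0))
        eof_block = fh.read() == BGZF_EOF           # file ends with the EOF block

    bam_magic = False
    if gzip_magic:
        try:
            with gzip.open(path, "rb") as fh:
                bam_magic = fh.read(4) == b"BAM\x01"  # decompressed BAM magic
        except (OSError, EOFError, zlib.error):
            bam_magic = False

    return {"gzip_magic": gzip_magic, "bam_magic": bam_magic, "eof_block": eof_block}


# Example call (path copied from the error message above):
# bam_integrity("/home/workbench/OUTPUT/zUMIs_output/.tmpMap//tmp.MouseSmartSeq2.2.Aligned.out.bam")
```

A file for which `eof_block` is False was most likely not closed properly by the program that wrote it, which would be consistent with a mapping job that ended abnormally.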

Thank you.

Kai Battenberg

kbattenb commented 2 years ago

Hi Christoph,

Great news! Apparently the issue was not with zUMIs but with how a Windows computer shares its memory with a Docker container. I repeated the same process on a CentOS computer and there was no issue whatsoever.
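Since the cause turned out to be the memory available inside the container, a quick check before launching a long run may save time. The sketch below is meant to be run inside the container and parses `/proc/meminfo`. The 32 GiB threshold in the usage comment is an illustrative placeholder, not a documented zUMIs or STAR requirement, and the function names are hypothetical.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def enough_memory(meminfo_text, required_gib):
    """Return (ok, total_gib): whether MemTotal reaches required_gib, and its value."""
    total_gib = parse_meminfo(meminfo_text)["MemTotal"] / 1024 ** 2
    return total_gib >= required_gib, total_gib


# Usage inside the container (threshold is an assumed example value):
# with open("/proc/meminfo") as fh:
#     ok, total = enough_memory(fh.read(), required_gib=32)
# print("memory visible to the container: %.1f GiB, sufficient: %s" % (total, ok))
```

If the reported total is far below the RAM of the host, the limit is being imposed by the container or virtual machine settings rather than by the pipeline.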

So the only problem I was having was with the headers!

Thank you very much for your help!

Kai Battenberg