sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
275 stars 68 forks source link

BUG: zUMIs from demuxed files fails during SAMtools #259

Closed niherndon closed 3 years ago

niherndon commented 3 years ago

Describe the bug

To Reproduce Provide the YAML file and full standard output. YAML: project: XD_210511_demux sequence_files: file1: name: /data/phoebe/NICO/extra/XD210510/fastqs/reads_for_zUMIs.R1.fastq.gz base_definition:

out_dir: /data/phoebe/NICO/extra/XD210510/zUMIs num_threads: 12 mem_limit: 50 filter_cutoffs: BC_filter: num_bases: 3 phred: 20 UMI_filter: num_bases: 2 phred: 20 barcodes: barcode_num: ~ barcode_file: /data/phoebe/NICO/extra/XD210510/fastqs/reads_for_zUMIs.expected_barcodes.txt automatic: no BarcodeBinning: 1 nReadsperCell: 100 demultiplex: yes counting_opts: introns: yes downsampling: '0' strand: 0 Ham_Dist: 1 write_ham: no velocyto: no primaryHit: yes twoPass: no make_stats: yes which_Stage: Filtering read_layout: SE zUMIs_directory: /home/niherndon/bin/zUMIs samtools_exec: samtools pigz_exec: pigz STAR_exec: STAR Rscript_exec: Rscript

STDOUT ~/bin/zUMIs/zUMIs.sh -c -y 210510.yaml

Good news! A newer version of zUMIs is available at https://github.com/sdparekh/zUMIs


Using miniconda environment for zUMIs! note: internal executables will be used instead of those specified in the YAML file!

You provided these parameters: YAML file: 210510.yaml zUMIs directory: /home/niherndon/bin/zUMIs STAR executable STAR samtools executable samtools pigz executable pigz Rscript executable Rscript RAM limit: 50 zUMIs version 2.9.5

Tue May 11 13:48:20 EDT 2021 WARNING: The STAR version used for mapping is 2.7.3a and the STAR index was created using the version 2.7.1a. This may lead to an error while mapping. If you encounter any errors at the mapping stage, please make sure to create the STAR index using STAR 2.7.3a. Filtering...

[W::sam_parse1] empty query name [W::sam_read1] Parse error at line 59 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 104 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 107 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 239 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 236 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 86 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 275 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 1013 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 1438 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 2312 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 2810 [main_samview] truncated file. cat: /data/phoebe/NICO/extra/XD210510/zUMIs/zUMIs_output/.tmpMerge//XD_210511_demux.*.BCstats.txt: No such file or directory Tue May 11 13:53:14 EDT 2021 Error in eval(bysub, parent.frame(), parent.frame()) : object 'XC' not found Calls: cellBC -> [ -> [.data.table -> eval -> eval In addition: Warning message: In data.table::fread(bccount_file, header = FALSE, col.names = c("XC", : File '/data/phoebe/NICO/extra/XD210510/zUMIs/XD_210511_demux.BCstats.txt' has size 0. Returning a NULL data.table. Execution halted Mapping... [1] "2021-05-11 13:53:19 EDT" [E::hts_open_format] Failed to open file NA samtools view: failed to open "NA" for reading: No such file or directory [E::hts_open_format] Failed to open file NA samtools view: failed to open "NA" for reading: No such file or directory May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:19 ..... started STAR run May 11 13:53:19 ..... loading genome May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:42 ..... processing annotations GTF May 11 13:53:43 ..... inserting junctions into the genome indices May 11 13:53:43 ..... inserting junctions into the genome indices May 11 13:53:43 ..... inserting junctions into the genome indices May 11 13:53:43 ..... inserting junctions into the genome indices May 11 13:53:43 ..... inserting junctions into the genome indices May 11 13:53:43 ..... inserting junctions into the genome indices May 11 13:53:43 ..... inserting junctions into the genome indices May 11 13:53:44 ..... inserting junctions into the genome indices May 11 13:53:44 ..... inserting junctions into the genome indices May 11 13:53:44 ..... inserting junctions into the genome indices May 11 13:53:44 ..... inserting junctions into the genome indices May 11 13:53:44 ..... inserting junctions into the genome indices May 11 13:53:56 ..... started mapping May 11 13:53:56 ..... started mapping May 11 13:53:57 ..... finished mapping May 11 13:53:57 ..... started mapping May 11 13:53:57 ..... started mapping May 11 13:53:57 ..... finished mapping May 11 13:53:57 ..... finished mapping May 11 13:53:57 ..... finished mapping May 11 13:53:57 ..... finished successfully May 11 13:53:57 ..... started mapping May 11 13:53:57 ..... started mapping May 11 13:53:57 ..... finished mapping May 11 13:53:57 ..... finished mapping May 11 13:53:57 ..... finished successfully May 11 13:53:58 ..... finished successfully May 11 13:53:58 ..... finished successfully May 11 13:53:58 ..... finished successfully May 11 13:53:58 ..... started mapping May 11 13:53:58 ..... finished successfully May 11 13:53:58 ..... started mapping May 11 13:53:58 ..... finished mapping May 11 13:53:58 ..... finished mapping May 11 13:53:58 ..... started mapping May 11 13:53:58 ..... started mapping May 11 13:53:58 ..... started mapping May 11 13:53:58 ..... started mapping May 11 13:53:58 ..... finished mapping May 11 13:53:58 ..... finished mapping May 11 13:53:59 ..... finished successfully May 11 13:53:59 ..... finished successfully May 11 13:53:59 ..... finished mapping May 11 13:53:59 ..... finished mapping May 11 13:53:59 ..... finished successfully May 11 13:53:59 ..... finished successfully May 11 13:54:00 ..... finished successfully May 11 13:54:00 ..... finished successfully Tue May 11 13:54:00 EDT 2021 Counting... [1] "2021-05-11 13:54:10 EDT" Error in fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, "kept_barcodes_binned.txt")) : File '/data/phoebe/NICO/extra/XD210510/zUMIs/zUMIs_output/XD_210511_demuxkept_barcodes_binned.txt' does not exist or is non-readable. getwd()=='/data/phoebe/NICO/extra/XD210510/zUMIs' Execution halted Tue May 11 13:54:10 EDT 2021 Loading required package: yaml Loading required package: Matrix [1] "loomR found" Error in gzfile(file, "rb") : cannot open the connection Calls: rds_to_loom -> readRDS -> gzfile In addition: Warning message: In gzfile(file, "rb") : cannot open compressed file '/data/phoebe/NICO/extra/XD210510/zUMIs/zUMIs_output/expression/XD_210511_demux.dgecounts.rds', probable reason 'No such file or directory' Execution halted Tue May 11 13:54:13 EDT 2021 Descriptive statistics... [1] "I am loading useful packages for plotting..." [1] "2021-05-11 13:54:13 EDT" Error in data.table::fread(paste0(opt$out_dir, "/zUMIs_output/", opt$project, : File '/data/phoebe/NICO/extra/XD210510/zUMIs/zUMIs_output/XD_210511_demuxkept_barcodes.txt' does not exist or is non-readable. getwd()=='/data/phoebe/NICO/extra/XD210510/zUMIs' Execution halted Tue May 11 13:54:19 EDT 2021

Desktop (please complete the following information):

Additional context I've been able to run zUMIs successfully multiple times on fastq files directly pulled from Illumina BCL files. Never been able to run it with y'all's approach for re-combining demultiplexed fastq files. Any help is appreciated!

cziegenhain commented 3 years ago

How do your input files look before and after running the script to combine the demultiplexed files? I'm suspecting there is something wrong at this stage already.

cziegenhain commented 3 years ago

Also unrelated to the issue - it looks like this is smartseq3 data? In that case was there a reason why you don't use the cDNA piece from read 1?

niherndon commented 3 years ago

I'm not exactly sure what you mean by "how they look", but here's the head for the three input files:

zcat reads_for_zUMIs.R1.fastq.gz | head -n 20 @NB500912:333:HGF2FBGXJ:1:11104:12803:1066 1:N:0:ACTGAGCG+NTTAGACG CACNTATTTTATACATGTATGTATG + AAA#//AEEE/AEE/EEE/A/EEEE @NB500912:333:HGF2FBGXJ:1:11104:2812:1066 1:N:0:ACTGAGCG+NTTAGACG CGTNAACCTCCTTCAATGTGACTTG + A//#A/AAEEEEEEEEEEEEEAEEE @NB500912:333:HGF2FBGXJ:1:11104:6785:1068 1:N:0:ACGGAGCG+NTTAGACG CGTNAACGCTGCACTTGATAATTA + AAA#A/EAE6/A//EEE//EAAE/ @NB500912:333:HGF2FBGXJ:1:11104:7370:1069 1:N:0:ACTGAGCG+NTTAGACG AACNAGTACTTTGATATCTGTAAAA + A/A#A/E/A/EE66EEEEEEEAEEE @NB500912:333:HGF2FBGXJ:1:11104:2979:1070 1:N:0:ACTGAGCG+NTTAGACG CTTNTATATATTAGCCATCGCAAAG + AAA#AEE/E6EEEEEEEEEEEEAEE

zcat reads_for_zUMIs.R2.fastq.gz | head -n 20 @NB500912:333:HGF2FBGXJ:1:11104:12803:1066 2:N:0:ACTGAGCG+NTTAGACG AAATAAAAGAATGCTTCTTTGTGTATTTAATAGTTTATACATCTGAAAG + AAAAAEEEEE6EAAEEEE/AEEAEEEEEEEEEEEEAEEEE/EEEEEEEA @NB500912:333:HGF2FBGXJ:1:11104:2812:1066 2:N:0:ACTGAGCG+NTTAGACG ATTGTGTATACCGATCGATCGATAGATAGATATAGATAATTATTGTATGG + AAAAAEAAEAEEEEAEEEEEEEEEE//EA//EEAE/AEEEEE6EEEEEEE @NB500912:333:HGF2FBGXJ:1:11104:6785:1068 2:N:0:ACGGAGCG+NTTAGACG ACTTATTATATAGTTTAGAGGACACGATCGTAGGAGCATACGAAAGCATA + AAAAAEEEEEEE/EE6AE/<E//EAA/EE/EEEEAEAE/EAAEEEEEEEE @NB500912:333:HGF2FBGXJ:1:11104:7370:1069 2:N:0:ACTGAGCG+NTTAGACG ATATATATATAATATGTACGTTCATATTCATGTTTGTATTGTATATTGTT + AAAAAEEE/EEEEEEEEEEEEEEEEEEEEEE6EE/EEEEEEEEEEA6AEE @NB500912:333:HGF2FBGXJ:1:11104:2979:1070 2:N:0:ACTGAGCG+NTTAGACG CAGCTATGGGCCGTCGGCATCGCCGCATGCCGCCGCCGCCAACACGGCC + AAAAAEEEEEAEEEEAEEE//EAEEEEEAEAEEAEEEEEEAAEEEEEEE

zcat reads_for_zUMIs.index.fastq.gz | head -n 20 @NB500912:333:HGF2FBGXJ:1:11104:12803:1066 1:N:0:ACTGAGCG+NTTAGACG NDZBBCSJ + AAAAAAAA @NB500912:333:HGF2FBGXJ:1:11104:2812:1066 1:N:0:ACTGAGCG+NTTAGACG NDZBBCSJ + AAAAAAAA @NB500912:333:HGF2FBGXJ:1:11104:6785:1068 1:N:0:ACGGAGCG+NTTAGACG NDZBBCSJ + AAAAAAAA @NB500912:333:HGF2FBGXJ:1:11104:7370:1069 1:N:0:ACTGAGCG+NTTAGACG NDZBBCSJ + AAAAAAAA @NB500912:333:HGF2FBGXJ:1:11104:2979:1070 1:N:0:ACTGAGCG+NTTAGACG NDZBBCSJ + AAAAAAAA

and yes, this is smart-seq3. We use short read sequencing, only 75 read cycles. I know 75-80% of the R1 fragments are cDNA but I was under the impression that you couldn't have the program look for UMI and cDNA in the same region. I see good mapping rates just based on the 50bp R2 in most situations, but if there's a way around it I'm happy to hear.

On Tue, May 11, 2021 at 3:42 PM cziegenhain @.***> wrote:

Also unrelated to the issue - it looks like this is smartseq3 data? In that case was there a reason why you don't use the cDNA piece from read 1?

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/sdparekh/zUMIs/issues/259#issuecomment-839066103, or unsubscribe https://github.com/notifications/unsubscribe-auth/ARDFHW6DGMDWBJP32PDOBWDTNGCDHANCNFSM44WPE3YQ .

cziegenhain commented 3 years ago

Hi,

It looks like your reads are trimmed (most are not the 75PE you mention). I suspect that there are reads that don't have any sequence left (trimmed away completely). Those will produce the error you have pasted above, so you should filter them out.

Best, Christoph

niherndon commented 3 years ago

Swati, The nonsense index sequences are generated by merge_demultiplexed_fastq.R. Again, I've run smart-seq3 libraries fine, just not ones from merged fastq files. That's the use case I'm asking for help with.

Cristoph, The 75bp is split between a 25bp R1 and a 50bp R2. This read distribution works fine when working with fastq files generated directly from the BCL files. The script call is /opt/BCL2FASTQ2/2.20/bin/bcl2fastq --use-bases-mask Y25,I8,I8,Y50 --no-lane-splitting --create-fastq-for-index-reads -R [nextseq run folder] I don't believe we're doing any trimming. I scrolled through the R2 and everything is between 47 and 49bp. I tried shortening the cDNA region to 45bp and got this: Filtering... [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 104 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 59 [main_samview] truncated file. [W::sam_parse1] [W::sam_parse1] empty query nameempty query name

[W::sam_read1] [W::sam_read1] Parse error at line 239Parse error at line 86

[main_samview] truncated file. [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 107 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] [W::sam_parse1] Parse error at line 236 empty query name[main_samview] truncated file.

[W::sam_read1] Parse error at line 275 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 1013 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 1438 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 2312 [main_samview] truncated file. [W::sam_parse1] empty query name [W::sam_read1] Parse error at line 2810 [main_samview] truncated file.

and then basically the same downstream errors.

On Wed, May 12, 2021 at 3:40 AM cziegenhain @.***> wrote:

Hi,

It looks like your reads are trimmed (most are not the 75PE you mention). I suspect that there are reads that don't have any sequence left (trimmed away completely). Those will produce the error you have pasted above, so you should filter them out.

Best, Christoph

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/sdparekh/zUMIs/issues/259#issuecomment-839541088, or unsubscribe https://github.com/notifications/unsubscribe-auth/ARDFHW32KWDWDD7I5P6VU2TTNIWG5ANCNFSM44WPE3YQ .

sdparekh commented 3 years ago

Hi, You need not answer when the comment was removed. I did realise that and tested the merge_demultiplexed_fastq.R AGAIN. We have extensively tested this script and never had any issues. This seems like a specific case with some other underlying issue. Can you please share randomly sampled subset of your reads. You may either subset the merged or several unmerged files.

It will be easier to address your concern if we test it ourselves.

niherndon commented 3 years ago

Sorry, I didn't realize the comment had been removed. I only saw the email, I didn't realize these corresponded to the github thread. Moving to this thread now. Here's a subsample (done with seqkit sample -n 1000 ): subsample_R2.fastq.gz subsample_index.fastq.gz subsample_R1.fastq.gz

cziegenhain commented 3 years ago

Hi,

I just ran fastqc on the files you sent. Out of the 1000 reads you meant to subsample, there are only 948 non-zero reads but the length is as little as 2 bp on either read. zUMIs works with trimmed files on the cDNA end, but you cannot have reads with 0 bases or that lack the barcode/UMI ranges.

On another note: you mention "75bp is split between a 25bp R1 and a 50bp R2". This setup is highly discouraged for Smart-seq3, why did you select it? 75bp SE will be better in this case.

Best, Christoph