sdparekh / zUMIs

zUMIs: A fast and flexible pipeline to process RNA sequencing data with UMIs
GNU General Public License v3.0
275 stars 68 forks source link

running merge_demultiplex has index reads without R1/R2, leading to "ERROR! Fastq files are not in the same order" #371

Closed leoforster closed 1 year ago

leoforster commented 1 year ago

Hello,

I am trying to run zUMIs to quantify UMI counts from a SmartSeq3 dataset from SRA. I fetch the SRA files using prefetch from sratools and convert to fastq keeping orignal readnames using --origfmt and --define-qual '+' in fastq-dump as suggested on the zUMIs wiki page. I compress the resulting fastq files with gzip to save space, then run merge_demultiplexed_fastq.R to enable quantification via zUMIs:

Rscript zUMIs/misc/merge_demultiplexed_fastq.R -d fastqs_for_zUMIs/ -p /usr/bin/pigz -t 22

to my surprise, when running the zUMIs pipeline (using the supplied conda environments; also with a reference created on STAR v2.7.3 to ensure compatibility) the preprocessing seems to fail, throwing errors as "ERROR! Fastq files are not in the same order. Make sure to provide reads in the same order.". At the same time the program doesn't terminate and seems to initiate a STAR mapping which hangs during a splice-junction insertion step forever (?).

Taking a closer look this error seems to be introduced by merge_demultiplexed_fastq.R, im just wondering why and how i can resolve it. The R1 and R2 files have the same number of reads, while the index file has a single additional read:

> zgrep -c $ reads_for_zUMIs.R1.fastq.gz
1329620363
>  zgrep -c $ reads_for_zUMIs.R2.fastq.gz
1329620363
> zgrep -c $ reads_for_zUMIs.index.fastq.gz
1329620364

grepping the readnames and running diff i am able to find the culprit:

> zgrep '@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784' reads_for_zUMIs.index.fastq.gz
@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784

> zgrep '@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784' reads_for_zUMIs.R1.fastq.gz
> zgrep '@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784' reads_for_zUMIs.R2.fastq.gz
# no hits

but this read is present in my original supplied R1 and R2 files:

> zgrep -A4 '^@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784' SRR19885001_1.fastq.gz
@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784
GGGTAGCTGCAGGGGGATTGAGCCCCACAATGGCTGACCTTAAAAGTACTGTGCCAGCAAGACTCACTCGGAGAAAATGTGCTGTCTCTTATACACATCTCCGAGCCCACCAGACAGGCAGAAATCGCGTATGCCGTCTCCTGCCCGAACC
+
AAAAFAA-F<FAJ77FAAAAF7<FF<AJAFJFFAF-FJJ--7<A<---<-<---7FJAA-<--<<<<--77-7-<FF----<F<-7-<A---FJ-A7F77FJ---AA7-A----7--777---7-<-77<7F777--7-----7-------

> zgrep -A4 '^@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784' SRR19885001_2.fastq.gz
@ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784
ATACACATCTGACGCTGCCGACGAGATTTCCAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAACCGCAAAGATGCCCCTGTACCTTATCCTCATAGCAATCCTCCGAGTAAGACGCCCCCGTCCATCTCGGCCGCCGCCGCCGCCCC
+
AAFFFJFJJJJJJJJJJ<JFJJJFJAAFFJJJJJJJJJFJFFJJJF7J-AFJJJJFFFA<FFJAFFJ77--7-<----7AFFJJ--7F7AA7-<-<-7---------7A----7---7-)7)-))-)))----7)-)))-)-)<-))))-)

so what has happened here? are reads removed in merge_demultiplex.R but not reflected in the newly generated index file? how can i fix this issue and facilitate a smooth zUMIs run?

Thanks in advance for your support.

Leo

cziegenhain commented 1 year ago

Hi,

There should not be any removal of reads! However I have had frequent issues with public data downloaded. Most likely, it may be that one of your several SRA files / input fastq files is corrupted / truncated which leads to some error or incomplete output. This could be during download or I have even seen uploaded files be corrupted directly in public repositories.

I would advise to carefully check the individual fastq files!

Best, Christoph

5 sep. 2023 kl. 16:27 skrev leoforster @.***>:

Hello,

I am trying to run zUMIs to quantify UMI counts from a SmartSeq3 dataset from SRA. I fetch the SRA files using prefetch from sratools and convert to fastq keeping orignal readnames using --origfmt and --define-qual '+' in fastq-dump as suggested on the zUMIs wiki page. I compress the resulting fastq files with gzip to save space, then run merge_demultiplexed_fastq.R to enable quantification via zUMIs:

Rscript zUMIs/misc/merge_demultiplexed_fastq.R -d fastqs_for_zUMIs/ -p /usr/bin/pigz -t 22

to my surprise, when running the zUMIs pipeline (using the supplied conda environments; also with a reference created on STAR v2.7.3 to ensure compatibility) the preprocessing seems to fail, throwing errors as "ERROR! Fastq files are not in the same order. Make sure to provide reads in the same order.". At the same time the program doesn't terminate and seems to initiate a STAR mapping which hangs during a splice-junction insertion step forever (?).

Taking a closer look this error seems to be introduced by merge_demultiplexed_fastq.R, im just wondering why and how i can resolve it. The R1 and R2 files have the same number of reads, while the index file has a single additional read:

zgrep -c $ reads_for_zUMIs.R1.fastq.gz 1329620363 zgrep -c $ reads_for_zUMIs.R2.fastq.gz 1329620363 zgrep -c $ reads_for_zUMIs.index.fastq.gz 1329620364 grepping the readnames and running diff i am able to find the culprit:

zgrep @.***:1220:HGL5HCCX2:7:1101:16731:1784' reads_for_zUMIs.index.fastq.gz @ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784

zgrep @.:1220:HGL5HCCX2:7:1101:16731:1784' reads_for_zUMIs.R1.fastq.gz zgrep @.:1220:HGL5HCCX2:7:1101:16731:1784' reads_for_zUMIs.R2.fastq.gz

no hits

but this read is present in my original supplied R1 and R2 files:

zgrep -A4 @.***:1220:HGL5HCCX2:7:1101:16731:1784' SRR19885001_1.fastq.gz @ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784 GGGTAGCTGCAGGGGGATTGAGCCCCACAATGGCTGACCTTAAAAGTACTGTGCCAGCAAGACTCACTCGGAGAAAATGTGCTGTCTCTTATACACATCTCCGAGCCCACCAGACAGGCAGAAATCGCGTATGCCGTCTCCTGCCCGAACC + AAAAFAA-F<FAJ77FAAAAF7<FF<AJAFJFFAF-FJJ--7<A<---<-<---7FJAA-<--<<<<--77-7-<FF----<F<-7-<A---FJ-A7F77FJ---AA7-A----7--777---7-<-77<7F777--7-----7-------

zgrep -A4 @.***:1220:HGL5HCCX2:7:1101:16731:1784' SRR19885001_2.fastq.gz @ST-E00114:1220:HGL5HCCX2:7:1101:16731:1784 ATACACATCTGACGCTGCCGACGAGATTTCCAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAACCGCAAAGATGCCCCTGTACCTTATCCTCATAGCAATCCTCCGAGTAAGACGCCCCCGTCCATCTCGGCCGCCGCCGCCGCCCC + AAFFFJFJJJJJJJJJJ<JFJJJFJAAFFJJJJJJJJJFJFFJJJF7J-AFJJJJFFFA<FFJAFFJ77--7-<----7AFFJJ--7F7AA7-<-<-7---------7A----7---7-)7)-))-)))----7)-)))-)-)<-))))-)

so what has happened here? are reads removed in merge_demultiplex.R but not reflected in the newly generated index file? how can i fix this issue and facilitate a smooth zUMIs run?

Thanks in advance for your support.

Leo

— Reply to this email directly, view it on GitHub https://github.com/sdparekh/zUMIs/issues/371, or unsubscribe https://github.com/notifications/unsubscribe-auth/AD6PWY3VXEX724XHZOAYCSLXY4ZEXANCNFSM6AAAAAA4L4JQP4. You are receiving this because you are subscribed to this thread.

leoforster commented 1 year ago

ok this was indeed an issue of truncated files from SRA.

for posterity, I now check SRA download md5sums with vdb-validate and use fastqc with multiqc to check readlengths across files. Alternatively, counting line lengths with ls *fastq.gz | xargs -I % sh -c "zcat % | awk '{++a[length()]} END{for (i in a) print i, a[i]}' > %.linecounts.txt" is reasonably fast too.