harvardinformatics / snpArcher

Snakemake workflow for highly parallel variant calling designed for ease-of-use in non-model organisms.
MIT License

get_fastq_pe: fasterq-dump quit with error code 3 #79

Closed · erikenbody closed this issue 1 year ago

erikenbody commented 1 year ago

Hit this error on get_fastq_pe today:

disk-limit exeeded! fasterq-dump quit with error code 3

I ran the command locally and managed to get rid of the disk-limit error by raising the limits explicitly (plenty of space was available):

fasterq-dump SRR7774199 -O SCA_5/ -e 8 -t /tmp --disk-limit-tmp 100000000000 --disk-limit 100000000000 -x

But I still received error code 3, so I tried a few other things.

I think the issue has to do with how read pairs are split in this accession. Here are commands that don't error: fasterq-dump with no output folder specified runs cleanly, but it produces one fastq file rather than two, and the result is the same when --split-files is provided:

fasterq-dump SRR7774199
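
Spelled out with the observed behavior (a sketch run from the working directory; the second form is the --split-files variant mentioned above):

# runs without error, but writes a single SRR7774199.fastq rather than an _1/_2 pair
fasterq-dump SRR7774199

# still a single fastq file, even when split files are requested explicitly
fasterq-dump SRR7774199 --split-files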

ffq also runs successfully, but it produces 1 fastq file:

ffq --ftp SRR7774199 | grep -Eo '"url": "[^"]*"' | grep -o '"[^"]*"$' | grep "fastq" | xargs curl --remote-name-all --output-dir ffq_SCA_5

fastq-dump also runs without error, and it correctly generates two fastq files when --split-files is specified:

fastq-dump ./SRR7774199 -O SCA_5/ --split-files

All three approaches produce two paired fastq files without error for other accessions in this BioProject. All the accessions originate from BAM file uploads, so I wonder if there is an issue with how SRA handles BAM uploads.

tl;dr: switching to ffq does not solve this issue, but switching to fastq-dump might… though this project on NCBI might just be garbled.

erikenbody commented 1 year ago

I pushed a fix (from Cade's account on our server). Basically, the problem I found is that Snakemake doesn't recognize error code 3 as a failure, so we can't rely on the usual Unix error handling. Instead, I check whether the read 1 file was created and, if not, fall back to fastq-dump. This works with this dataset. It seems related to https://github.com/ncbi/sra-tools/issues/318, but none of the solutions there solved it.
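
For the record, the logic is roughly the following sketch (not the exact code that was pushed; the accession, output directory, and thread count are just carried over from the commands above):

# run fasterq-dump as before; per the above, its error code 3 is not caught by the usual shell error handling
fasterq-dump SRR7774199 -O SCA_5/ -e 8 -t /tmp || true

# the real check: was a read 1 file actually written? if not, retry with fastq-dump
if [ ! -s SCA_5/SRR7774199_1.fastq ]; then
    fastq-dump SRR7774199 -O SCA_5/ --split-files
fi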

erikenbody commented 1 year ago

I might hesitate to implement this change, though, because the read names in the split files from fastq-dump don't match up between mates. Even after sorting them, I cannot map them with BWA:

[mem_sam_pe] paired reads have different names: "SRR7774199.10000000", "SRR7774199.10"

Checking the reads confirms this:

zcat /data/snpArcher/SCA_5/SRR7774199_1_sorted.fastq.gz | head | grep "@"
@SRR7774199.100 100 length=87
@SRR7774199.10000000 10000000 length=87
@SRR7774199.100000000 100000000 length=87
zcat /data/snpArcher/SCA_5/SRR7774199_2_sorted.fastq.gz | head | grep "@"
@SRR7774199.1 1 length=87
@SRR7774199.10 10 length=87
@SRR7774199.1000 1000 length=87

So the bugfix does work, but I'm worried the resulting files are unusable anyway, so it may be better to keep fasterq-dump as is, since it is most likely to produce files that can go through the pipeline.
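
For completeness, one quick way to confirm whether the two fastq-dump files share any read IDs at all (reusing the sorted files above; r1.ids and r2.ids are just scratch files):

# pull the spot ID (first token of each header line, i.e. every 4th line) from each mate file and sort
zcat /data/snpArcher/SCA_5/SRR7774199_1_sorted.fastq.gz | awk 'NR % 4 == 1 {print $1}' | sort > r1.ids
zcat /data/snpArcher/SCA_5/SRR7774199_2_sorted.fastq.gz | awk 'NR % 4 == 1 {print $1}' | sort > r2.ids

# count IDs present in both files; 0 means nothing is actually paired between them
comm -12 r1.ids r2.ids | wc -l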

tsackton commented 1 year ago

The fact that ffq and fasterq-dump can be made to work but produce only a single output file is definitely consistent with errors in how the paired-end information is encoded by NCBI and/or the submitter.

Looking at the read browser, there is definitely something wrong with this accession:

[two screenshots of the SRA read browser for SRR7774199]

It looks like, for each read pair, either read 1 or read 2 has data, but not both. So fasterq-dump correctly outputs a single file, while fastq-dump outputs R1 and R2 files with nothing actually paired between them.
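
If useful, one way to check this directly from the SRA object rather than the browser, assuming vdb-dump from sra-tools behaves as usual (READ_LEN and READ_TYPE are the standard SEQUENCE table columns):

# dump the per-spot read layout for the first few spots;
# one mate having READ_LEN 0 in each spot would match what the browser shows
vdb-dump SRR7774199 -R 1-5 -C READ_LEN,READ_TYPE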

This is the first time we've seen this issue so I think we should leave the code as is and treat errors in this step as data problems. I'm closing this issue but @erikenbody please reopen if you disagree.

erikenbody commented 1 year ago

Thanks @tsackton - I agree, and that was a good idea to check the browser, which makes it clear that only one read per pair was uploaded.