ncbi / sra-tools

SRA Tools

Fastq-dump and prefetch #968

Closed ab4cp closed 1 month ago

ab4cp commented 1 month ago

Hi,

I am having issues with the output of the fasterq-dump command, and I find it very unreliable. For example, when I download accession GSM4710468 (SRR12386359) with fasterq-dump, the output is a single file, even though the files submitted to SRA contain paired-end reads. If I run the same fasterq-dump command on another accession, GSM5651509 (SRR16541691), it gives the correct output.

For the SRR12386359 accession, the output does not change when I use fasterq-dump with --split-3, --split-files, or --include-technical. If I use fastq-dump --split-files, I get 2 files with the headers attached. These look to me like the correct read 1 and read 2 of the paired-end run.

My questions are: 1) Am I correct that these are the correct read 1 and read 2 for input into cellranger? 2) Would I be better off overall using fastq-dump --split-files rather than fasterq-dump? I am downloading 800 SRR accessions, and in my experience fastq-dump --split-files seems more reliable, which is affected by the inconsistency of the file formats submitted to SRA in the first place. 3) Does prefetch significantly increase download speed when using fastq-dump --split-files?

Thanks for the help.

durbrow commented 1 month ago

Please send your issue to the SRA curators at sra@ncbi.nlm.nih.gov. The curators may need to reload the data or ask the submitter to fix it.

This repository is for the SRA toolkit. We can handle technical problems and bugs in the tools, but we have no control over the data files in the archive.

ab4cp commented 1 month ago

@durbrow thank you for replying. I would like this issue reopened if possible, as I do not consider it resolved.

This is not a data archive file issue. I have consistently found that fasterq-dump fails to produce the expected output on multiple different datasets. From my experience, I would say it produces the expected output in 50% of cases (I have used it over 400 times). As I cannot trust fasterq-dump, I would use another option, which is fastq-dump.

My questions are: 1) Am I correct that these are the correct read 1 and read 2 for input into cellranger? I am downloading 800 SRR accessions, and in my experience fastq-dump --split-files seems more reliable, which is affected by the inconsistency of the file formats submitted to SRA in the first place. 2) Does prefetch significantly increase download speed when using fastq-dump --split-files?

wraetz commented 4 weeks ago

Here is the problem (you can verify this yourself):

$ vdb-dump SRR12588682 -R1 -C READ_TYPE
READ_TYPE: SRA_READ_TYPE_BIOLOGICAL, SRA_READ_TYPE_BIOLOGICAL

The accession above has 2 biological reads in it, so the output is what you expect from fasterq-dump as well as from fastq-dump.

$ vdb-dump SRR12386350 -R1 -C READ_TYPE
READ_TYPE: SRA_READ_TYPE_TECHNICAL, SRA_READ_TYPE_TECHNICAL, SRA_READ_TYPE_BIOLOGICAL

This accession has 1 biological read and 2 technical reads in it. fasterq-dump gives you just 1 output file. This is not 'unreliable'; it is the correct output. You can make the older fastq-dump give you 2 files, but one of them is not biological but technical (barcode or linker).

Not all SRA accessions have 2 biological reads in them. If you think that the accessions in question should have 2 biological reads, please contact sra@ncbi.nlm.nih.gov as Ken Durbrow suggested. Reopening the issue does not help you: it is not a technical software problem, it is a data problem.
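For anyone checking many accessions, the READ_TYPE inspection above could be scripted. The following is a minimal Python sketch, not part of the toolkit: the helper names `count_read_types` and `read_type_line_for` are hypothetical, and the parser assumes the output is a single `READ_TYPE: ...` line with comma-separated values, as in the two outputs quoted in this thread. Real output may differ between toolkit versions.

```python
import subprocess

BIOLOGICAL = "SRA_READ_TYPE_BIOLOGICAL"
TECHNICAL = "SRA_READ_TYPE_TECHNICAL"


def count_read_types(read_type_line):
    """Count biological and technical reads in a 'READ_TYPE: a, b, c' line.

    Assumes the label and the values are separated by the first colon.
    """
    # Drop the 'READ_TYPE:' label, then split the comma-separated values.
    _, _, values = read_type_line.partition(":")
    types = [v.strip() for v in values.split(",") if v.strip()]
    return {
        "biological": sum(BIOLOGICAL in t for t in types),
        "technical": sum(TECHNICAL in t for t in types),
    }


def read_type_line_for(accession):
    """Run the vdb-dump command shown in the thread and return its output.

    Hypothetical helper: requires vdb-dump on PATH and network access.
    """
    result = subprocess.run(
        ["vdb-dump", accession, "-R1", "-C", "READ_TYPE"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


# Sample lines copied from the outputs quoted in this thread.
paired = "READ_TYPE: SRA_READ_TYPE_BIOLOGICAL, SRA_READ_TYPE_BIOLOGICAL"
single = ("READ_TYPE: SRA_READ_TYPE_TECHNICAL, SRA_READ_TYPE_TECHNICAL, "
          "SRA_READ_TYPE_BIOLOGICAL")
print(count_read_types(paired))
print(count_read_types(single))
```

An accession that reports fewer than 2 biological reads would then be expected to yield a single file from fasterq-dump, which is the case described above.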

durbrow commented 4 weeks ago

This repo is for the SRA software. We are not able to deal with problems with SRA data.

1) am I correct that these are the correct read 1 and 2 for input into cellranger?

This is not a question that we know how to answer. This is a data question or a question about cellranger.

2) does prefetch significantly increase download speed when using fastq-dump --split-files?

If you prefetch, then fastq-dump will not be downloading anything. The purpose of fastq-dump (and fasterq-dump) is to extract FASTQ format from an SRA data file. The purpose of prefetch is to download an SRA data file (and any prerequisites it may have). fastq-dump (and fasterq-dump) works fastest with files that have already been downloaded by prefetch.
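The download-then-extract split described above could be sketched in Python as follows. This is an illustration under assumptions, not an official recipe: `commands_for` and `fetch_and_extract` are hypothetical helper names, both tools are assumed to be on PATH, and fasterq-dump is assumed to be run from the same working directory (or with the same toolkit configuration) so that it finds the file prefetch downloaded. The `--split-files` flag is one of the options mentioned earlier in this thread.

```python
import subprocess


def commands_for(accession, extra_dump_args=()):
    """Build the two commands: download first, then extract FASTQ."""
    return [
        ["prefetch", accession],                        # downloads the SRA data file
        ["fasterq-dump", *extra_dump_args, accession],  # extracts FASTQ from it
    ]


def fetch_and_extract(accession, extra_dump_args=()):
    """Run both steps in order, stopping at the first failure.

    Hypothetical helper: needs the SRA toolkit installed and network access.
    """
    for cmd in commands_for(accession, extra_dump_args):
        subprocess.run(cmd, check=True)


# Dry run: print the commands instead of executing them.
for cmd in commands_for("SRR12386359", ("--split-files",)):
    print(" ".join(cmd))
```

Keeping command construction separate from execution makes it easy to review the full list of commands for a batch of accessions before starting any downloads.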

I am downloading 800 SRR accessions and from my experience fastq-dump --split-files seems more reliable which is impacted by the inconsistency of files format's submitted to sra in the first place.

Only you can choose the tool that works best for your use case.