fasterq-dump fails due to output file naming error

dmalzl commented 11 months ago

I am currently trying to download a couple of raw sequencing data files using sra-tools prefetch and fasterq-dump. Prefetch works fine, but I get a weird error when trying to convert the generated *.sra file to fastq with fasterq-dump. The data is paired-end and the actual path should be /scratch/daniel.malzl/work/aa/7ab6e5d29db7a0352a1f1cd4af2af3/SRX10737613_SRR14385311. Judging by the error message, however, there seems to be a bug in the renaming code, because it says the following:

        Error: fasterq-dump cannot create this file: '/scratch/daniel_1.malzl/work/aa/7ab6e5d29db7a0352a1f1cd4af2af3/SRX10737613_SRR14385311'

        Error: fasterq-dump cannot create this file: '/scratch/daniel_2.malzl/work/aa/7ab6e5d29db7a0352a1f1cd4af2af3/SRX10737613_SRR14385311'
spots read      : 174,563,529
reads read      : 349,127,058

=============================================================
An error occurred during processing.
A report was generated into the file '/users/daniel.malzl/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to 'sra-tools@ncbi.nlm.nih.gov' for assistance.
=============================================================

fasterq-dump quit with error code 3

so it seems to insert the read1, read2 suffixes into the path causing the path to be invalid.

The version I am using is 3.0.8.

dmalzl commented 11 months ago

the executed command was this

fasterq-dump \
    --split-files --include-technical \
    --threads 6 \
    --outfile SRX10737613_SRR14385311 \
     \
    SRR14385311

wraetz commented 11 months ago

It looks like the tool is confused about the output file. Try this command: 'fasterq-dump --split-files --include-technical SRR14385311'. The --threads 6 is not necessary; it is the default. The --outfile is not necessary either; the tool will create the output filename from the accession. I think it is confused because you included the experiment in the output file name. It should not be confused by that, so I will have to investigate why this happens. In the meantime, try the shortened command.

dmalzl commented 11 months ago

Thanks for the swift response and the workaround. I'll try to modify the code of the pipeline I am using. However, to me it looks like the path gets split at the . character at some point in the process, the _1 or _2 suffix is inserted, and the pieces are then concatenated again. So it might be the . that is confusing it, but I will try and report back.
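To make the hypothesis concrete, here is a minimal shell sketch that reproduces the symptom from the error message. It is not the sra-tools code: the function name naive_insert_suffix is hypothetical, and splitting at the first . of the full path is an assumption (the paths in this thread contain only one . each, so they cannot distinguish first from last).

```shell
#!/usr/bin/env bash
# Hypothetical illustration, NOT the sra-tools implementation:
# insert a read suffix before the first "." found anywhere in the full path.
naive_insert_suffix() {
    local path=$1 suffix=$2
    case $path in
        # Path contains a ".": split there, add the suffix, glue it back together.
        *.*) printf '%s%s.%s\n' "${path%%.*}" "$suffix" "${path#*.}" ;;
        # No "." at all: simply append the suffix.
        *)   printf '%s%s\n' "$path" "$suffix" ;;
    esac
}

naive_insert_suffix /scratch/daniel.malzl/work/aa/7ab6e5d29db7a0352a1f1cd4af2af3/SRX10737613_SRR14385311 _1
# prints /scratch/daniel_1.malzl/work/aa/7ab6e5d29db7a0352a1f1cd4af2af3/SRX10737613_SRR14385311
```

The suffix lands in the directory name daniel.malzl rather than in the file name, which matches the invalid path shown in the error.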

wraetz commented 11 months ago

By the way, which version of fasterq-dump are you using?

dmalzl commented 11 months ago

the version is 3.0.8

dmalzl commented 11 months ago

Just to let you know: this does not occur in version 2.11.0.

drpatelh commented 9 months ago

Thanks for reporting @dmalzl ! And thanks for investigating @wraetz 🙏🏽

I have managed to reproduce the issue and the problem is indeed the fact that a . exists in the path where the output files will be written.

Defined a Conda environment in a file called env.yml with the dependencies below (you can exclude pigz if you like):

name: sra-tools-3.0.8
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - conda-forge::pigz=2.6
  - bioconda::sra-tools=3.0.8

Created the environment

conda env create -f env.yml

✅ Run with a path without a .

mkdir testwithoutdot
cd testwithoutdot

prefetch SRR12848126

fasterq-dump \
        --split-files --include-technical \
        --outfile SRX9315476_SRR12848126 \
        SRR12848126

2024-01-05T11:41:39 prefetch.3.0.8: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2024-01-05T11:41:39 prefetch.3.0.8: 1) Downloading 'SRR12848126'...
2024-01-05T11:41:39 prefetch.3.0.8: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2024-01-05T11:41:39 prefetch.3.0.8:  Downloading via HTTPS...
2024-01-05T11:41:40 prefetch.3.0.8:  HTTPS download succeed
2024-01-05T11:41:40 prefetch.3.0.8:  'SRR12848126' is valid
2024-01-05T11:41:40 prefetch.3.0.8: 1) 'SRR12848126' was downloaded successfully
2024-01-05T11:41:41 prefetch.3.0.8: 'SRR12848126' has 1 unresolved dependency
2024-01-05T11:41:41 prefetch.3.0.8: 2) Downloading 'ncbi-acc:NC_000069.6?vdb-ctx=refseq'...
2024-01-05T11:41:41 prefetch.3.0.8:  Downloading via HTTPS...
2024-01-05T11:41:43 prefetch.3.0.8:  HTTPS download succeed
2024-01-05T11:41:43 prefetch.3.0.8: 2) 'ncbi-acc:NC_000069.6?vdb-ctx=refseq' was downloaded successfully
spots read      : 1,517
reads read      : 3,034
reads written   : 2,982

❌ Run with a path that contains a .

mkdir test.withdot
cd test.withdot

prefetch SRR12848126

fasterq-dump \
        --split-files --include-technical \
        --outfile SRX9315476_SRR12848126 \
        SRR12848126

2024-01-05T11:37:35 prefetch.3.0.8: Current preference is set to retrieve SRA Normalized Format files with full base quality scores.
2024-01-05T11:37:35 prefetch.3.0.8: 1) Downloading 'SRR12848126'...
2024-01-05T11:37:35 prefetch.3.0.8: SRA Normalized Format file is being retrieved, if this is different from your preference, it may be due to current file availability.
2024-01-05T11:37:35 prefetch.3.0.8:  Downloading via HTTPS...
2024-01-05T11:37:36 prefetch.3.0.8:  HTTPS download succeed
2024-01-05T11:37:36 prefetch.3.0.8:  'SRR12848126' is valid
2024-01-05T11:37:36 prefetch.3.0.8: 1) 'SRR12848126' was downloaded successfully
2024-01-05T11:37:37 prefetch.3.0.8: 'SRR12848126' has 1 unresolved dependency
2024-01-05T11:37:37 prefetch.3.0.8: 2) Downloading 'ncbi-acc:NC_000069.6?vdb-ctx=refseq'...
2024-01-05T11:37:37 prefetch.3.0.8:  Downloading via HTTPS...
2024-01-05T11:37:55 prefetch.3.0.8:  HTTPS download succeed
2024-01-05T11:37:55 prefetch.3.0.8: 2) 'ncbi-acc:NC_000069.6?vdb-ctx=refseq' was downloaded successfully

        Error: fasterq-dump cannot create this file: '/home/harshil/test_2.withdot/SRX9315476_SRR12848126'

        Error: fasterq-dump cannot create this file: '/home/harshil/test_1.withdot/SRX9315476_SRR12848126'
spots read      : 1,517
reads read      : 3,034

=============================================================
An error occurred during processing.
A report was generated into the file '/home/harshil/ncbi_error_report.txt'.
If the problem persists, you may consider sending the file
to 'sra-tools@ncbi.nlm.nih.gov' for assistance.
=============================================================

fasterq-dump quit with error code 3
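Given the two runs above, a pipeline could warn before calling fasterq-dump 3.0.8 when the output directory contains a . in its path. The following is a rough sketch using only shell string handling; the function name outdir_has_dot is hypothetical, and it assumes it is given an absolute output path (a relative path such as ./reads would be flagged because of the leading .).

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check; pure string handling, no sra-tools calls.
# Returns 0 when the directory part of an output path contains a ".".
outdir_has_dot() {
    local path=$1 dir
    case $path in
        */*) dir=${path%/*} ;;   # everything before the last "/"
        *)   dir= ;;             # bare file name: no directory part
    esac
    case $dir in
        *.*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Example: check the absolute path that a relative --outfile would resolve to.
if outdir_has_dot "$PWD/SRX9315476_SRR12848126"; then
    echo "warning: output directory contains a '.'; fasterq-dump 3.0.8 may fail" >&2
fi
```

A . in the file name itself is deliberately ignored here, because the reproduction only demonstrates the failure for a . in the directory part.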

adamrtalbot commented 8 months ago

The problem is this function here, which splits on any period found and creates a new filename. It should split on the final period only or, even better, use some form of path handling (I am not 100% familiar with the code).

https://github.com/ncbi/sra-tools/blob/8575947cb74af06760b670ddc58aa318149769a6/tools/external/fasterq-dump/sbuffer.c#L146-L172
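As a rough sketch of the "path handling" idea, the suffix can be applied to the file name only, leaving the directory part untouched. This is a shell illustration and not a patch for the linked C function; the name insert_suffix_in_basename is hypothetical, and placing the suffix before the last extension is an assumption about the desired naming.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of the suggested approach, NOT the sra-tools code:
# separate directory and file name first, then insert the suffix into the
# file name only (before its last extension, if it has one).
insert_suffix_in_basename() {
    local path=$1 suffix=$2 base dir
    base=${path##*/}        # file name: everything after the last "/"
    dir=${path%"$base"}     # directory part, including the trailing "/"
    case $base in
        # Name with an extension (and not just a leading dot): name_SUFFIX.ext
        ?*.*) printf '%s%s%s.%s\n' "$dir" "${base%.*}" "$suffix" "${base##*.}" ;;
        # No extension: append the suffix to the file name.
        *)    printf '%s%s%s\n' "$dir" "$base" "$suffix" ;;
    esac
}

insert_suffix_in_basename /home/harshil/test.withdot/SRX9315476_SRR12848126 _1
# prints /home/harshil/test.withdot/SRX9315476_SRR12848126_1
insert_suffix_in_basename out/reads.fastq _2
# prints out/reads_2.fastq
```

Because the directory is split off before any period is examined, a . in a directory name such as test.withdot can no longer end up next to the read suffix.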

ncbi / sra-tools

fasterq-dump fails due to output file naming error #865