nf-core / fetchngs

Pipeline to fetch metadata and raw FastQ files from public databases
https://nf-co.re/fetchngs

SRATOOLS_FASTERQDUMP failing due to missing fastq file #317

Open edg1983 opened 1 month ago

edg1983 commented 1 month ago

Description of the bug

I have a list of SRR IDs in a file and am trying to download them using the pipeline. However, I constantly get an error at the SRATOOLS_FASTERQDUMP step.

This is an example of a failing .command.sh:

#!/bin/bash -euo pipefail
export NCBI_SETTINGS="$PWD/user-settings.mkfg"

fasterq-dump \
    --split-files --include-technical \
    --threads 6 \
    --outfile SRX12493307_SRR16208975 \
     \
    SRR16208975

pigz \
     \
    --no-name \
    --processes 6 \
    *.fastq

cat <<-END_VERSIONS > versions.yml
"NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP":
    sratools: $(fasterq-dump --version 2>&1 | grep -Eo '[0-9.]+')
    pigz: $( pigz --version 2>&1 | sed 's/pigz //g' )
END_VERSIONS

From the log, it seems that no fastq files are found in the folder when pigz runs after fasterq-dump. Inspecting the work folder, I can see a file named SRX12493307_SRR16208975, which contains reads but has no .fastq extension, and thus the pigz command fails.

Command used and terminal output

Command used:

nextflow run nf-core/fetchngs --input SRP340133_accessions.csv --outdir ./SRP340133 -profile conda,ht_cpu -c /ssu/gassu/small_bits/fht_profile.config -c custom.conf

Here, custom.conf is used to pass --max-size 50g to prefetch, since some of the datasets are larger than the default 20 GB limit.

This is the error message from Nextflow:

ERROR ~ Error executing process > 'NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP (SRX12493307_SRR16208975)'

Caused by:
  Process `NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP (SRX12493307_SRR16208975)` terminated with an error exit status (1)

Command executed:

  export NCBI_SETTINGS="$PWD/user-settings.mkfg"

  fasterq-dump \
      --split-files --include-technical \
      --threads 6 \
      --outfile SRX12493307_SRR16208975 \
       \
      SRR16208975

  pigz \
       \
      --no-name \
      --processes 6 \
      *.fastq

  cat <<-END_VERSIONS > versions.yml
  "NFCORE_FETCHNGS:SRA:FASTQ_DOWNLOAD_PREFETCH_FASTERQDUMP_SRATOOLS:SRATOOLS_FASTERQDUMP":
      sratools: $(fasterq-dump --version 2>&1 | grep -Eo '[0-9.]+')
      pigz: $( pigz --version 2>&1 | sed 's/pigz //g' )
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  spots read      : 117,423,804
  reads read      : 117,423,804
  reads written   : 117,423,804
  pigz: skipping: *.fastq does not exist


Relevant files

_No response_

System information

Nextflow: 24.04.2
Hardware: HPC
Executor: Slurm
Container engine: Conda
Pipeline version: 1.12

Midnighter commented 1 month ago

Can you try to run that ID manually, please? Some files are simply broken.

edg1983 commented 1 month ago

If I run fasterq-dump manually, it completes correctly, but the output file doesn't have a .fastq extension (the file name is simply SRX12493307_SRR16208975). This causes the subsequent pigz command to fail.

I see the same error for all of the roughly 40 SRATOOLS_FASTERQDUMP processes, so I assume the same thing is happening for all of them.

Midnighter commented 1 month ago

Can you show your custom config, please?

edg1983 commented 1 month ago

Yes, sure. Here it is:

custom.conf

process {
    withName: SRATOOLS_PREFETCH {
        ext.args = '--max-size 50g'
    }
}

Midnighter commented 1 month ago

You overwrote the default args; can you add them back as well?

--split-files --include-technical

edg1983 commented 1 month ago

Hi.

With the custom config, I'm adding the --max-size option to prefetch, not to fasterq-dump, right?

If I get this right, I'm not altering the options used by the SRATOOLS_FASTERQDUMP process. Indeed, I can see in the .command.sh that the required options (--split-files --include-technical) are set for fasterq-dump (see the code I pasted above).

Am I missing something?

Could it be that the issue is caused by SRR records that are registered as PAIRED in the SRA archive but apparently contain only one read? See this example: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR16208945&display=metadata

Midnighter commented 1 month ago

That's my bad; I didn't pay attention to the fact that you customized prefetch rather than fasterq-dump, of course.

I can try looking at this ID myself later. In the meantime, can you remove the IDs that cause errors and run only the rest?

Why did you choose sratools for the download rather than direct file downloads?

edg1983 commented 1 month ago

I ran the pipeline with default settings, and my understanding is that the default download method is FTP, right?

The pipeline triggered the sratools workflow automatically for some of the entries, I guess because FASTQ files were unavailable from the FTP source?

What I can see is that there were 92 successful executions of SRA_FASTQ_FTP, and the sratools workflow was also run for some entries.

Midnighter commented 1 month ago

I just looked at one of the runs, and the read count is reported as zero, so something seems off with that run: https://www.ebi.ac.uk/ena/browser/view/SRR16208975

Midnighter commented 1 month ago

Okay, I guess the problem has a simple cause. The runs that give you problems are marked as having a paired library layout. However, in reality they seem to contain single reads. Because the pipeline expects paired reads, the module passes an output name without a file extension, since fasterq-dump will normally add the suffixes and extensions itself:

def outfile = meta.single_end ? "${prefix}.fastq" : prefix

However, since those runs contain single reads, the extension is not added, which then causes pigz to fail. Perhaps a direct way to solve this is to create your own prefix for those samples only. Something like:

process {
    withName: SRATOOLS_FASTERQDUMP {
        ext.prefix = { "${meta.id}.fastq" }
    }
}
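
With that override, the --outfile value passed to fasterq-dump already ends in .fastq, so even when only a single read file is produced, the pigz *.fastq glob should still match it.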
edg1983 commented 1 month ago

Hi,

Yes, this was also my hypothesis. Unfortunately, this is not the first time I have seen this kind of inconsistency in single-cell datasets in SRA. I'm not sure there is a way for the pipeline to catch this inconsistency from the SRA record metadata (library layout == PAIRED, but only one read in the record) and then generate a more informative warning message.
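
For reference, one rough way to flag such records up front might be to query the ENA Portal API for each run before launching the pipeline. A sketch only (accessions.txt, with one run accession per line, is a hypothetical input; it flags runs whose layout says PAIRED while the reported read count is zero or missing, as observed for SRR16208975 above):

#!/usr/bin/env bash
# Sketch: flag runs whose ENA metadata looks inconsistent, i.e.
# library_layout is PAIRED but the reported read_count is zero or missing.
# accessions.txt is a hypothetical file with one SRR accession per line.
while read -r acc; do
    curl -s "https://www.ebi.ac.uk/ena/portal/api/filereport?accession=${acc}&result=read_run&fields=run_accession,library_layout,read_count&format=tsv" \
        | tail -n +2 \
        | awk -F'\t' '$2 == "PAIRED" && ($3 == "" || $3 == "0") {
              print $1 ": marked PAIRED but read_count is " ($3 == "" ? "missing" : $3)
          }'
done < accessions.txt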

Thanks for your suggestion. I will give it a try.

Meanwhile, I worked around it by editing the module script and adding a couple of lines of bash that check for files with a non-zero size and the expected name but no .fastq extension, and append the .fastq extension to those, as sketched below.
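
In case it's useful to others, the check looks roughly like this (a sketch of the idea, not the exact patch; the prefix value here is just the example from this issue, which the real module interpolates per sample), inserted between the fasterq-dump and pigz calls:

# Sketch of the workaround: if fasterq-dump left a single non-empty
# output file under the bare prefix (no .fastq extension), rename it so
# that the pigz *.fastq glob matches it.
prefix="SRX12493307_SRR16208975"
if [ -s "${prefix}" ] && [ ! -e "${prefix}.fastq" ]; then
    mv "${prefix}" "${prefix}.fastq"
fi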

Thanks again for investigating this!