edg1983 opened this issue 3 months ago
Can you try to run that ID manually, please? Some files are simply broken.
If I run fasterq-dump manually, it completes correctly, but the output file doesn't have a .fastq extension (the file name is simply SRX12493307_SRR16208975). This causes the subsequent pigz command to fail.
I see the same error for all of the roughly 40 SRATOOLS_FASTERQDUMP processes, so I assume the same thing is happening for all of them.
Can you show your custom config, please?
Yes, sure. Here it is.
custom.conf
process {
    withName: SRATOOLS_PREFETCH {
        ext.args = '--max-size 50g'
    }
}
You overwrote the default args; can you add them back as well? The defaults are:
--split-files --include-technical
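For example, something along these lines in the custom config (just a sketch, assuming those are the fasterq-dump defaults that were overridden):
process {
    withName: SRATOOLS_FASTERQDUMP {
        // restore the module defaults alongside any extra options you need
        ext.args = '--split-files --include-technical'
    }
}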
Hi.
Using the custom config, I'm adding the --max-size option to prefetch and not to fasterq-dump, right?
If I get this right, I'm not altering the options used by the SRATOOLS_FASTERQDUMP process. Indeed, I can see in the .command.sh that the required options (--split-files --include-technical) are set for fasterq-dump (see the code I pasted above).
Am I missing something?
Could the issue be caused by SRR records that are registered as PAIRED in the SRA Archive but apparently contain only one read? See this example: https://trace.ncbi.nlm.nih.gov/Traces/?view=run_browser&acc=SRR16208945&display=metadata
That's my bad, I didn't notice that you were customizing prefetch rather than fasterq-dump, of course.
I can try looking at this ID myself later. In the meantime, can you remove IDs that cause errors and only run the rest?
Why did you choose sratools for the download rather than downloading the files directly?
I ran the pipeline with default settings, and my understanding is that the default download method is FTP, right?
The pipeline triggered the sratools workflow automatically for some of the entries, I guess when FASTQ files were unavailable from the FTP source?
What I can see is that there were 92 successful executions of the SRA_FASTQ_FTP process, and the SRA workflow was also run for some entries.
I just looked at one of the runs and the read count is reported as zero. So something seems off with that run. https://www.ebi.ac.uk/ena/browser/view/SRR16208975
Okay, I guess the problem has a simple cause. The runs that give you problems are marked as having a paired library layout; in reality, however, they seem to be single reads. Because the pipeline expects paired reads, the module passes an output name without a file extension, since fasterq-dump will normally add the suffixes and extensions itself:
def outfile = meta.single_end ? "${prefix}.fastq" : prefix
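// i.e. for a paired layout the bare prefix is passed and fasterq-dump itself appends
// _1.fastq / _2.fastq, while for a single-end run the module adds the .fastq extension here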
However, since those runs are actually single reads, the extension is not added, which then causes pigz to fail. Perhaps a direct way to work around this is to set your own prefix for those samples. Something like:
process {
    withName: SRATOOLS_FASTERQDUMP {
        ext.prefix = { "${meta.id}.fastq" }
    }
}
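If you only want to touch the mislabelled runs, the closure could also be made conditional, roughly like this (the accession in the list is just an example; use the meta.id values of the affected samples):
process {
    withName: SRATOOLS_FASTERQDUMP {
        // force the .fastq prefix only for the affected runs (example ID),
        // fall back to the default prefix (meta.id) for everything else
        ext.prefix = { meta.id in ['SRX12493307_SRR16208975'] ? "${meta.id}.fastq" : "${meta.id}" }
    }
}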
Hi,
Yes, this was also my hypothesis. Unfortunately, this is not the first time I have seen this kind of inconsistency in single-cell datasets in SRA. I'm not sure there is a way for the pipeline to catch this inconsistency from the SRA record metadata (data type == PAIRED, but the number of reads in the record == 1) and generate a more informative warning message.
Thanks for your suggestion. I will give it a try.
Meanwhile, I worked around it by editing the module script and adding a couple of lines of bash that check for files with a non-zero size and the expected name but no .fastq extension, and then append .fastq to those.
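For reference, the added lines look roughly like this (a sketch; the glob is only an example matching the <experiment>_<run> naming of the outputs):
# rename non-empty fasterq-dump outputs that lack a .fastq extension so pigz can find them
for f in SRX*_SRR*; do
    if [ -f "$f" ] && [ -s "$f" ] && [ "${f%.fastq}" = "$f" ]; then
        mv "$f" "${f}.fastq"
    fi
done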
Thanks again for investigating this!
Description of the bug
I have a list of SRR IDs in a file and am trying to download them using the pipeline. However, I constantly get an error at the SRATOOLS_FASTERQDUMP step.
This is an example of a failing .command.sh. From the log, it seems that no fastq files are found in the folder when attempting to run pigz after fasterq-dump.
Inspecting the working folder, I can see a file named SRX12493307_SRR16208975, which contains reads but has no .fastq extension, and thus the pigz command fails.
Command used and terminal output
Here, custom.conf is used to set --max-size 50g, since some of the datasets are larger than 20 GB.
This is the error message from Nextflow.