rvalieris / parallel-fastq-dump

parallel fastq-dump wrapper
MIT License

IndexError #38

Closed · nrclaudio closed this issue 3 years ago

nrclaudio commented 3 years ago

Hi,

I'm trying to run parallel-fastq-dump, but I get the error shown below. I found similar issues here, but none of them solves my problem. The specific call is as follows:

parallel-fastq-dump --sra-id $1 --threads 4 --outdir raw/ --split-files --gzip

Where $1 is an accession read from an SRR ID list.

The log from one of the SRA IDs (SRR6337208):

2021-04-30 13:45:38,797 - SRR ids: ['SRR6337208']
2021-04-30 13:45:38,797 - extra args: ['--split-files', '--gzip']
2021-04-30 13:45:38,798 - tempdir: /tmp/pfd_zhzmw3es
2021-04-30 13:45:38,798 - CMD: sra-stat --meta --quick SRR6337208

Traceback (most recent call last):
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 116, in get_spot_count
    total += int(l.split('|')[2].split(':')[0])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 181, in <module>
    main()
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 175, in main
    pfd(args, si, extra_args)
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 49, in pfd
    n_spots = get_spot_count(srr_id)
  File "/exports/humgen/cnovellarausell/conda_envs/parallel-fastq-dump/bin/parallel-fastq-dump", line 122, in get_spot_count
    raise IndexError(msg.format('\n'.join(txt), '\n'.join(etxt)))
IndexError: sra-stat output parsing error!
--sra-stat STDOUT--

--sra-stat STDERR--
2021-04-30T11:47:40 sra-stat.2.11.0 int: directory not found while opening manager within virtual file system module - 
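For context, the parse that fails can be sketched roughly like this (reconstructed from the traceback above; an approximation, not the exact parallel-fastq-dump source):

import subprocess

def get_spot_count(sra_id):
    # Rough sketch of the failing routine, reconstructed from the traceback.
    p = subprocess.run(["sra-stat", "--meta", "--quick", sra_id],
                       capture_output=True, text=True)
    total = 0
    for line in p.stdout.splitlines():
        # Expected format: SRR6337208|TTTCATGA|8118133:1071593556:1006648492|:|:|:
        # Field 2 is "spots:bases:bases-within-clips"; keep the spot count.
        # A line without two '|' separators (e.g. an error message where data
        # was expected) has no field [2], hence the IndexError, which the tool
        # re-raises together with the captured STDOUT/STDERR.
        total += int(line.split("|")[2].split(":")[0])
    return total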
rvalieris commented 3 years ago

hello, I can't reproduce this error, so I'm guessing it's something with your sra-tools configuration. Check `vdb-config -i` and make sure it's configured correctly, then try running `sra-stat --meta --quick SRR6337208` and see if the error persists.

sra-tools help:
https://github.com/ncbi/sra-tools/blob/master/README-vdb-config
https://github.com/ncbi/sra-tools/wiki/05.-Toolkit-Configuration

nrclaudio commented 3 years ago

Hi,

Thanks for the prompt answer. I've checked, and I doubt it has anything to do with the config. When I run the command, the output looks like this:

SRR6337208|TTTCATGA|8118133:1071593556:1006648492|:|:|:
SRR6337208|GTTCATGA|22024:2907168:2730976|:|:|:
SRR6337208|TTTCATGC|42988:5674416:5330512|:|:|:
... cnt'd

I've noticed, however, that if I run the commands individually it works fine. I'm using Slurm (one job submission per SRR ID) and a conda environment with a clean install of parallel-fastq-dump, with sra-toolkit v2.10.9. Any idea why this is?

rvalieris commented 3 years ago

That output looks OK.

If you're using Slurm, the process is running on a different machine, so you have to make sure the config is present on the processing node as well.
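One quick way to verify, assuming `vdb-config` wrote its settings to the default ~/.ncbi/user-settings.mkfg and that your site allows plain srun job steps (both assumptions about your setup):

# Run the same checks on a compute node rather than the login node
srun ls -l ~/.ncbi/user-settings.mkfg
srun sra-stat --meta --quick SRR6337208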

nrclaudio commented 3 years ago

Solved it, somehow, by changing the output directories. I made sure that each call of parallel-fastq-dump within a project had a dedicated directory for its output. I guess it had something to do with my directories containing other folders also named raw.

Note, in case someone comes to this thread with a similar problem: if you are using a shared cluster, make sure to change the temporary directory to a scratch file system if you have one; otherwise your home directory will probably run out of space.
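For example, a minimal sketch assuming your site mounts scratch at /scratch/$USER (adjust the path to your cluster):

# Keep parallel-fastq-dump's temporary files on scratch, not in $HOME
mkdir -p /scratch/$USER/pfd_tmp
parallel-fastq-dump --sra-id SRR6337208 --threads 4 --outdir raw/ \
    --split-files --gzip --tmpdir /scratch/$USER/pfd_tmp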

ChrisSteel-bio commented 2 years ago

I keep getting the same error as @nrclaudio using this tool with Slurm and Snakemake. Strangely, the tool worked fine for the first batch of ~300 SRA files; now, on the second batch run of my pipeline, no matter what I do I keep getting this error. @nrclaudio, how did you get around this problem? How did you configure your directories? Many thanks

kfuku52 commented 2 years ago

I just got the same error, and here is the sra-stat output. In my case, the problem seems to be an incomplete sra file. parallel-fastq-dump successfully finished after re-downloading the sra file.

sra-stat --meta --quick /home/vagrant/gfe_data/transcriptome_assembly/tmp/1_SRA_ID/getfastq/Monotropa_uniflora.txt/SRR11994224.sra
2022-01-09T12:52:12 sra-stat.2.11.0 warn: zombie file detected: '/home/vagrant/gfe_data/transcriptome_assembly/tmp/1_SRA_ID/getfastq/Monotropa_uniflora.txt/SRR11994224.sra/tbl/SEQUENCE/col/READ/data'
2022-01-09T12:52:12 sra-stat.2.11.0 int: type unexpected while visiting directory - data: during KDirectoryVisit
2022-01-09T12:52:12 sra-stat.2.11.0 int: type unexpected while visiting directory - READ: while calling KDirectoryVisit
2022-01-09T12:52:12 sra-stat.2.11.0 int: type unexpected while visiting directory - col: while calling KDirectoryVisit
2022-01-09T12:52:12 sra-stat.2.11.0 int: type unexpected while visiting directory - SEQUENCE: while calling KDirectoryVisit
2022-01-09T12:52:12 sra-stat.2.11.0 int: type unexpected while visiting directory - tbl: while calling KDirectoryVisit
2022-01-09T12:52:12 sra-stat.2.11.0 int: type unexpected while visiting directory - while calling KDirectoryVisit
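For anyone hitting this truncated-file variant, one way to recover with standard sra-tools commands (the accession matches the log above; the exact paths to delete depend on where prefetch put the download):

# Remove the incomplete download, fetch it again, and verify the archive
# before dumping (prefetch and vdb-validate ship with sra-tools)
rm -rf SRR11994224 SRR11994224.sra
prefetch SRR11994224
vdb-validate SRR11994224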
nrclaudio commented 2 years ago

> I keep getting the same error as @nrclaudio using this tool with Slurm and Snakemake. Strangely, the tool worked fine for the first batch of ~300 SRA files; now, on the second batch run of my pipeline, no matter what I do I keep getting this error. @nrclaudio, how did you get around this problem? How did you configure your directories? Many thanks

This will depend heavily on your specifics, but this is what I did for one of my samples (the directories will, of course, be different):

parallel-fastq-dump-MySample.sh

#!/bin/bash
# Submit one Slurm job per SRR accession listed in acc_list.txt
while read -r srr
do
        sbatch parallel-fastq-dump.slurm "$srr" MySample
done < acc_list.txt

parallel-fastq-dump.slurm

#!/bin/bash
#SBATCH -J parallel-fastq-dump

# Clear the environment from any previously loaded modules
module purge > /dev/null 2>&1

# Load the module environment suitable for the job
module load tools/miniconda/python3.8/4.9.2
conda activate parallel-fastq-dump
module load bioinformatics/tools/ncbi/sra/2.10.9

# ${2} is the sample directory the FASTQs will be downloaded to, in this case 'MySample'
cd data/raw/"${2}" || exit 1

echo "$PWD"
parallel-fastq-dump -s "$1" --split-files --gzip -O . --tmpdir /exports/tmp/
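With acc_list.txt holding one accession per line, running bash parallel-fastq-dump-MySample.sh submits one Slurm job per SRR ID, and every job for a given sample writes into that sample's dedicated directory under data/raw/, which avoids the clashing raw/ directories described earlier in the thread.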