kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

"IndexError: list index out of range" during "parallel-fastq-dump" in getfastq #93

Closed Hego-CCTB closed 2 years ago

Hego-CCTB commented 2 years ago
Processing SRA ID: SRR7758420
spot_length cannot be obtained directly from the metadata.
Using total_bases/total_spots instead: 50
Library layout: single
Number of reads: 32,659,448
Single/Paired read length: 50 bp
Total bases: 1,632,972,400 bp
Processing SRR7758420 as publicly available data from SRA.
Previously-downloaded sra file was detected.
Total sampled bases: 220,588,250 bp
Command: parallel-fastq-dump -t 2 --minReadLen 25 --qual-filter-1 --skip-technical --split-3 --clip --gzip --outdir /gfe_data/transcriptome_assembly/tmp/1_Fagopyrum_esculentum/getfastq/Fagopyrum_esculentum.txt --tmpdir /gfe_data/transcriptome_assembly/tmp/1_Fagopyrum_esculentum/getfastq/Fagopyrum_esculentum.txt --minSpotId 10000 --maxSpotId 4421764 -s /gfe_data/transcriptome_assembly/tmp/1_Fagopyrum_esculentum/getfastq/Fagopyrum_esculentum.txt/SRR7758420.sra
parallel-fastq-dump stdout:

parallel-fastq-dump stderr:
2022-02-14 17:52:33,800 - SRR ids: ['/gfe_data/transcriptome_assembly/tmp/1_Fagopyrum_esculentum/getfastq/Fagopyrum_esculentum.txt/SRR7758420.sra']
2022-02-14 17:52:33,800 - extra args: ['--minReadLen', '25', '--qual-filter-1', '--skip-technical', '--split-3', '--clip', '--gzip']
2022-02-14 17:52:33,802 - tempdir: /gfe_data/transcriptome_assembly/tmp/1_Fagopyrum_esculentum/getfastq/Fagopyrum_esculentum.txt/pfd_3l5vhgvb
2022-02-14 17:52:33,802 - CMD: sra-stat --meta --quick /gfe_data/transcriptome_assembly/tmp/1_Fagopyrum_esculentum/getfastq/Fagopyrum_esculentum.txt/SRR7758420.sra
Traceback (most recent call last):
  File "/opt/conda/envs/biotools/bin/parallel-fastq-dump", line 116, in get_spot_count
    total += int(l.split('|')[2].split(':')[0])
IndexError: list index out of range

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/biotools/bin/parallel-fastq-dump", line 181, in <module>
    main()
  File "/opt/conda/envs/biotools/bin/parallel-fastq-dump", line 175, in main
    pfd(args, si, extra_args)
  File "/opt/conda/envs/biotools/bin/parallel-fastq-dump", line 49, in pfd
    n_spots = get_spot_count(srr_id)
  File "/opt/conda/envs/biotools/bin/parallel-fastq-dump", line 122, in get_spot_count
    raise IndexError(msg.format('\n'.join(txt), '\n'.join(etxt)))
IndexError: sra-stat output parsing error!
--sra-stat STDOUT--

--sra-stat STDERR--
2022-02-14T17:52:33 sra-stat.2.11.0 warn: zombie file detected: '/gfe_data/transcriptome_assembly/tmp/1_Fagopyrum_esculentum/getfastq/Fagopyrum_esculentum.txt/SRR7758420.sra/tbl/SEQUENCE/col/QUALITY/data'
2022-02-14T17:52:33 sra-stat.2.11.0 int: type unexpected while visiting directory - data: during KDirectoryVisit
2022-02-14T17:52:33 sra-stat.2.11.0 int: type unexpected while visiting directory - QUALITY: while calling KDirectoryVisit
2022-02-14T17:52:33 sra-stat.2.11.0 int: type unexpected while visiting directory - col: while calling KDirectoryVisit
2022-02-14T17:52:33 sra-stat.2.11.0 int: type unexpected while visiting directory - SEQUENCE: while calling KDirectoryVisit
2022-02-14T17:52:33 sra-stat.2.11.0 int: type unexpected while visiting directory - tbl: while calling KDirectoryVisit
2022-02-14T17:52:33 sra-stat.2.11.0 int: type unexpected while visiting directory - while calling KDirectoryVisit

amalgkit did not safely finish. Exiting.

Hego-CCTB commented 2 years ago

I encountered this while running gfe_transcriptome_assembly.sh from the gfe_pipeline. It is probably an amalgkit-related issue, though. The likely cause is a corrupted .sra file, but I will need to investigate.

Hego-CCTB commented 2 years ago

Okay, after manually removing the .sra file and rerunning, the download and parallel-fastq-dump worked as expected. The fault was very likely an incomplete download from a previous run. I'll close this for now, but there may be value in implementing a failsafe that redownloads the .sra file (once) if parallel-fastq-dump throws an error.
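
A rough sketch of what such a failsafe could look like, in case it helps the discussion. This is not amalgkit's actual code: the function name `dump_with_one_redownload` and the `download` callback are hypothetical, and it assumes that parallel-fastq-dump exits with a non-zero status when it hits an error like the traceback above.

```python
import os
import subprocess


def dump_with_one_redownload(sra_path, dump_cmd, download, run=subprocess.run):
    """Run dump_cmd; if it fails, delete sra_path, re-download it, and retry once.

    sra_path: path to the local .sra file that dump_cmd reads.
    dump_cmd: argument list, e.g. the parallel-fastq-dump command from the log.
    download: hypothetical callable that fetches the .sra file to sra_path again.
    run:      injectable runner with the subprocess.run interface (for testing).
    """
    last = None
    for attempt in range(2):  # first try plus exactly one retry
        last = run(dump_cmd, capture_output=True, text=True)
        if last.returncode == 0:
            return last
        if attempt == 0:
            # Assume the cached file is incomplete: discard it and fetch a fresh copy.
            if os.path.isfile(sra_path):
                os.remove(sra_path)
            download(sra_path)
    raise RuntimeError(
        "dump command still failed after one re-download:\n" + (last.stderr or "")
    )
```

Limiting the retry to a single re-download keeps a file that is genuinely broken upstream from causing an endless download loop; in that case the original stderr is still surfaced in the raised error.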