kfuku52 / amalgkit

RNA-seq data amalgamation for large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

FileNotFoundError #127

Closed kfuku52 closed 1 year ago

kfuku52 commented 1 year ago
getfastq output not found in: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/getfastq/SRR18843778, layout = single
Skipping. If you wish to obtain the .fastq file(s), run: getfastq --id SRR18843778
Traceback (most recent call last):
  File "/home/kfuku/miniconda3/bin/amalgkit", line 374, in <module>
    args.handler(args)
  File "/home/kfuku/miniconda3/bin/amalgkit", line 34, in command_getfastq
    getfastq_main(args)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/getfastq.py", line 844, in getfastq_main
    metadata = sequence_extraction_1st_round(args, sra_stat, metadata, g)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/getfastq.py", line 692, in sequence_extraction_1st_round
    metadata = sequence_extraction(args, sra_stat, metadata, g, start, end)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/getfastq.py", line 665, in sequence_extraction
    metadata = run_fastp(sra_stat, args, sra_stat['output_dir'], metadata)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/getfastq.py", line 384, in run_fastp
    inext = get_newest_intermediate_file_extension(sra_stat, work_dir=output_dir)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/util.py", line 452, in get_newest_intermediate_file_extension
    raise FileNotFoundError
FileNotFoundError
getfastq output not found in: /lustre7/home/lustre4/kfuku/my_project/evolutionary_transcriptomics/20230527_gfe_pipeline/gfe_data/getfastq/SRR18843778, layout = single
Skipping. If you wish to obtain the .fastq file(s), run: getfastq --id SRR18843778
Traceback (most recent call last):
  File "/home/kfuku/miniconda3/bin/amalgkit", line 374, in <module>
    args.handler(args)
  File "/home/kfuku/miniconda3/bin/amalgkit", line 43, in command_quant
    quant_main(args)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/quant.py", line 185, in quant_main
    run_quant(args, metadata, sra_id, index)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/quant.py", line 104, in run_quant
    ext = get_newest_intermediate_file_extension(sra_stat, work_dir=output_dir_getfastq)
  File "/home/kfuku/miniconda3/lib/python3.9/site-packages/amalgkit/util.py", line 452, in get_newest_intermediate_file_extension
    raise FileNotFoundError
FileNotFoundError

kfuku52 commented 1 year ago

Upon the getfastq error, the directory looks like:

/getfastq/SRR18843778
├── SRR18843778.sra
├── SRR18843778_1.fastq.gz
└── SRR18843778_2.fastq.gz

kfuku52 commented 1 year ago

Full stdout/stderr:

/Users/kef74yk/opt/miniconda3/bin/python /Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/amalgkit getfastq --out_dir ./ --threads 4 --batch 3860 --redo no 
AMALGKIT version: 0.9.8
AMALGKIT command: /Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/amalgkit getfastq --out_dir ./ --threads 4 --batch 3860 --redo no
AMALGKIT bug report: https://github.com/kfuku52/amalgkit/issues
amalgkit getfastq: start
pigz found. It will be used for compression/decompression in read name formatting.
2023-05-30 13:35:37.758032: Loading metadata from: /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/metadata/metadata.tsv
--batch is specified. Processing one SRA per job.
This is 3,860th job. In total, 4,302 jobs will be necessary for this metadata table. 0 SRAs were excluded from the table (is_sampled==no).
Individual SRA size of SRR18843778: 6,698,702,200.0 bp
Number of SRAs to be processed: 1
Total target size (--max_bp): 999,999,999,999,999 bp
The sum of SRA sizes: 6,698,702,200.0 bp
Target size per SRA: 999,999,999,999,999 bp

Processing SRA ID: SRR18843778
spot_length cannot be obtained directly from metadata. Using total_bases/total_spots instead: 200
Library layout: single
Number of reads: 33,493,511
Single/Paired read length: 200 bp
Total bases: 6,698,702,200 bp
Processing SRR18843778 as publicly available data from SRA.
Previously-downloaded sra file was not detected. New sra file will be downloaded.
Trying to fetch SRR18843778 from AWS: https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR18843778/SRR18843778
SRA file was downloaded with urllib.request from AWS
Total sampled bases: 6,698,702,200 bp
Command: parallel-fastq-dump -t 4 --minReadLen 25 --qual-filter-1 --skip-technical --split-3 --clip --gzip --outdir /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778 --tmpdir /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778 --minSpotId 1 --maxSpotId 33493511 -s /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
getfastq output not found in: /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778, layout = single
Skipping. If you wish to obtain the .fastq file(s), run: getfastq --id SRR18843778
Traceback (most recent call last):
  File "/Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 374, in <module>
    args.handler(args)
  File "/Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/amalgkit", line 34, in command_getfastq
    getfastq_main(args)
  File "/Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 844, in getfastq_main
    metadata = sequence_extraction_1st_round(args, sra_stat, metadata, g)
  File "/Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 692, in sequence_extraction_1st_round
    metadata = sequence_extraction(args, sra_stat, metadata, g, start, end)
  File "/Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 665, in sequence_extraction
    metadata = run_fastp(sra_stat, args, sra_stat['output_dir'], metadata)
  File "/Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/getfastq.py", line 384, in run_fastp
    inext = get_newest_intermediate_file_extension(sra_stat, work_dir=output_dir)
  File "/Volumes/kfT7/Dropbox/repos/amalgkit/amalgkit/util.py", line 452, in get_newest_intermediate_file_extension
    raise FileNotFoundError
FileNotFoundError
parallel-fastq-dump stdout:
Read 8373377 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
Written 8373377 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
Read 8373377 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
Written 8373377 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
Read 8373380 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
Written 8373380 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
Read 8373377 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
Written 8373377 spots for /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra

parallel-fastq-dump stderr:
2023-05-30 13:54:05,182 - SRR ids: ['/Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra']
2023-05-30 13:54:05,182 - extra args: ['--minReadLen', '25', '--qual-filter-1', '--skip-technical', '--split-3', '--clip', '--gzip']
2023-05-30 13:54:05,184 - tempdir: /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/pfd_8qh81lma
2023-05-30 13:54:05,184 - CMD: sra-stat --meta --quick /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
2023-05-30 13:54:05,310 - /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra spots: 33493511
2023-05-30 13:54:05,310 - blocks: [[1, 8373377], [8373378, 16746754], [16746755, 25120131], [25120132, 33493511]]
2023-05-30 13:54:05,310 - CMD: fastq-dump -N 1 -X 8373377 -O /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/pfd_8qh81lma/0 --minReadLen 25 --qual-filter-1 --skip-technical --split-3 --clip --gzip /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
2023-05-30 13:54:05,331 - CMD: fastq-dump -N 8373378 -X 16746754 -O /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/pfd_8qh81lma/1 --minReadLen 25 --qual-filter-1 --skip-technical --split-3 --clip --gzip /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
2023-05-30 13:54:05,340 - CMD: fastq-dump -N 16746755 -X 25120131 -O /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/pfd_8qh81lma/2 --minReadLen 25 --qual-filter-1 --skip-technical --split-3 --clip --gzip /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra
2023-05-30 13:54:05,357 - CMD: fastq-dump -N 25120132 -X 33493511 -O /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/pfd_8qh81lma/3 --minReadLen 25 --qual-filter-1 --skip-technical --split-3 --clip --gzip /Volumes/kfT7/Dropbox/data/evolutionary_transcriptomics/20230505_amalgkit/getfastq/SRR18843778/SRR18843778.sra

kfuku52 commented 1 year ago

These are paired-end (PE) reads, but amalgkit treats them as single-end (SE).

kfuku52 commented 1 year ago

The layout is given as "single" in the metadata table.

Hego-CCTB commented 1 year ago

It's listed as "single" on the SRA as well. But when you look at the actual reads, it's pretty clearly paired. https://www.ncbi.nlm.nih.gov/sra/?term=SRR18843778

That's likely the cause of the issue. get_newest_intermediate_file_extension() handles the search for single and paired reads differently and doesn't find what it's looking for.

What happens when you manually change the layout to paired in the metadata?
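
To illustrate the suspected failure mode, here is a minimal sketch of a layout-dependent file lookup. It is not amalgkit's actual code: `find_fastq_extension` and `make_paired_dir` are hypothetical names, and the assumption that a single-end run is expected as `<id>.fastq.gz` while a paired-end run is expected as `<id>_1.fastq.gz` is inferred from the directory listing posted earlier in this thread.

```python
import os
import tempfile


def find_fastq_extension(work_dir, sra_id, layout):
    """Hypothetical, simplified lookup (not amalgkit's implementation)."""
    # Assumption: single-end output is <id>.fastq.gz,
    # paired-end output is <id>_1.fastq.gz (plus <id>_2.fastq.gz).
    suffix = '' if layout == 'single' else '_1'
    extension = '.fastq.gz'
    expected = sra_id + suffix + extension
    if expected in os.listdir(work_dir):
        return extension
    raise FileNotFoundError(expected)


def make_paired_dir(sra_id):
    """Create a temporary directory laid out like the listing in this issue."""
    work_dir = tempfile.mkdtemp()
    for name in (sra_id + '.sra', sra_id + '_1.fastq.gz', sra_id + '_2.fastq.gz'):
        open(os.path.join(work_dir, name), 'w').close()
    return work_dir


work_dir = make_paired_dir('SRR18843778')
try:
    # Layout taken from the metadata table: the unsuffixed file does not exist.
    find_fastq_extension(work_dir, 'SRR18843778', 'single')
except FileNotFoundError as e:
    print('single lookup failed, missing:', e)
# With the layout corrected to paired, the _1 file is found.
print('paired lookup found:', find_fastq_extension(work_dir, 'SRR18843778', 'paired'))
```

Under these assumptions, the "single" lookup searches for a file that `--split-3` never wrote for this run, which would be consistent with the `FileNotFoundError` in both tracebacks.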

kfuku52 commented 1 year ago

> What happens when you manually change the layout to paired in the metadata?

It should work correctly with that change, but I will update amalgkit to detect this layout mismatch automatically.
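
One possible shape for such a check, as a rough sketch only: inspect the files that the extraction step actually wrote and let them override the metadata layout when they disagree. The function name `detect_layout_from_files` and the file-name patterns are assumptions for illustration, not the fix that was eventually implemented in amalgkit.

```python
import os


def detect_layout_from_files(work_dir, sra_id, metadata_layout):
    """Hypothetical check: infer the layout from fastq files on disk."""
    files = os.listdir(work_dir)
    # Assumption: mates are written as <id>_1.* and <id>_2.*,
    # and single-end reads as <id>.*fastq* (the .sra file is ignored).
    has_mate1 = any(f.startswith(sra_id + '_1.') for f in files)
    has_mate2 = any(f.startswith(sra_id + '_2.') for f in files)
    has_single = any(f.startswith(sra_id + '.') and 'fastq' in f for f in files)
    if has_mate1 and has_mate2:
        detected = 'paired'
    elif has_single:
        detected = 'single'
    else:
        # Nothing recognizable on disk: fall back to the metadata value.
        detected = metadata_layout
    if detected != metadata_layout:
        print('Layout mismatch for {}: metadata says {}, files suggest {}.'.format(
            sra_id, metadata_layout, detected))
    return detected
```

For the directory shown earlier in this thread, a call such as `detect_layout_from_files(output_dir, 'SRR18843778', 'single')` would return `'paired'`, so the downstream lookup could proceed with the corrected layout instead of raising.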