kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License

amalgkit getfastq on folder with already-processed samples is re-running #135

Closed: docxology closed this issue 1 year ago

docxology commented 1 year ago

DRR333182 (and the other two runs) have already been fully processed by amalgkit, as the .safely_removed files seem to indicate.

So this is unexpected behavior from getfastq, in the current version.

Processing SRA ID: DRR333182
pigz found. It will be used for compression/decompression in read name formatting.
2023-06-25 13:59:03.490591: Loading metadata from: /media/tet/56D80A6E7A225267/Transcriptome/temp_metadata_4.tsv
Single-end fastq was generated even though layout in the metadata = paired. This sample will be treated as single-end reads: DRR333182
Deleting old intermediate file: /media/tet/56D80A6E7A225267/Transcriptome/getfastq/DRR333182/DRR333182.fastq.gz
Deleting old intermediate file: /media/tet/56D80A6E7A225267/Transcriptome/getfastq/DRR333182/DRR333182_1.amalgkit.fastq.gz.safely_removed
Deleting old intermediate file: /media/tet/56D80A6E7A225267/Transcriptome/getfastq/DRR333182/DRR333182_2.amalgkit.fastq.gz.safely_removed
Library layout: single
Number of reads: 11,976,464
Single/Paired read length: 50 bp
Total bases: 1,209,834,904 bp
Processing DRR333182 as publicly available data from SRA.
Previously-downloaded sra file was not detected. New sra file will be downloaded.
Trying to fetch DRR333182 from AWS: https://sra-pub-run-odp.s3.amazonaws.com/sra/DRR333182/DRR333182
Individual SRA size of DRR333183: 1,715,542,686.0 bp
Number of SRAs to be processed: 1
Total target size (--max_bp): 999,999,999,999,999 bp
The sum of SRA sizes: 1,715,542,686.0 bp
Target size per SRA: 999,999,999,999,999 bp

Processing SRA ID: DRR333183
Single-end fastq was generated even though layout in the metadata = paired. This sample will be treated as single-end reads: DRR333183
Deleting old intermediate file: /media/tet/56D80A6E7A225267/Transcriptome/getfastq/DRR333183/DRR333183.fastq.gz
Deleting old intermediate file: /media/tet/56D80A6E7A225267/Transcriptome/getfastq/DRR333183/DRR333183_1.amalgkit.fastq.gz.safely_removed
Deleting old intermediate file: /media/tet/56D80A6E7A225267/Transcriptome/getfastq/DRR333183/DRR333183_2.amalgkit.fastq.gz.safely_removed
Library layout: single
Number of reads: 16,979,184
Single/Paired read length: 50 bp
Total bases: 1,715,542,686 bp
Processing DRR333183 as publicly available data from SRA.
Previously-downloaded sra file was not detected. New sra file will be downloaded.
Trying to fetch DRR333183 from AWS: https://sra-pub-run-odp.s3.amazonaws.com/sra/DRR333183/DRR333183
pigz found. It will be used for compression/decompression in read name formatting.
2023-06-25 13:59:03.516199: Loading metadata from: /media/tet/56D80A6E7A225267/Transcriptome/temp_metadata_6.tsv
pigz found. It will be used for compression/decompression in read name formatting.
2023-06-25 13:59:03.518451: Loading metadata from: /media/tet/56D80A6E7A225267/Transcriptome/temp_metadata_1.tsv
Individual SRA size of DRR333181: 1,092,806,465.0 bp
Number of SRAs to be processed: 1
Total target size (--max_bp): 999,999,999,999,999 bp
The sum of SRA sizes: 1,092,806,465.0 bp
Target size per SRA: 999,999,999,999,999 bp

Hego-CCTB commented 1 year ago

Hm. Yeah, there are some strange things happening here. It looks like the samples were previously processed correctly as paired samples, but on a re-run amalgkit thinks they are single-end samples and downloads them again.

We've recently pushed an update that dealt with falsely flagged samples (i.e. on SRA the samples were tagged as paired, but the data was actually single-end when downloaded). Maybe it has something to do with that update.

kfuku52 commented 1 year ago

@Hego-CCTB Will you be undertaking this task?

Hego-CCTB commented 1 year ago

Yeah, I'll take care of this.

Hego-CCTB commented 1 year ago

I think I see where the problem comes from:

ls -ltrh getfastq/DRR333182/

-rw-r--r-- 1 s229181 users 499K Jun 26 12:30 DRR333182.fastq.gz
-rw-r--r-- 1 s229181 users 381M Jun 26 12:32 DRR333182_1.amalgkit.fastq.gz
-rw-r--r-- 1 s229181 users 387M Jun 26 12:32 DRR333182_2.amalgkit.fastq.gz

For some reason, there's a small third fastq file being dumped. Quantification runs correctly, and removes DRR333182_1.amalgkit.fastq.gz and DRR333182_2.amalgkit.fastq.gz leaving DRR333182.fastq.gz in the getfastq directory.

On a re-run of getfastq, amalgkit finds the lone DRR333182.fastq.gz file, flags the sample as a mislabeled single-end sample, and we end up in the reported situation.
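The failure mode described above can be reproduced with a small sketch. It assumes a simplified, hypothetical detection function (`naive_layout`, not amalgkit's actual `detect_layout_from_file()`) that looks only at which fastq files are present; the file names follow the directory listing and log messages in this thread.

```python
def naive_layout(run_id, filenames):
    """Guess the library layout from fastq file names alone.

    Hypothetical, simplified stand-in for illustration; it is not
    amalgkit's real detection code.
    """
    names = set(filenames)
    has_pair = all(f"{run_id}_{i}.amalgkit.fastq.gz" in names for i in (1, 2))
    has_single = f"{run_id}.fastq.gz" in names
    if has_pair:
        return "paired"
    if has_single:
        return "single"
    return "unknown"


# Directory contents right after getfastq: two mates plus the small third file.
after_getfastq = [
    "DRR333182.fastq.gz",
    "DRR333182_1.amalgkit.fastq.gz",
    "DRR333182_2.amalgkit.fastq.gz",
]

# Directory contents after quant: the mates are gone, only markers remain.
after_quant = [
    "DRR333182.fastq.gz",
    "DRR333182_1.amalgkit.fastq.gz.safely_removed",
    "DRR333182_2.amalgkit.fastq.gz.safely_removed",
]

print(naive_layout("DRR333182", after_getfastq))
print(naive_layout("DRR333182", after_quant))
```

Under these assumptions the first call reports a paired layout and the second reports single-end, which matches the "Single-end fastq was generated even though layout in the metadata = paired" message in the log.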

This is what one of the reads in DRR333182.fastq looks like in the SRA browser:

[image: screenshot of the read in the SRA browser]

While it is a read-pair, the second read looks strange and I assume is not being dumped. Hence we get a third file.

@kfuku52 What do you think we should do here? Ignore the third fastq file and delete it, have quant quantify all three fastq files, or try to get fastq-dump to dump these reads into a read pair instead of a single file? DRR333182.fastq contains ~17,000 "single" reads, which is ~0.1% of the total reads.

kfuku52 commented 1 year ago

The 3rd fastq represents unpairable reads. We should delete it, or suppress the file generation if there is a fastq-dump option for it.

kfuku52 commented 1 year ago

AMALGKIT used to have a code chunk for the 3rd fastq deletion, but I removed it last week to fix a bug in https://github.com/kfuku52/amalgkit/issues/133. We might have to recover the deleted code, not in the original place in detect_layout_from_file(), but as a separate function.
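As a rough illustration of what such a separate function could look like, here is a minimal sketch. The function name `remove_unpaired_fastq` and its parameters are hypothetical and are not taken from the amalgkit code base; the file naming follows the directory listing earlier in this thread. It deletes the unpaired file only when both mate files are present, so a genuinely single-end sample is left alone.

```python
import tempfile
from pathlib import Path


def remove_unpaired_fastq(sample_dir, run_id, paired_suffix=".amalgkit.fastq.gz"):
    """Delete <run_id>.fastq.gz if both mate files exist (hypothetical helper).

    Returns True if the unpaired file was removed, False otherwise.
    """
    sample_dir = Path(sample_dir)
    unpaired = sample_dir / f"{run_id}.fastq.gz"
    mates = [sample_dir / f"{run_id}_{i}{paired_suffix}" for i in (1, 2)]
    # Only treat the third file as disposable when both mates are present.
    if unpaired.is_file() and all(m.is_file() for m in mates):
        unpaired.unlink()
        return True
    return False


# Small demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "DRR333182.fastq.gz",
        "DRR333182_1.amalgkit.fastq.gz",
        "DRR333182_2.amalgkit.fastq.gz",
    ):
        (Path(tmp) / name).touch()
    removed = remove_unpaired_fastq(tmp, "DRR333182")
    remaining = sorted(p.name for p in Path(tmp).iterdir())

print(removed)
print(remaining)
```

Keeping this as its own step, rather than inside layout detection, means the cleanup cannot change the detected layout as a side effect.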

kfuku52 commented 1 year ago

I will take care of it.

kfuku52 commented 1 year ago

The bug should be fixed now. You might have to redo the failed jobs from getfastq.

docxology commented 1 year ago

Thank you. I am now on v0.9.26 and reprocessing failed jobs.

docxology commented 1 year ago

With 0.9.26 it still seems like it is deleting ".amalgkit.fastq.gz.safely_removed" as an old intermediate file.

Let me know what other information to provide or things to try, thank you.


$ amalgkit getfastq
AMALGKIT version: 0.9.26
AMALGKIT command: /home/tet/miniconda3/bin/amalgkit getfastq
AMALGKIT bug report: https://github.com/kfuku52/amalgkit/issues
amalgkit getfastq: start
pigz found. It will be used for compression/decompression in read name formatting.
2023-07-07 05:18:18.153039: Loading metadata from: /media/tet/bioinformatics/Transcriptome/metadata/metadata.tsv
Individual SRA size of DRR333187: 1,080,317,754.0 bp
.....................
Number of SRAs to be processed: 4,334
Total target size (--max_bp): 999,999,999,999,999 bp
The sum of SRA sizes: 24,229,564,123,454.0 bp
Target size per SRA: 230,733,733,271 bp

Processing SRA ID: DRR333187
Single-end fastq was generated even though layout in the metadata = paired. This sample will be treated as single-end reads: DRR333187
Deleting old intermediate file: /media/tet/bioinformatics/Transcriptome/getfastq/DRR333187/DRR333187.fastq.gz
Deleting old intermediate file: /media/tet/bioinformatics/Transcriptome/getfastq/DRR333187/DRR333187_1.amalgkit.fastq.gz.safely_removed
Deleting old intermediate file: /media/tet/bioinformatics/Transcriptome/getfastq/DRR333187/DRR333187_2.amalgkit.fastq.gz.safely_removed
Library layout: single
Number of reads: 10,692,533
Single/Paired read length: 50 bp
Total bases: 1,080,317,754 bp
Processing DRR333187 as publicly available data from SRA.
Previously-downloaded sra file was not detected. New sra file will be downloaded.
Trying to fetch DRR333187 from AWS: https://sra-pub-run-odp.s3.amazonaws.com/sra/DRR333187/DRR333187
^Z
[3]+ Stopped amalgkit getfastq
(base) ┌─[✗]─[tet@tetra]─[/media/tet/bioinformatics/Transcriptome]
└──╼ $

kfuku52 commented 1 year ago

@docxology Could you try the latest version I've just pushed (v0.9.28)? In my env, v0.9.28 correctly skips DRR333187 if the .safely_removed files are available.
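For readers who want to check which runs would be skipped under this rule, here is a minimal sketch. It is not the v0.9.28 implementation; the helper names are hypothetical, and the marker file names are assumed from the log messages above (the unsuffixed single-end marker name in particular is a guess). `NEW_RUN` is a placeholder ID, not a real accession.

```python
import tempfile
from pathlib import Path


def has_safely_removed_markers(sample_dir, run_id):
    """Return True if any .safely_removed marker for run_id exists (hypothetical)."""
    sample_dir = Path(sample_dir)
    candidates = [
        f"{run_id}.amalgkit.fastq.gz.safely_removed",    # assumed single-end name
        f"{run_id}_1.amalgkit.fastq.gz.safely_removed",  # seen in the logs above
        f"{run_id}_2.amalgkit.fastq.gz.safely_removed",  # seen in the logs above
    ]
    return any((sample_dir / name).is_file() for name in candidates)


def runs_to_process(getfastq_dir, run_ids):
    """Keep only the runs that have no .safely_removed markers yet."""
    root = Path(getfastq_dir)
    return [r for r in run_ids if not has_safely_removed_markers(root / r, r)]


# Demonstration: DRR333187 has markers, the placeholder NEW_RUN does not.
with tempfile.TemporaryDirectory() as tmp:
    done_dir = Path(tmp) / "DRR333187"
    done_dir.mkdir()
    for i in (1, 2):
        (done_dir / f"DRR333187_{i}.amalgkit.fastq.gz.safely_removed").touch()
    todo = runs_to_process(tmp, ["DRR333187", "NEW_RUN"])

print(todo)
```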

docxology commented 1 year ago

Yes, this works as far as I can tell, thank you.

kfuku52 commented 1 year ago

Great to hear it worked!