kfuku52 / amalgkit

RNA-seq data amalgamation for a large-scale evolutionary transcriptomics
BSD 3-Clause "New" or "Revised" License
7 stars 1 forks source link

amalgkit quant terminating without error #102

Closed Hego-CCTB closed 2 years ago

Hego-CCTB commented 2 years ago

amalgkit quant is terminating without error after encountering file-types it doesn't expect/can't use. I found 2 different cases so far:

For the first case, there is already the function check_layout_mismatch(), but it doesn't seem to be applied properly, or amalgkit terminates before calling the function (I don't find the line "This sample will be treated as single-end sequencing." in the amalgkitSTDOUT or ERR.

I have a hunch that get_newest_intermediate_file_extension() causes the termination somehow, but I need to investigate.

kfuku52 commented 2 years ago

OK, please let me know if it turned out to be difficult to fix the layout mismatch. .safely_removed should be generated when quant is done, but it would be appropriate to send an informative error message and die when a user tries to rerun quant but not getfastq.

Hego-CCTB commented 2 years ago

Running this locally on my PC, this error is not reproducible and amalgkit correctly treats the sample as a single-end read and moves on. Which may point to either an issue with the installation, Julia, or SLURM. I'll try a reinstall first.

Hego-CCTB commented 2 years ago

Okay, I checked my ideas above, but it's none of the three. I managed to reproduce the issue locally, though. It seems to be an issue with the --redo option.

Curiously, when --redo == yes, amalgkit still tries to quantify the sample using the empty .safely_removed file. This is a different issue, which has to be addressed, but at least amalgkit goes on to the next sample and doesn't terminate.

Species: Camellia sinensis
Run ID: SRR15183486
looking for index folder in  /Users/s229181/Desktop/projects/sanity_test_wd/
Index file found: /Users/s229181/Desktop/projects/sanity_test_wd/Index/Camellia_sinensis_cds.idx
Output file detected: /Users/s229181/Desktop/projects/sanity_test_wd/quant/SRR15183486/SRR15183486_abundance.tsv
Continued. The output will be overwritten. Set "--redo no" to exit.
spot_length cannot be obtained directly from the metadata.
Using total_bases/total_spots instead: 150
getfastq safely_removed flag was detected. `amalgkit quant` has been completed in this sample: /Users/s229181/Desktop/projects/sanity_test_wd/getfastq/SRR15183486
/Users/s229181/Desktop/projects/sanity_test_wd/getfastq/SRR15183486/SRR15183486.amalgkit.fastq.gz.safely_removed
Input fastq detected: /Users/s229181/Desktop/projects/sanity_test_wd/getfastq/SRR15183486/SRR15183486.amalgkit.fastq.gz.safely_removed
single end reads detected. Proceeding in single mode
Nominal length in metadata is unusually small (nan). Setting it to 200.
Fragment length set to: 200
Fragment length standard deviation set to: 20.0
Command: kallisto quant --threads 8 --index /Users/s229181/Desktop/projects/sanity_test_wd/Index/Camellia_sinensis_cds.idx -o /Users/s229181/Desktop/projects/sanity_test_wd/quant/SRR15183486 --single -l 200 -s 20.0 /Users/s229181/Desktop/projects/sanity_test_wd/getfastq/SRR15183486/SRR15183486.amalgkit.fastq.gz.safely_removed
Single-end fastq was detected even though layout = paired
This sample will be treated as single-end sequencing.
kallisto stdout:

kallisto stderr:

[quant] fragment length distribution is truncated gaussian with mean = 200, sd = 20
[index] k-mer length: 31
[index] number of targets: 52,620
[index] number of k-mers: 44,835,344
[index] number of equivalence classes: 215,417
[quant] running in single-end mode
[quant] will process file 1: /Users/s229181/Desktop/projects/sanity_test_wd/getfastq/SRR15183486/SRR15183486.amalgkit.fastq.gz.safely_removed
[quant] finding pseudoalignments for the reads ... done
[quant] processed 0 reads, 0 reads pseudoaligned
[~warn] no reads pseudoaligned.
[   em] quantifying the abundances ... done
[   em] the Expectation-Maximization algorithm ran for 52 rounds
[~warn] Warning, zero reads pseudoaligned check your input files and index

When --redo == no, amalgkit just terminates without a useful error message. It just looks like this:

Species: Camellia sinensis
Run ID: SRR15183486
looking for index folder in  /Users/s229181/Desktop/projects/sanity_test_wd/
Index file found: /Users/s229181/Desktop/projects/sanity_test_wd/Index/Camellia_sinensis_cds.idx
Output file detected: /Users/s229181/Desktop/projects/sanity_test_wd/quant/SRR15183486/SRR15183486_abundance.tsv
Exiting. Set "--redo yes" to overwrite.

Process finished with exit code 0
Hego-CCTB commented 2 years ago
def check_quant_output(sra_id, output_dir, args):
    out_path = os.path.join(output_dir, sra_id + '_abundance.tsv')
    is_output_present = os.path.exists(out_path)
    if is_output_present:
        print('Output file detected: {}'.format(out_path))
        if args.redo == 'yes':
            print('Continued. The output will be overwritten. Set "--redo no" to exit.')
            return None
        else:
            print('Exiting. Set "--redo yes" to overwrite.')
            sys.exit()
    else:
        print('Output file was not detected: {}'.format(out_path))
        return None

Looks like the sys.exit() line is the culprit for the forced stop and looks like it was intended at some point. I don't think this is useful behaviour anymore and I'd just remove that line and change the print message.

Alternatively, I can introduce an argument like --ignore_interrupts yes|no or something similar, which would ignore sys.exit(). As I'm processing thousands of SRA entries, I don't want amalgkit to terminate anytime it finds something weird with a file. I'd much rather log the cases which throw an error or a forced exit in a file or in the STDOUT and move on to the samples which work instead of terminating immediately.

Hego-CCTB commented 2 years ago

Okay, I've changed the behaviour of some things on my side:

I'll push the update if you're okay with it. Note, that these changes would not include an --ignore_interrupts option and always ignore errors from missing or unusal/unexpected files.

kfuku52 commented 2 years ago

It sounds good if --redo yes/no works well. With non-deleted getfastq outputs, --redo yes should start quant again even if there is a quant output already, but --redo no should skip it. Please push your commits if this behavior seems to be preserved.

Hego-CCTB commented 2 years ago

Had to double check, but the intended --redo behaviour is still preserved. https://github.com/kfuku52/amalgkit/commit/095fb55702671035db56fc9314b0ded7a2bdbc0d and https://github.com/kfuku52/amalgkit/commit/9b5a635be86b7df9be47f9a71f23e7eae63bbaa0 (cleaned up unused import and comment I missed)

kfuku52 commented 2 years ago

Thank you!