mossmatters / HybPiper

Recovering genes from targeted sequence capture data
GNU General Public License v3.0
111 stars 45 forks source link

SPAdes stack trace failure with large K-mer values #5

Closed mossmatters closed 8 years ago

mossmatters commented 8 years ago

I discovered that SPAdes will not output a contigs.fasta file if one of the K-mer searches fails, even if the other K-mer searches were successful:

spades.py --12 gene002_interleaved.fasta -o test1 --only-assemble -t 4

...snip...
== Running assembler: K99

...snip...

=== Stack Trace ===
[0x405b7a]
[0x619842]
[0x5f6173]
[0x44a603]
[0x402cde]
[0x41ad76]
[0x7b8170]
[0x4001b9]
Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep3/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 185. Message 'Invalid kmer coverage histogram'.
Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep3/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 185. Message 'Invalid kmer coverage histogram'.
spades: /home/yasha/gitrep3/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp:185: void cov_model::KMerCoverageModel::Fit(): Assertion `cov_.size() > 10' failed.

== Error ==  system call for: "['/usr/local/bin/spades', '/scratch/mjohnson/test/gene002/test1/K99/configs/config.info']" finished abnormally, err code: -6

SPAdes will auto-detect the k-mer values from the read length-- here these are 300 bp paired-end reads, so SPAdes attempts k-mers of 21,33,55,77,99, and 127. SPAdes attempts the assemblies in k-mer order.

In the above example, a final_contigs.fasta file appears in the sub-directories K21, K33, K55, and K77. Only the initial configs directory appears in K99, and the K127 directory is missing.

mossmatters commented 8 years ago

This issue for ARC contains a similar problem: https://github.com/ibest/ARC/issues/22

Note that the real reason is not that SPAdes crashes when coverage is low. Here you're only having less than 10 k-mers in the input reads. Assembling 5 or so reads is not a pretty good idea in any case.

This may be true for whole-genome assemblies, but not for single genes, as ARC is doing and we are attempting with HybPiper. In addition, the information is there but we are not capturing it because SPAdes does not fail gracefully at high k-mer values when it was perfectly fine at lower k-mer values.

The above example finishes fine if the higher level k-mers are excluded, for instance:

spades.py --12 gene002_interleaved.fasta -o test2 --only-assemble -t 4 -k 21,33,55,77

The contigs generated by the lower k-mer values are "probably fine." HybPiper should still generate a contig from these assemblies, but right now it will produce no sequence at all if any of the k-mer values fail.

mossmatters commented 8 years ago

I want HybPiper to still rely on the automatic k-mer detection in SPAdes, so I don't want HybPiper to set the k-mer values manually. However, the observations that it does the k-mers in increasing numerical order, and that the higher values are the ones that fail, lends itself to a solution:

  1. Run SPAdes as normal from within reads_first.py
  2. Run a second script to collect info about unsuccessful assemblies and try them again with adjusted k-mer values.

The k-mers that HybPiper will re-try are simply everything except the highest k-mer found in the genename_spades directory. I will put this in a separate script because reads_first.py is getting very large and slightly unwieldy.

mossmatters commented 8 years ago

I have fixed this bug by adding a script: spades_runner.py. This script now handles all the assembling duties, and will automatically detect whether some of the K-values have failed by checking for the contigs.fasta file. The script will compile a list of all genes for which that file is missing into failed_spades.txt and attempt to re-run all of them with one fewer K value.

A list of the new commands is saved to redo_spades_commands.txt. Any further failures, including genes for which not even the lowest K-mer value was successful, are saved to spades_duds.txt

I have noticed that the contig files within each K-mer can still be empty, resulting in an empty contigs.fasta, so spades_runner.py also checks for this and does not copy it to the main directory.