Closed mossmatters closed 8 years ago
This issue for ARC contains a similar problem: https://github.com/ibest/ARC/issues/22
Note that the real reason is not that SPAdes crashes when coverage is low. Here you're only having less than 10 k-mers in the input reads. Assembling 5 or so reads is not a pretty good idea in any case.
This may be true for whole-genome assemblies, but not for single genes, as ARC is doing and we are attempting with HybPiper. In addition, the information is there but we are not capturing it because SPAdes does not fail gracefully at high k-mer values when it was perfectly fine at lower k-mer values.
The above example finishes fine if the higher level k-mers are excluded, for instance:
spades.py --12 gene002_interleaved.fasta -o test2 --only-assemble -t 4 -k 21,33,55,77
The contigs generated by the lower k-mer values are "probably fine." HybPiper should still generate a contig from these assemblies, but right now it will produce no sequence at all if any of the k-mer values fail.
I want HybPiper to still rely on the automatic k-mer detection in SPAdes, so I don't want HybPiper to set the k-mer values manually. However, the observations that it does the k-mers in increasing numerical order, and that the higher values are the ones that fail, lends itself to a solution:
The k-mers that HybPiper will re-try are simply everything except the highest k-mer found in the genename_spades
directory. I will put this in a separate script because reads_first.py
is getting very large and slightly unwieldy.
I have fixed this bug by adding a script: spades_runner.py
. This script now handles all the assembling duties, and will automatically detect whether some of the K-values have failed by checking for the contigs.fasta file. The script will compile a list of all genes for which that file is missing into failed_spades.txt
and attempt to re-run all of them with one fewer K value.
A list of the new commands is saved to redo_spades_commands.txt
. Any further failures, including genes for which not even the lowest K-mer value was successful, are saved to spades_duds.txt
I have noticed that the contig files within each K-mer can still be empty, resulting in an empty contigs.fasta
, so spades_runner.py
also checks for this and does not copy it to the main directory.
I discovered that SPAdes will not output a contigs.fasta file if one of the K-mer searches fails, even if the other K-mer searches were successful:
SPAdes will auto-detect the k-mer values from the read length-- here these are 300 bp paired-end reads, so SPAdes attempts k-mers of 21,33,55,77,99, and 127. SPAdes attempts the assemblies in k-mer order.
In the above example, a
final_contigs.fasta
file appears in the sub-directories K21, K33, K55, and K77. Only the initialconfigs
directory appears in K99, and the K127 directory is missing.