SPAdes assembly fails on low coverage (<=10X) contigs (killing ARC)

brantfaircloth commented 11 years ago

Hi there,

Excellent work so far on ARC - I'm really looking forward to using this! In playing around with the approach, I've run into an issue...

When assembling with SPAdes (v.2.4.0), there appears to be a problem where SPAdes/ARC choke on very low-coverage contigs. The console log from ARC is:

[2013-07-02 14:57:33,035 INFO 6413] Reading config file...
[2013-07-02 14:57:33,043 INFO 6413] Setting up working directories and building indexes...
[2013-07-02 14:58:50,929 INFO 6413] Setting up multiprocessing...
[2013-07-02 14:58:50,929 INFO 6413] Starting...
[2013-07-02 14:58:51,465 INFO 6420] Running bowtie2 for Sample1
[2013-07-02 14:58:51,471 INFO 6420] Calling bowtie2-build for sample: Sample1
[2013-07-02 14:58:51,472 INFO 6420] bowtie2-build -f /path/to/uce-probes-arc-format.fasta /path/to/working_Sample1/idx/idx
[2013-07-02 14:58:52,066 INFO 6420] Calling bowtie2 for sample: Sample1
[2013-07-02 14:58:52,066 INFO 6420] nice -n 19 bowtie2 -I 0 -X 1500 --local -p 12 -x /path/to/working_Sample1/idx/idx -1 /path/to/reads/gallus-gallus-READ1.fastq -2 /path/to/reads/gallus-gallus-READ2.fastq -U /path/to/reads/gallus-gallus-READ-singleton.fastq -S /path/to/working_Sample1/mapping.sam
[2013-07-02 14:59:25,197 INFO 6420] Sample: Sample1, Processed 2643335 lines in 5.86183905602 seconds.
[2013-07-02 14:59:25,637 INFO 6420] Running splitreads for Sample1
[2013-07-02 14:59:28,473 INFO 6420] Split 84 reads for sample Sample1 target uce1119 in 2.83518695831 seconds
[2013-07-02 14:59:29,083 INFO 6421] Running Spades for sample: Sample1 target: uce1119
[2013-07-02 14:59:29,094 INFO 6421] Calling spades for sample: Sample1 target uce1119
[2013-07-02 14:59:29,094 INFO 6421] spades.py -t 1 -1 /path/to/working_Sample1/t__001379/PE1.fastq -2 /path/to/working_Sample1/t__001379/PE2.fastq -s /path/to/working_Sample1/t__001379/SE.fastq -o /path/to/working_Sample1/t__001379/assembly
[2013-07-02 14:59:30,401 INFO 6420] Split 4 reads for sample Sample1 target uce4473 in 4.76373004913 seconds
[2013-07-02 14:59:30,554 INFO 6423] Running Spades for sample: Sample1 target: uce4473
[2013-07-02 14:59:30,555 ERROR 6423] An unhandled exception occured
Process ProcessRunner-4:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/home/bcf/git/ARC/bin/../ARC/process_runner.py", line 66, in run
    raise e
KeyError: 'assembly_SE'
[2013-07-02 14:59:30,590 ERROR 6413] Fatal error returned from ProcessRunner-4
[2013-07-02 14:59:30,590 ERROR 6413] Terminating processes
[2013-07-02 14:59:30,590 ERROR 6413] A fatal error was encountered.
    'Unrecoverable error'

I figured this might be related to the 4 reads split into the PE files for the target, so I ran SPAdes against just those reads split into t__002555 (the directory containing the PE files for the 4 reads) with:

spades.py -t 1 -1 /path/to/working_Sample1/t__002555/PE1.fastq -2 /path/to/working_Sample1/t__002555/PE2.fastq -o /path/to/working_Sample1/t__002555/assembly

The error returned from SPAdes was (snipped):

Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 192. Message 'Invalid kmer coverage histogram'.
Verification of expression 'cov_.size() > 10' failed in function 'void cov_model::KMerCoverageModel::Fit()'. In file '/home/yasha/gitrep/algorithmic-biology/assembler/src/debruijn/kmer_coverage_model.cpp' on line 192. Message 'Invalid kmer coverage histogram'.
Exception caught std::exception

I looked around for an easy way to turn off the coverage limitation for the assembly to see if that would allow the process to continue, but SPAdes seems to have no CLI flags to do that (at least not readily apparent).

Thanks very much and keep up the excellent work!

best, b

samhunter commented 11 years ago

Hi Brant, thanks for the detailed error report. Currently the assembler isn't called if no reads are mapped, but I hadn't run into the situation where only a handful are mapped and cause a crash like this.

What do you think of this solution? Use a minimum-mapped-reads threshold which defaults to ~50 reads (configurable in the ARC_config.txt) which will cause the assembly step to be skipped and reads to be used as mapping targets in the next iteration. In other words treat the reads as "contigs" and map against them instead of an assembled set of contigs. This is already partially implemented and can be enabled with map_against_reads=True in the ARC_config.txt, but doing that will still attempt to run the assembler which wouldn't solve your problem (see issue #19).

Also, please let us know how the program performs for you. It is still under heavy development. We are planning a number of enhancements/fixes/modifications and would greatly appreciate feedback and suggestions!

thanks,

Sam

brantfaircloth commented 11 years ago

Hi Sam,

Thanks for super-quick response! I think the suggested fix is perfectly reasonable (and it would be nice to control the minimum number of reads mapped at the level of the ARC config file).

The option to just treat the reads as contigs seems a good stopgap. I sort of wonder if there may be cases in which this still doesn't always fix the problem because you've really not recovered many reads for a given target - for instance, when you have a bad capture/tricky target/low coverage you may always be under 50 reads for a target or two (or many), particularly when dealing with samples that were multiplexed where some libraries just get crappy coverage).

It might also be nice to completely ignore assembly of those targets in the cases where the number of reads mapping over is below some minimum # of reads in the config file.

Finally, another approach that would be okay by me would be a graceful failure/WARN in very low coverage cases with output to the log that a given target has been dropped due to low-read-count or subsequent assembly failure (due to coverage) by SPAdes So, to rephrase, what I mean is gracefully fail any time there's an issue at a target or two - rather than being explicitly dependent on a minimum read count. Or, perhaps run, try reads as contigs, and then gracefully fail if you still get too few mapping. In the case of automating assemblies of capture data across many targets and many individuals, it might just be good enough to know you missed some targets and move along.

The potential trouble with explicitly defining the min read count is that, in some cases, if your targets are of variable size, then the minimum number of reads needed to obtain a given coverage likely scales with target size. Of course, you could take a proportion as input and write some code to scale the minimum proportion to the target size to ensure approx. >= 10X - but just dropping those targets works for me.

Otherwise, ARC appears to be working swell! I basically threw it some pretty ugly data to see what would happen. I'll be sure to let you know if I bump up against anything else... I have lots of different types of data from lots of different critters to test!

One thing I haven't tried is running against gzipped fastq files, since I didn't see that as an option in the README. That would be a great addition, and I'd be happy to add that as a feature request (separate from this issue #).

All the best and thanks much! -b

brantfaircloth commented 11 years ago

Hi Sam,

Here's another part of the same issue. At line 120 of assembler.py:

if 'SE' in self.params:
    args += ['-s', self.params['assembly_SE']]

The 'assembly_SE' is throwing a key error. The "SE" key is present in the dictionary, but "assembly_SE" is not (this may be occurring when there are no SE data for a read group/pool).

Changing to:

if 'assembly_SE' in self.params:
    args += ['-s', self.params['assembly_SE']]

appears to fix part of the issue... running a test right now.

samhunter commented 11 years ago

Thanks for reporting both of these bugs Brant, the most recent issue was left over from some previous changes I made to how I passed information around between processes, apparently I forgot to change that line.

The other problem (Spades crashing when coverage is low) should be fixed now with the following approach: I implemented better exception handling around the assembler calls. If the assembler fails, I write all of the reads out in lieu of assembled contigs and try to map against those in the next iteration. This process will continue until no more reads are incorporated, and if no contigs can be formed, the reads will be written out as contigs in the final contigs file.

I have added new testing targets and reads to the test_data. These result in only 2 reads mapped to a target in one case, and no reads in another, neither will crash either Spades or Newbler in my testing.

Also I found a nasty bug that was causing the output contigs to be overwritten on each cycle, so please update to the latest version to get rid of that.

Thanks again for reporting these bugs, and for your patience as the software matures.

Sam

asl commented 11 years ago

Note that the real reason is not that SPAdes crashes when coverage is low. Here you're only having less than 10 k-mers in the input reads. Assembling 5 or so reads is not a pretty good idea in any case.

msettles commented 11 years ago

"Assembling 5 or so reads is not pretty good idea."

That depends on the question, some of the contains we are assembling are exons and end up being only a few 100bp long. In this case 5 reads, particularly as a seed for the iterative process may be a very reasonable starting point.

Thanks for you comments Anton, if there were additional ways we could 'optimize' Spades for our particular situation that would be great, at the moment Newbler appears to perform better and faster.

Matt

Matt Settles, PhD Bioinformatics Scientist Director, Genomics Resources Core Institute for Bioinformatics and Evolutionary STudies (IBEST) University of Idaho Moscow, Idaho 83844

On Jul 21, 2013, at 3:17 AM, Anton Korobeynikov notifications@github.com<mailto:notifications@github.com> wrote:

Note that the real reason is not that SPAdes crashes when coverage is low. Here you're only have less than 10 k-mers in the input reads. Assembling 5 or so reads is not pretty good idea.

— Reply to this email directly or view it on GitHubhttps://github.com/ibest/ARC/issues/22#issuecomment-21307922.

asl commented 11 years ago

Well, ok. Let me slightly change my point. "Assembling 5 or so reads with whole-genome assembler is not a pretty good idea" :) Assemblers tend to make whole-dataset decisions, e.g. selecting various thresholds for removing erroneous reads. When you have 5 reads then everything can go wrong from this point of view.

Newbler, being an OLC assembler might be better suited for this task (however, it will surely fail when you will increase the coverage to 100x or so).

Looks like you need something which can do local re-assembly for you and maybe some early-days OLS / greedy assemblers will be even more suitable for this particular task. Surely, you will need whole-genome assembler if you have bunch of aligned stuff :)

samhunter commented 11 years ago

Hi Anton, thanks for your comments!

Yeah, there is a whole gamut of possibilities. We might have high-similarity targets with good mapping, where a local re-assembly of mapped reads would be very effective. In other cases we might have a few regions of moderate similarity where we need something more de-novo, and the reference can provide very little information. The assumptions certainly aren't the same as one would make with a whole-genome assembler, but as far as we've been able to tell, there are surprisingly few early-days OLS/greedy assemblers available (cap3 being the only one that seems to sometimes work, and it is horribly slow).

We ended up choosing Newbler and Spades after doing a number of tests in which they were the only two which performed quickly and built reasonable contigs using default parameters with small "ARC-like" sets of reads.

That being said, I think your characterization of Newbler is a little bit mistaken. In my tests, Newbler is around 10 times faster than Spades (mean run times of ~2 seconds vs 20-30 seconds respectively across ~20,000 assemblies), and sometimes handles coverage even up to 1000x without trouble. Furthermore, of the dozen or so assemblers I've tested, Newbler is the only one which handles 454 data reasonably well.

Anyway, I really like the Paired de Brujin Graph idea which is incorporated into SPAdes, I think a lot of advantages can be had by using the paired information effectively in assemblies, especially with longer insert libraries (Newbler doesn't seem to do this at all). I also haven't tested SPAdes 2.5, so perhaps there are some performance enhancement there. I look forward to seeing further developments in SPAdes and other assemblers which seem to be getting increasingly clever at solving the assembly problem. I hope in future iterations we start to see reference-aware assemblers, perhaps using further extensions to the de Brujin Graph paradigm to scaffold reads with respect to a reference as well as based on pairing information.

Sam

ibest / ARC

SPAdes assembly fails on low coverage (<=10X) contigs (killing ARC) #22