rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0
537 stars 131 forks source link

stuck on a bridge #118

Open devonorourke opened 6 years ago

devonorourke commented 6 years ago

Hi Ryan, Things continue to work exceptionally well with Unicycler - thank you for such an easy yet powerful program. I'm getting hung up trying to complete one last sample (which is both pretty good luck yet incredibly frustrating to be so close to the finish!). I'm wondering if you've ever seen this phenomenon before?

Here's the really strange part. This only happens for one specific sample.

Here's the tail of the log file for the hybrid assembly, which I've reproduced a number of times now and always looks exactly the same:

    30   not found                                       
    31   not found                                       
    32   not found                                       
    33   not found                                       
    34   not found                                       
    35   not found                                       
    36   not found                                       

Saving /mnt/lustre/macmaneslab/devon/nanoPore/VpProject/unicycler/run2/hybrid/fixed5171/miniasm_assembly/15_contigs_placed.gfa

Creating miniasm/Racon bridges (2018-06-21 19:28:08)
----------------------------------------------------
    Now that the miniasm/Racon string graph is complete, Unicycler will use it to build bridges between anchor segments.

     Start → end     Best path                                 Quality

Thanks for any advice you can offer.

rrwick commented 6 years ago

Two guesses:

  1. The short read assembly graph has a very complex knot that is confounding Unicycler's path-finding algorithm. I.e. there are so many possible ways to bridge two segments, it's taking forever to finish. Unicycler has some heuristics to try to manage this sort of thing, but they're not perfect. If this is the case, it probably will finish - it just might take a long time.
  2. There is a mismatch between the short and long reads that has really thrown off Unicycler. I'm guessing this because you said that the long-read-only assembly worked okay, but the tail of your log shows lots of 'not found' contigs. This means that Unicycler can't find the short read contigs in the long read assembly. Your assembly has at least 7 not-founds, maybe more. That's worrying! Can you do a sanity check on your data, and make sure that the short-read-only assembly and long-read-only assembly genuinely look to be from the same sample?

I'm happy to dig deeper myself. Assuming the data isn't too sensitive, you could send me some relevant files (the overlaps_removed gfa file from the short-read-only assembly and the final long-read-only assembly gfa) and I'll take a look.

Ryan

devonorourke commented 6 years ago

Hi Ryan,

I've tried a few new things with this one weird sample and have one simple conclusion: I think I had too much Nanopore data.

The min50kb.fastq didn't stall - it whipped through the entire assembly without issue. However because there were so few total reads the biggest of the two chromosomes wasn't closed up (4 segments).

I performed one other filtering strategy as a sort of middle-ground by keeping only reads > 25 kb (min25kb.fastq), and that hybrid assembly finished about as quick as any of the other assemblies (under 2 hours using -t 24. In that fastq file there were:

This all is making me wonder something about the message you post in each .log file about the necessary throughput during the miniasm stage of the hybrid assembly:

Assembling contigs and long reads with miniasm (2018-07-01 11:26:18)
--------------------------------------------------------------------
    Unicycler uses miniasm to construct a string graph assembly using both the short read contigs and the long reads. It will then use the resulting string graph to pr
oduce bridges between contigs. This method requires decent coverage of long reads and therefore may not be fruitful if long reads are sparse. However, it does not rely
 on the short read assembly graph having good connectivity and is able to bridge an assembly graph even when it contains many dead ends.

I would think that for speed of assembly sake, the fewer reads the better, with the tradeoff being that you need some long reads to make bridging effective. But given your message above there is probably also a minimum depth of coverage that we'd like to target. Perhaps this is explicitly mentioned somewhere in the user guide and I'm just missing it - what value should we be shooting for?

In this rare case where I have tons and tons of reads, where does the fulcrum rest in the balancing of speed efficiency and bridging success? I can filter out most reads and keep only (few) ultra long reads; I can filter out small and long reads and keep only moderate reads a a higher throughput; I can wait for days to get a super high memory node on my compute cluster and just use all of them (though that option seems silly to me).

Thanks for your help!

rrwick commented 6 years ago

Unicycler builds bridges with long reads in two ways:

  1. Using a miniasm assembly (what you quoted above). This takes a decent amount of long reads but doesn't rely as much on a good Illumina assembly graph. You can turn this approach off with --no_miniasm
  2. By scaffolding the graph with long reads. This takes fewer long reads but relies on a good graph.

For some rough guesses, I'd say at least 20x long read depth of coverage or more is ideal, and you may continue to see gains up to about 50-100x depth. But if your Illumina reads are really good, then you can get away with less because Unicycler can rely more heavily on option 2. I've seen assemblies finish with as little as 4x depth, but that takes a fair bit of luck. If I have more than 100x depth, I subsample for length and quality (I use Filtlong, but I'm biased on the matter).

Your read files are large, but not insanely large. I wouldn't expect Unicycler to choke on them the way your describe, and it definitely shouldn't eat up 128 GB of RAM! So I still suspect something is going wrong.

I'd suggest trying a long-read-only assembly, possibly with Canu. Then compare it to your min25kb.fastq hybrid assembly. If they are in general agreement, then you can be confident that Unicycler didn't totally botch it. But if they disagree significantly, more investigation is warranted. In particular, I'd fear that your Illumina reads and Nanopore reads are of slightly different genomes, and that's what threw off Unicycler.

Ryan

devonorourke commented 6 years ago

Thanks for the detailed response Ryan.

I've assembled this sample (and others) with:

Is there a specific metric (or ten) that you would advise investigating? My tasks this morning are to:

Happy to share any info you think might be of interest.

rrwick commented 6 years ago

Yes, all of those things you said could be informative. I would also try aligning the assemblies to each other: Mauve is good if you want a GUI tool, and MUMmer is a good CLI option. When the assemblies agree with each other, that gives me confidence that they are correct. If they disagree, then more investigation is warranted. It's especially worth looking closely at the specific places where they disagree.

DanielleMStevens commented 4 years ago

Hi Ryan, Thanks to all your contributions to the microbial genomics community! Hopefully, you can also help me figure a weird problem I'm having that's related to this issue. I'm building a snakemake pipeline that uses unicycler as the assembler. I too am running into a similar issue where it seems to repeatedly get stuck on Racon bridges step. I don't think my computer is frozen since htop is active but unicycler log has yet to update in several hours. I don't think it's anything with the snakemake syntax as some samples have ran without any problems, which leaves me a little stuck on where it might be getting stuck. The dataset I'm building it off of is actually from one of your papers (https://github.com/rrwick/Bacterial-genome-assemblies-with-multiplex-MinION-sequencing). Is there anything I should look into in particular?

aaronphillips7493 commented 4 years ago

Hey Ryan, I am having a similar issue with a Snakemake pipeline that aims to assembly mitochondrial genomes using long and short reads, except my Unicycler run is freezing on the "Building long read bridges" step. I will try to filter down the long reads to reduce coverage (which sits at about 3800x; while short reads sit at about 2500x). Perhaps there is just too much data? Thanks, Aaron :)