rrwick / Unicycler

hybrid assembly pipeline for bacterial genomes
GNU General Public License v3.0

racon is taking a long time (illumina + pacbio reads) #116

Open lfaller opened 6 years ago

lfaller commented 6 years ago

Hi all,

I usually run Unicycler on MinION + Illumina reads, but I got some raw PacBio reads the other day and wanted to try them. I used bam2fastq to convert the PacBio BAM files to FASTQ, Filtlong to downsample them, and BBNorm to downsample the Illumina reads, and then ran Unicycler as usual.

The Racon step in Unicycler seems to be taking much longer than it ever has. I've attached the log output below. It has been running for 47 hours on 40 cores with 250 GB of RAM.

Unicycler usually runs pretty fast when I give it downsampled data. This is the first time I've worked with PacBio reads. Is the Racon issue related to the PacBio reads, or could it be something else? Do I need to preprocess the PacBio reads differently? Should I not be using the raw reads?

Any advice is appreciated!

My bbnorm parameters: bbnorm.sh in1=R1.fastq in2=R2.fastq out1=R1_norm.fastq out2=R2_norm.fastq target=100 min=5

My filtlong parameters: filtlong --min_length 1000 --keep_percent 90 --target_bases 500000000

Starting Unicycler (2018-06-19 16:31:45)
----------------------------------------
    Welcome to Unicycler, an assembly pipeline for bacterial genomes. Since you provided both short and long reads, Unicycler will perform a hybrid assembly. It will first use SPAdes to make a short-read assembly graph, and then it will use various methods to scaffold that graph with the long reads.
    For more information, please see https://github.com/rrwick/Unicycler

Command: /usr/local/bin/unicycler -l pacbio_assembly/filtlong_out/pacbio_filtered.fastq -1 pacbio_assembly/bbnorm_out/norm_R1.fastq.gz -2 pacbio_assembly/bbnorm_out/norm_R2.fastq.gz -o pacbio_assembly/unicycler_hybrid_out --pilon_path /home/lina/pilon-1.22.jar --threads 40

Unicycler version: v0.4.2
Using 40 threads
...
...
...
Loading reads (2018-06-19 19:26:05)
110,230 / 110,230 (100.0%) - 500,002,276 bp

Assembling contigs and long reads with miniasm (2018-06-19 19:26:09)
    Unicycler uses miniasm to construct a string graph assembly using both the short read contigs and the long reads. It will then
use the resulting string graph to produce bridges between contigs. This method requires decent coverage of long reads and
therefore may not be fruitful if long reads are sparse. However, it does not rely on the short read assembly graph having good
connectivity and is able to bridge an assembly graph even when it contains many dead ends.
    Unicycler uses two types of "reads" as assembly input: anchor contigs from the short-read assembly and actual long reads which
overlap two or more of these contigs. It then assembles them with miniasm.

Aligning long reads to graph using minimap

Saving to /home/lina/pacbio_assembly/unicycler_hybrid_out/miniasm_assembly/01_assembly_reads.fastq:
  37 short-read contigs
  3,291 long reads

Finding overlaps with minimap... success
  66,937 overlaps

Assembling reads with miniasm... success
  251 segments, 243 links

Saving /home/lina/pacbio_assembly/unicycler_hybrid_out/miniasm_assembly/11_branching_paths_removed.gfa
Merging segments into unitigs:
  8 linear unitigs
  total size = 7,777,544 bp
Saving /home/lina/pacbio_assembly/unicycler_hybrid_out/miniasm_assembly/12_unitig_graph.gfa

Polishing miniasm assembly with Racon (2018-06-19 19:26:38)
    Unicycler now uses Racon to polish the miniasm assembly. It does multiple rounds of polishing to get the best consensus.
Circular unitigs are rotated between rounds such that all parts (including the ends) are polished well.

Saving to /home/lina/pacbio_assembly/unicycler_hybrid_out/miniasm_assembly/racon_polish/polishing_reads.fastq:
  37 short-read contigs
  110,230 long reads

Polish       Assembly          Mapping
round            size          quality
begin       7,777,544        32,673.36

rrwick commented 6 years ago

That's definitely too long and too much memory, so something's not right! It should work with PacBio reads, so the issue isn't obvious.

What version of Racon are you using? The most recent version of Unicycler (v0.4.6) supports the newer Racon which runs faster. So maybe try updating both Unicycler and Racon.

Another workaround would be to do the long read assembly separately (e.g. in Canu) and give it to Unicycler with the --existing_long_read_assembly option. That will skip the Racon step entirely and dodge the issue.
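
A minimal sketch of that workaround, assuming Canu 1.x and Unicycler are both on the PATH. The read paths are the ones from the command in the original post. The Canu prefix `pacbio`, the directories `canu_out` and `unicycler_canu_out`, and the `genomeSize=5m` estimate are placeholders that do not come from this thread (Canu requires a genome size estimate, so substitute one that suits your organism). The script only prints the commands unless `RUN=1` is set, so it can be reviewed before anything is launched.

```shell
#!/usr/bin/env bash
# Sketch only: assemble the long reads separately, then hand that assembly
# to Unicycler so that it can skip its own Racon polishing step.
# Placeholders (not from the thread): prefix "pacbio", dirs "canu_out" and
# "unicycler_canu_out", genomeSize=5m.

LONG=pacbio_assembly/filtlong_out/pacbio_filtered.fastq
R1=pacbio_assembly/bbnorm_out/norm_R1.fastq.gz
R2=pacbio_assembly/bbnorm_out/norm_R2.fastq.gz

# Print each command; execute it only when RUN=1.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# 1. Long-read-only assembly with Canu (writes canu_out/pacbio.contigs.fasta).
run canu -p pacbio -d canu_out genomeSize=5m -pacbio-raw "$LONG"

# 2. Hybrid assembly using the existing long-read assembly.
run unicycler -1 "$R1" -2 "$R2" -l "$LONG" \
    --existing_long_read_assembly canu_out/pacbio.contigs.fasta \
    -o unicycler_canu_out --threads 40
```

Running it as `RUN=1 bash workaround.sh` (the script name is arbitrary) would execute both steps in order.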

Let me know how you go!

Ryan

thsyd commented 5 years ago

Hi, I experienced the same problem with Illumina + PacBio reads. I use an HPC that has most software pre-installed as modules. Running Unicycler 0.4.7 with the then-installed version of Racon (dated 20170621) resulted in a stall at the same point the OP described here.

I had the HPC staff update to Racon 1.3.1, and now the process runs to the end. So it was probably a matter of updating Unicycler's dependencies.
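
For anyone checking what a cluster module actually provides, something like the following may help. It is a sketch under a few assumptions: the installed Racon accepts `--version` (very old, date-stamped builds may not, in which case the script reports the version as unknown), GNU `sort -V` is available, and 1.3.1 is simply the version that worked in this comment, not a documented minimum. The helper names `version_ge` and `tool_version` are made up for this example.

```shell
#!/usr/bin/env bash
# version_ge A B: succeed if version A >= version B (relies on GNU sort -V).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# tool_version TOOL: print the first x.y.z-looking token from "TOOL --version",
# or nothing if the tool is missing or does not report a version.
tool_version() {
    "$1" --version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1
}

racon_v="$(tool_version racon)"
if [ -z "$racon_v" ]; then
    echo "racon: version unknown (not on PATH, or too old to report one)"
elif version_ge "$racon_v" 1.3.1; then
    echo "racon $racon_v: at least the version reported to work here"
else
    echo "racon $racon_v: older than 1.3.1, consider updating"
fi
```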