Platanus error with PacBio data due to sequence read length.

sr320 / LabDocs

Roberts Lab Documents


I left Platanus running on Hyak over the weekend. It churned through the Illumina data, but when it got to the PacBio data it threw a "DNA read seq length is too long." error.

Poking around the source code, it looks like Platanus caps read length at 5 kbp, and the PacBio reads are much longer.
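
To see how much of the PacBio data would trip that limit, something like the following could be run on the FASTQ file first. This is only a rough sketch: the 5000 bp cutoff is taken from the figure quoted above rather than read out of Platanus, the function names are made up for illustration, and it assumes plain four-line FASTQ records (no wrapped sequence lines).

```python
import gzip
from typing import Iterable, Iterator, Tuple

# Assumption: cutoff taken from the "5 kbp" figure in this thread,
# not from Platanus itself. Whether the real check is > or >= is unknown.
MAX_READ_LEN = 5000


def read_lengths(path: str) -> Iterator[int]:
    """Yield the sequence length of each record in a four-line FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            # In four-line FASTQ, the sequence is the second line of each record.
            if i % 4 == 1:
                yield len(line.strip())


def summarize(lengths: Iterable[int], limit: int = MAX_READ_LEN) -> Tuple[int, int, int]:
    """Return (total reads, reads longer than limit, longest read)."""
    total = too_long = longest = 0
    for n in lengths:
        total += 1
        if n > limit:
            too_long += 1
        if n > longest:
            longest = n
    return total, too_long, longest
```

Calling `summarize(read_lengths("pacbio_reads.fastq"))` (a hypothetical file name) would give the read count, the number of reads over the cutoff, and the longest read length.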

It looks like we may have a few options from here.

1. We can try a hierarchical approach where we first run Canu (a fork of the deprecated Celera Assembler) on the PacBio data and then polish the assembly with the Illumina data using something like Pilon.
2. We can correct errors in the PacBio data with the short-read data using proovread and then assemble with Canu.
3. We can assemble the Illumina data with Platanus or SparseAssembler and then overlap the PacBio data with DBG2OLC, using the Illumina contigs as anchors.

Anyone have any familiarity with any of the routes?

We've still got PBJelly running on RoadRunner, so barring any catastrophes, we should have something when that completes.

From this guide:

"We used the pipeline recommended by DBG2OLC(Ye, et al. 2014) to perform hybrid assemblies. In this pipeline, we used Platanus to perform De Bruijn graph assembly on the Illumina reads. We used 8.36 Gb (64.3X) of Illumina sequence data of the ISO1 D. melanogaster inbred line generated by the DPGP project (Langley, et al. 2012) to generate a De Bruijn graph assembly using Platanus. We used DBG2OLC to align our PacBio reads to the De Bruijn graph assembly to produce a ‘backbone’, then, according to the DBG2OLC standard pipeline, used the backbone generate the consensus using the programs Blasr (Chaisson and Tesler 2012) and PBDagCon. As with the PB only assemblies above, we evaluated assembly quality using the Quast package."

So basically the 3rd option you suggested. I also found this recent paper that uses PacBio and Illumina reads with Platanus (I didn't read past the abstract to see how they did it, though).

I'm not familiar with Canu, but I did find this GitHub issue with recommendations for using it on highly heterozygous data, which might be helpful if you decide to go that route.
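
For comparing our own Illumina depth against the numbers in that guide, the coverage arithmetic is just total bases divided by genome size. A quick sketch, assuming "Gb" in the quote means gigabases (function names here are only for illustration):

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Fold coverage: total sequenced bases divided by genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bases / genome_size


def implied_genome_size(total_bases: float, fold_coverage: float) -> float:
    """Genome size implied by a data volume and a stated fold coverage."""
    if fold_coverage <= 0:
        raise ValueError("fold_coverage must be positive")
    return total_bases / fold_coverage


# Figures from the quoted guide: 8.36 Gb of Illumina data at 64.3X.
# Assumption: "Gb" means gigabases (1e9 bases).
guide_genome_size = implied_genome_size(8.36e9, 64.3)
print(f"Implied genome size: {guide_genome_size / 1e6:.1f} Mb")
```

With the guide's figures this works out to roughly 130 Mb. Plugging our own total Illumina bases and estimated genome size into `coverage` would show whether we are in a similar depth range.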

Platanus error with PacBio data due to sequence read length. #614