kbaseattic / assembly

An extensible framework for genome assembly.
MIT License

add a module for joining overlapping reads #292

Open levinas opened 9 years ago

levinas commented 9 years ago

and perhaps include it in the smart/auto recipes.

We have a case where SPAdes seems to be confused by the insert size and gets stuck in the final misassembly correction step: it's a small dataset, but SPAdes has already spent 10x as much time in bwa-spades as in the core assembly. I haven't seen this before.

[bwa_read_seq] 1.5% bases are trimmed.
[bwa_read_seq] 5.2% bases are trimmed.
[bwa_sai2sam_pe_core] convert to sequence coordinate...
[infer_isize] (25, 50, 75) percentile: (21933, 47699, 73351)
[infer_isize] low and high boundaries: 151 and 176187 for estimating avg and std
[infer_isize] inferred external isize from 6214 pairs: 48110.862 +/- 29546.256
[infer_isize] skewness: 0.043; kurtosis: -1.214; ap_prior: 1.00e-05
[infer_isize] inferred maximum insert size: 194069 (4.94 sigma)
[bwa_sai2sam_pe_core] time elapses: 2.27 sec
[bwa_sai2sam_pe_core] changing coordinates of 1553 alignments.
[bwa_sai2sam_pe_core] align unmapped mate...
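
The `[infer_isize]` line in the log above can be checked mechanically. The following is a minimal sketch, assuming the line format shown in this log; `parse_isize`, `looks_implausible`, and both thresholds are hypothetical names and values chosen for illustration, not part of bwa, SPAdes, or the assembly service.

```python
import re

# Matches the "[infer_isize] inferred external isize ..." line as it appears
# in the log above. Other bwa versions may format this line differently.
ISIZE_RE = re.compile(
    r"\[infer_isize\] inferred external isize from (\d+) pairs: "
    r"([\d.]+) \+/- ([\d.]+)"
)


def parse_isize(line):
    """Return (pairs, mean, std) from an [infer_isize] line, or None."""
    match = ISIZE_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


def looks_implausible(mean, std, max_expected=1000.0, max_cv=0.5):
    """Hypothetical heuristic: flag an estimate that is far larger than a
    typical short-insert library or has a very wide spread. Both thresholds
    are illustrative assumptions, not tuned values."""
    return mean > max_expected or (mean > 0 and std / mean > max_cv)


line = ("[infer_isize] inferred external isize from 6214 pairs: "
        "48110.862 +/- 29546.256")
pairs, mean, std = parse_isize(line)
print(pairs, mean, std, looks_implausible(mean, std))
```

A check like this could in principle decide whether a read-joining step is worth running before assembly, but where it would hook into a recipe is left open here.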

cbun commented 9 years ago

Can you elaborate?


levinas commented 9 years ago

So some paired-end libs contain overlapping reads within most pairs (e.g., the classical Broad 100bp x 2 library with an insert size of 180bp). It's probably always good to join these reads into longer single-end reads, and keep the remaining pairs that don't overlap as pairs, to improve assembly quality. But in the past, I have not seen SPAdes have trouble handling such libraries. In this case, SPAdes is not estimating the insert size correctly, and that may have slowed down the final misassembly correction step drastically.

Tools that join reads include PEAR, FLASH, and maybe some others.
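
To make the idea concrete, here is a minimal sketch of joining one overlapping pair. It is a naive longest-overlap search written for illustration only; it is not how PEAR or FLASH score overlaps, and `expected_overlap`, `join_pair`, `min_overlap`, and `max_mismatch_frac` are hypothetical names and parameters. For the 100bp x 2 library with a 180bp insert mentioned above, the expected overlap is 2 * 100 - 180 = 20bp.

```python
import random

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse-complement an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def expected_overlap(read_len, insert_size):
    """Bases shared by the two reads of a pair; 0 if they do not overlap."""
    return max(0, 2 * read_len - insert_size)


def join_pair(read1, read2, min_overlap=10, max_mismatch_frac=0.1):
    """Merge a pair into one sequence, or return None if no overlap is found.

    read2 is reverse-complemented so both reads are on the same strand, then
    the longest suffix of read1 that matches a prefix of read2 (within the
    allowed mismatch fraction) is used as the overlap. Bases in the
    overlap are taken from read1; base qualities are ignored.
    """
    r2 = reverse_complement(read2)
    for olen in range(min(len(read1), len(r2)), min_overlap - 1, -1):
        tail, head = read1[-olen:], r2[:olen]
        mismatches = sum(a != b for a, b in zip(tail, head))
        if mismatches <= max_mismatch_frac * olen:
            return read1 + r2[olen:]
    return None


# Simulate one pair from a 180bp fragment sequenced as 100bp x 2.
rng = random.Random(0)
fragment = "".join(rng.choice("ACGT") for _ in range(180))
read1 = fragment[:100]
read2 = reverse_complement(fragment[-100:])
merged = join_pair(read1, read2)
print(expected_overlap(100, 180), merged == fragment)
```

Pairs for which `join_pair` returns `None` would stay paired, which matches the split into joined single-end reads plus remaining pairs described above. A real module would more likely wrap an existing tool than reimplement the merge.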

cbun commented 9 years ago

Ah okay, it's clear now, thanks.
