How Wtdbg2 binning the reads? / How to restore the base sequence from binned sequence?

ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly

GNU General Public License v3.0


How Wtdbg2 binning the reads? / How to restore the base sequence from binned sequence? #237

Closed cyr20040123 closed 3 years ago

cyr20040123 commented 3 years ago

Hi Dr. Ruan,

Thank you for your highly efficient assembly tool. I have read your paper and am still curious about how Wtdbg2 converts a read sequence into a binned one.

For example, assume one bin is 8 bp rather than 256 bp, and consider the two error-free reads below.

(Read_1): AACCTTGGAACCTTGGAACCTTGG
(Read_2): TTGGAACCTTGGAACCTTGGAACC

If we bin them directly, we get:

(Read_1): AACCTTGG | AACCTTGG | AACCTTGG
(Read_2): TTGGAACC | TTGGAACC | TTGGAACC

The issue is that the two binned reads could then hardly be aligned pairwise on the basis of shared k-mers.

I would like to know:

1. Is my understanding of binning the reads correct?
2. How do you deal with the above issue, and how do you deal with the k-mers around the cutting positions (such k-mers span two bins)?
3. How do you restore the base sequence from the binned nodes in the FBG?

Thank you!

ruanjue commented 3 years ago

1. You are right.
2.1. The k-mer size is much smaller than the bin size.
2.2. In most cases one bin will match two adjacent subject bins, unless the bin boundaries happen to be well aligned.
2.3. That is why the FBG is a sparse graph: when it is sparse, there are no artificial loops in the graph caused by not-well-aligned bins.
3. wtdbg2 stores the sequences as bases and builds a bin data structure that refers to those bases.
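Points 2.1, 2.2 and 3 above can be illustrated with a small sketch. This is not wtdbg2's code: the `Bin` record, the helper names, the toy `GENOME` string, the 8 bp bin size and k = 4 are all assumptions chosen for illustration (the toy string was picked so that every 4-mer in it is unique). Bins only store a reference into the read's bases, and a query bin is matched to any subject bin that shares at least one k-mer with it.

```python
from typing import Dict, List, NamedTuple, Set


class Bin(NamedTuple):
    """Hypothetical bin record: a reference into a read, not a copy of bases."""
    read_id: int
    start: int   # offset of the bin within the read
    length: int  # bin size in bases


def make_bins(read_id: int, seq: str, bin_size: int) -> List[Bin]:
    """Cut a read into consecutive fixed-size bins (a trailing partial bin is dropped)."""
    return [Bin(read_id, s, bin_size)
            for s in range(0, len(seq) - bin_size + 1, bin_size)]


def bin_bases(reads: Dict[int, str], b: Bin) -> str:
    """Recover the bases of a bin from the stored read sequence."""
    return reads[b.read_id][b.start:b.start + b.length]


def kmers(seq: str, k: int) -> Set[str]:
    """All k-mers of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def matching_bins(query: Bin, subjects: List[Bin],
                  reads: Dict[int, str], k: int) -> List[int]:
    """Indices of subject bins sharing at least one k-mer with the query bin."""
    q = kmers(bin_bases(reads, query), k)
    return [i for i, s in enumerate(subjects)
            if q & kmers(bin_bases(reads, s), k)]


# Toy data (assumption): two error-free reads whose bins are 4 bp out of phase.
GENOME = "ACGTTGCAAGGCTATCCGATGTCACTAG"
reads = {1: GENOME[0:24], 2: GENOME[4:28]}
bins1 = make_bins(1, reads[1], bin_size=8)
bins2 = make_bins(2, reads[2], bin_size=8)

for qb in bins2:
    print(bin_bases(reads, qb), "->", matching_bins(qb, bins1, reads, k=4))
```

In this toy case no bin of read 2 is identical to any bin of read 1, yet each bin of read 2 still shares k-mers with the one or two read 1 bins it straddles, which is the behaviour described in 2.2.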

cyr20040123 commented 3 years ago

Dr. Ruan, thank you so much for your reply! It does help me to understand.

Q3 supplement: How to restore the base sequence from the binned nodes in the FBG? According to your answer 2.2, the aligned bins in an FBG node are not exactly the same, because they come from different reads. So when converting the binned data structure back to base sequences, how do you decide which underlying base sequence to use? (Equivalent k-bins may come from different reads; which one is taken as the representative?)

ruanjue commented 3 years ago

Actually, the FBG is an Eulerian path graph, so when we talk about obtaining the sequences, we mean the edges. The not-well-aligned problem still exists in both nodes and edges, but it is nothing to worry about. wtdbg2 constructs a PO-MSA for each edge and then joins the consensus sequences of two linear edges by pairwise sequence alignment. So the sequences of the final unitigs are composed of edge consensus sequences joined in an overlapping way.
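The final joining step can be sketched roughly as follows. This is a simplified illustration, not wtdbg2's implementation: the per-edge consensus strings are taken as given (no PO-MSA is built here), and an exact suffix/prefix search stands in for the pairwise sequence alignment, which in practice has to tolerate small differences between neighbouring consensus sequences. The function names, the `min_overlap` parameter and the toy sequences are assumptions.

```python
from typing import List


def join_by_overlap(left: str, right: str, min_overlap: int = 3) -> str:
    """Append `right` to `left`, merging their longest exact suffix/prefix overlap."""
    longest = min(len(left), len(right))
    for n in range(longest, min_overlap - 1, -1):
        if left.endswith(right[:n]):
            return left + right[n:]
    raise ValueError("no exact overlap of at least min_overlap bases")


def layout_unitig(edge_consensus: List[str], min_overlap: int = 3) -> str:
    """Chain consecutive edge consensus sequences into one unitig sequence."""
    unitig = edge_consensus[0]
    for seq in edge_consensus[1:]:
        unitig = join_by_overlap(unitig, seq, min_overlap)
    return unitig


# Toy data (assumption): three edge consensus sequences overlapping by 4 bases.
GENOME = "ACGTTGCAAGGCTATCCGATGTCACTAG"
edges = [GENOME[0:12], GENOME[8:20], GENOME[16:28]]
print(layout_unitig(edges))
```

Because each edge consensus overlaps its neighbour, the unitig is recovered without having to pick a single read as the representative for any bin.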

cyr20040123 commented 3 years ago

Got it! Thank you so much, Dr. Ruan!