ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Using wtdbg2 to assemble phased haplotypes #240

Closed Redmar-van-den-Berg closed 2 years ago

Redmar-van-den-Berg commented 2 years ago

First of all, thanks for creating wtdbg2. It runs very fast, and I like how thoroughly the algorithm is described in the publication.

I have a targeted HiFi dataset of pharmacogenetic genes, and I want to reconstruct phased haplotypes from this data (for each gene). The reason is that different alleles have different activities in metabolising drugs, so a collapsed consensus does not provide enough information.

I've run wtdbg2 -x ccs and wtpoa-cns on this dataset, and the contigs match the phasing of the HiFi reads (from whatshap) very well. However, as expected, the different haplotypes are collapsed in the final dbg.raw.fa.

If I understand correctly, there are two causes for this:

  1. Different K-bins can be represented by the same vertex if they are similar enough to align together, which could mean K-bins from different haplotypes end up in the same vertex.
  2. wtpoa-cns explicitly collapses the two haplotypes to generate the consensus.

What would be the best way to get phased haplotypes from wtdbg2? The information that I need should be in dbg.ctg.lay.gz, but I'm not sure how to process that file into non-consensus contigs. Any pointers would be appreciated.
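As a starting point for inspecting dbg.ctg.lay.gz, here is a minimal sketch that collects the read names used in each contig's layout. It assumes the layout is whitespace-delimited text with `>` contig headers, `E` edge lines, and `S` lines whose second field is the read name. That format is an assumption here and should be checked against the wtdbg2 README. The function names `reads_per_contig` and `load_layout` are made up for illustration.

```python
import gzip
import io
from collections import defaultdict


def reads_per_contig(handle):
    """Map each contig name to the set of read names in its layout.

    Assumed (unverified) layout format:
      >ctg1 nodes=... len=...                          contig header
      E <offset> <node1> <strand> <node2> <strand>     edge between nodes
      S <read> <strand> <offset> <length> <sequence>   read segment
    """
    contigs = defaultdict(set)
    current = None
    for line in handle:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith(">"):
            # Header line: the contig name is the first token after '>'.
            current = fields[0][1:]
            contigs[current]  # register the contig even if it has no S lines
        elif fields[0] == "S" and current is not None and len(fields) > 1:
            contigs[current].add(fields[1])
    return dict(contigs)


def load_layout(path):
    """Read a gzip-compressed layout file such as dbg.ctg.lay.gz."""
    with gzip.open(path, "rt") as handle:
        return reads_per_contig(handle)


# Tiny synthetic layout, not real wtdbg2 output.
example = io.StringIO(
    ">ctg1 nodes=2 len=512\n"
    "E\t0\tN1\t+\tN2\t+\n"
    "S\tread_a\t+\t0\t256\tACGT\n"
    "S\tread_b\t-\t10\t256\tACGA\n"
)
layout = reads_per_contig(example)
print(sorted(layout["ctg1"]))
```

The per-contig read sets could then be joined with per-read haplotype assignments from a phasing tool, so that each haplotype's reads are handled separately. Whether that yields usable non-consensus contigs is untested here.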

ruanjue commented 2 years ago

wtdbg2 is aimed at assembling long noisy reads, so it has no phasing module. K-bins tolerate sequencing errors well and make processing long reads very fast, but they lead to collapsed haplotypes. In your case, first polish the contigs following README.md, then use another phasing tool (e.g. longshot).

Jue
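The polish-then-phase suggestion above might be sketched as follows. The script only builds and prints shell commands; it does not run them. The polishing commands are modelled on the wtdbg2 README, and the longshot call uses its `--bam`, `--ref` and `--out` options. The file names, thread count, `map-pb` preset and the helper `build_pipeline` are assumptions for illustration, and every flag should be checked against each tool's `--help` before use.

```python
import shlex


def build_pipeline(reads="reads.fa.gz", prefix="dbg", threads=4):
    """Return shell commands for polishing contigs and calling phased variants.

    All file names are hypothetical placeholders derived from `prefix`.
    """
    q = shlex.quote
    raw = q(f"{prefix}.raw.fa")      # consensus from the first wtpoa-cns run
    bam = q(f"{prefix}.bam")         # reads mapped to the raw consensus
    cns = q(f"{prefix}.cns.fa")      # polished consensus
    cns_bam = q(f"{prefix}.cns.bam") # reads mapped to the polished consensus
    vcf = q(f"{prefix}.longshot.vcf")
    rd = q(reads)
    # map-pb follows the README; a HiFi-specific preset may suit CCS data better.
    mm2 = f"minimap2 -t {threads} -ax map-pb"
    return [
        # 1. Map the reads back to the raw consensus.
        f"{mm2} {raw} {rd} | samtools sort -@ {threads} -o {bam} -",
        # 2. Polish, skipping secondary and supplementary alignments.
        f"samtools view -F 0x900 {bam} | "
        f"wtpoa-cns -t {threads} -d {raw} -i - -fo {cns}",
        # 3. Remap to the polished contigs and index both inputs.
        f"{mm2} {cns} {rd} | samtools sort -@ {threads} -o {cns_bam} -",
        f"samtools index {cns_bam}",
        f"samtools faidx {cns}",
        # 4. Call and phase variants against the polished contigs.
        f"longshot --bam {cns_bam} --ref {cns} --out {vcf}",
    ]


for command in build_pipeline():
    print(command)
```

Printing the commands rather than executing them keeps the sketch safe to run and makes it easy to review each step before pasting it into a shell.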

Redmar-van-den-Berg commented 2 years ago

Thank you for confirming that K-bins can lead to collapsed haplotypes. I'll try other phasing tools as you suggested.