shingocat / lrscaf

TGS scaffolding

Duplicated contigs in scaffolds in human assembly: assembly size goes up from 2.85Gbp to 3.24Gbp #30

Open alekseyzimin opened 3 years ago

alekseyzimin commented 3 years ago

Hello,

I am the developer of the MaSuRCA assembler. I am looking for a good long-read scaffolder, and your paper had nice results. However, when I ran your scaffolder on a human genome assembly produced by MaSuRCA (about 1200 contigs, ~9Mbp contig N50), I found that it duplicated many contigs in the scaffolds, resulting in a much bigger final assembly (3.24Gbp vs 2.85Gbp). This is not the correct behavior. A scaffolder should output about the same amount of sequence as it was given, give or take losses from merging contigs. Contigs should never be duplicated exactly unless there is a very good reason for it, and if that is done, the duplicates must be resolved by remapping the reads and re-doing the consensus. I found that the duplicated contigs were always at the ends of paths in nodePaths.info. My assembly, the config xml, the paf output of minimap and the lrscaf output are posted here:

ftp://ftp.ccb.jhu.edu/pub/alekseyz/lrscaf_debug/
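The duplication described above can be checked with a short script. This is only a minimal sketch: it assumes each path from nodePaths.info has already been parsed into a list of contig IDs (the parsing is not shown, since the file layout is not described here), and the function names and the example paths and lengths are hypothetical.

```python
from collections import Counter


def find_duplicated_contigs(paths):
    """Return {contig_id: count} for contigs that occur more than once across all paths."""
    counts = Counter(contig for path in paths for contig in path)
    return {contig: n for contig, n in counts.items() if n > 1}


def duplicates_only_at_path_ends(paths, duplicated):
    """Return True if every occurrence of a duplicated contig is first or last in its path."""
    for path in paths:
        for i, contig in enumerate(path):
            if contig in duplicated and i not in (0, len(path) - 1):
                return False
    return True


def duplicated_bases(duplicated, lengths):
    """Count the extra bases contributed by the repeated copies (every copy beyond the first)."""
    return sum((n - 1) * lengths[contig] for contig, n in duplicated.items())


# Hypothetical example: contig "c3" ends one path and starts another.
paths = [["c1", "c2", "c3"], ["c3", "c4"], ["c5", "c6"]]
lengths = {"c1": 500, "c2": 700, "c3": 1000, "c4": 300, "c5": 200, "c6": 400}

dups = find_duplicated_contigs(paths)
print(dups)
print(duplicates_only_at_path_ends(paths, dups))
print(duplicated_bases(dups, lengths))
```

If the extra bases counted this way roughly account for the difference between the input and output assembly sizes, that would support duplication, rather than gap filling, as the source of the growth.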

Best, Aleksey Zimin

alekseyzimin commented 3 years ago

This is with version 1.1.11

shingocat commented 3 years ago

Aleksey Zimin, LRScaf builds an assembly graph from the alignments and uses it to do the scaffolding. At a divergence node, if there are no long reads bridging the unique nodes, LRScaf will break the path at the divergence node.
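To illustrate the idea of breaking a path at a divergence node, here is a toy sketch. It is not LRScaf's code: the graph representation, the `bridged` lookup, and the function name are all assumptions made for illustration, and the choice to keep the divergence node as the last element of the broken path is one possible reading of the description above.

```python
def extend_path(successors, start, bridged):
    """Walk forward from `start`, stopping at a divergence node unless a read bridges it.

    successors: dict mapping node -> list of possible next nodes
    bridged: dict mapping (previous node, divergence node) -> next node
             supported by a long read spanning all three (hypothetical structure)
    """
    path = [start]
    seen = {start}
    while True:
        node = path[-1]
        nexts = successors.get(node, [])
        if len(nexts) == 1:
            # Unambiguous continuation.
            nxt = nexts[0]
        elif len(nexts) > 1:
            # Divergence node: continue only if a long read says which branch to take.
            prev = path[-2] if len(path) > 1 else None
            nxt = bridged.get((prev, node))
            if nxt is None:
                break  # no bridging read: the path is broken here
        else:
            break  # dead end
        if nxt in seen:
            break  # avoid walking around a cycle
        path.append(nxt)
        seen.add(nxt)
    return path


# Hypothetical graph: unique contig A leads into R, which diverges to B or C.
successors = {"A": ["R"], "R": ["B", "C"], "B": [], "C": []}

print(extend_path(successors, "A", {}))
print(extend_path(successors, "A", {("A", "R"): "B"}))
```

In this toy model the walk without a bridging read ends at the divergence node itself, while a read spanning A, R and B lets the path continue into B.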