ruanjue / wtdbg2

Redbean: A fuzzy Bruijn graph approach to long noisy reads assembly
GNU General Public License v3.0

Assembled sequences lower than estimated #175

Closed: melop closed this issue 4 years ago

melop commented 4 years ago

Dear Ruanjue,

In the log I noticed the following:

 [Fri Feb 14 22:22:02 2020] TOT 1321464576, CNT 51873, AVG 25475, MAX 853248, N50 71936, L50 4603, N90 11520, L90 22412, Min 512
 [Fri Feb 14 22:22:03 2020] Estimated: TOT 1372882944, CNT 28962, AVG 47403, MAX 2028544, N50 145408, L50 2312, N90 19456, L90 13048, Min 1792

This suggests that wtdbg2 estimates the assembly size at ~1.37 Gb. Given our current low coverage (~14X), this seems like a pretty good estimate compared to the cytology-based figure of 1.53 Gb. However, the bases in the *.cns.fa file add up to only 1196875092 bp. Is the explanation that some of these contigs are actually repeat edges in the assembly graph and are represented only once in the final assembly file? If so, is there a way to know which contigs are likely contracted repeats?

Thank you so much for this great software!

ruanjue commented 4 years ago

wtdbg2 filters out small contigs before writing the output. The relevant options and their defaults are:

 --ctg-min-length <int>
   Min length of contigs to be output, 5000
 --ctg-min-nodes <int>
   Min num of nodes in a contig to be output, 3
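To see how a minimum-length cutoff changes totals like the TOT, CNT and N50 values in the log above, here is a minimal Python sketch. It is not wtdbg2 code: `read_fasta_lengths` and `summarize` are hypothetical helper names, and the cutoff is assumed to keep contigs whose length is greater than or equal to the threshold. It cannot recover contigs that wtdbg2 has already dropped; it only shows the effect of such a filter on a set of lengths.

```python
from typing import Dict, Iterable, List


def read_fasta_lengths(lines: Iterable[str]) -> List[int]:
    """Return the sequence length of each record in FASTA-formatted lines."""
    lengths: List[int] = []
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def summarize(lengths: Iterable[int], min_length: int = 0) -> Dict[str, int]:
    """Total bases, contig count and N50 for contigs of at least min_length.

    Assumption: the cutoff is inclusive (length >= min_length).
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    total = sum(kept)
    n50 = 0
    running = 0
    for n in kept:
        running += n
        # N50 is the length of the contig at which half the total is reached.
        if running * 2 >= total:
            n50 = n
            break
    return {"TOT": total, "CNT": len(kept), "N50": n50}


if __name__ == "__main__":
    # Toy contig lengths, not real assembly data.
    toy = [12000, 9000, 7000, 4000, 1500, 800]
    print(summarize(toy))                   # no filtering
    print(summarize(toy, min_length=5000))  # same cutoff value as --ctg-min-length
```

Running `summarize` on the lengths read from a contig FASTA, with and without a cutoff, gives a rough idea of how many bases sit in contigs shorter than the threshold.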

melop commented 4 years ago

Thanks!