Hi @voutcn
This is not really an issue, but more of a query. We generated nearly 100X of data for a genome of approximately 1.2 Gb and built a de novo assembly with MEGAHIT, and the stats were satisfying. However, during a different set of analyses we later realized that the raw data actually has a duplication rate of nearly 45%, and the overall depth is only about 47X.
So my question and concern is: how does MEGAHIT handle duplicate reads? Is there a checkpoint where it identifies duplicate reads and discards them before assembly? If not, how does this affect the overall assembly, the N50, and the number of contigs? And what would be the ideal solution for this type of problem?
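For reference, here is the back-of-the-envelope arithmetic behind my concern (a minimal sketch; `raw_depth` and `dup_rate` are just the approximate numbers above, and the simple formula ignores quality trimming and other filtering, which may explain why we observe ~47X rather than the ~55X it predicts):

```python
def effective_depth(raw_depth: float, dup_rate: float) -> float:
    """Unique coverage remaining after duplicate reads are removed."""
    return raw_depth * (1.0 - dup_rate)

# ~100X raw data with a ~45% duplication rate leaves roughly 55X
# of unique coverage at most; we measured only about 47X.
print(f"{effective_depth(100, 0.45):.0f}X")
```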
Looking forward to hearing from you.
Thank you, Vinay