voutcn / megahit

Ultra-fast and memory-efficient (meta-)genome assembler
http://www.ncbi.nlm.nih.gov/pubmed/25609793
GNU General Public License v3.0

High duplicate read rate: how does it affect the assembly and the process? #304

Open stachyris opened 3 years ago

stachyris commented 3 years ago

Hi @voutcn, this is not really an issue, more of a query. We generated nearly 100X of data for a genome of approximately 1.2 Gb, built a de novo assembly with Megahit, and got satisfactory stats. Later, during a different set of analyses, we realized that the raw data has a duplication rate of nearly 45% and that the overall depth was only about 47X.
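To make the depth figures concrete, here is a minimal sketch of the arithmetic. It assumes that the duplication rate is the fraction of bases removed by deduplication and that depth is simply total bases divided by genome size; the function names are hypothetical and are not part of Megahit.

```python
def raw_depth(total_bases: float, genome_size: float) -> float:
    """Nominal sequencing depth: total sequenced bases over genome size."""
    return total_bases / genome_size


def depth_after_dedup(depth: float, duplication_rate: float) -> float:
    """Depth left once duplicates are removed (assumed: rate = fraction removed)."""
    if not 0.0 <= duplication_rate <= 1.0:
        raise ValueError("duplication_rate must be between 0 and 1")
    return depth * (1.0 - duplication_rate)


genome_size = 1.2e9          # approx. 1.2 Gb genome
total_bases = 100 * 1.2e9    # nearly 100X of raw data
duplication_rate = 0.45      # nearly 45% duplicates

nominal = raw_depth(total_bases, genome_size)
unique = depth_after_dedup(nominal, duplication_rate)
print(f"nominal depth: {nominal:.1f}X, depth after dedup: {unique:.1f}X")
```

Under these assumptions the unique depth comes out at roughly 55X, somewhat above the ~47X observed, so the two numbers were presumably measured at different steps (for example after trimming or mapping); that explanation is a guess.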

So my question and concern is: how does Megahit handle duplicate reads? Is there a checkpoint where it identifies duplicate reads and discards them before assembly? If not, how does this affect the overall assembly, the N50, and the number of contigs? And what would be the ideal solution for this kind of problem?

Looking forward to hearing from you.

Thank you, Vinay