tangerzhang / ALLHiC

ALLHiC: phasing and scaffolding polyploid genomes based on Hi-C data

PCR duplicates #101

Closed by fergsc 3 years ago

fergsc commented 3 years ago

Hi, I am attempting to use the ALLHiC pipeline on a highly repetitive diploid plant genome that I have assembled. My Hi-C data contain a lot of PCR duplicates. Is it recommended to remove these reads before running the ALLHiC pipeline?

thanks.

tangerzhang commented 3 years ago

Hi @fergsc, PCR duplicates provide little information for Hi-C scaffolding and may even introduce noise. It is therefore better to remove these reads before running the ALLHiC pipeline.

fergsc commented 3 years ago

Thanks, I shall remove them during alignment.
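For readers wondering what "removing PCR duplicates" means in practice, here is a minimal Python sketch of the usual idea: two read pairs are treated as duplicates when both mates map to the same contig, position, and strand. This is not ALLHiC's own code. The `PairedAlignment` record and the function names are hypothetical, and the sketch assumes the alignments have already been parsed into such records. On real BAM files an established tool such as `samtools markdup` or Picard MarkDuplicates would normally be used instead.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class PairedAlignment(NamedTuple):
    """Hypothetical record for one aligned Hi-C read pair."""
    name: str
    contig1: str
    pos1: int
    strand1: str
    contig2: str
    pos2: int
    strand2: str


End = Tuple[str, int, str]


def pair_key(pair: PairedAlignment) -> Tuple[End, End]:
    """Build a key that is identical for pairs with the same two mapped ends.

    The two ends are sorted so that swapping mate 1 and mate 2
    does not hide a duplicate.
    """
    end1 = (pair.contig1, pair.pos1, pair.strand1)
    end2 = (pair.contig2, pair.pos2, pair.strand2)
    return (end1, end2) if end1 <= end2 else (end2, end1)


def remove_duplicates(pairs: Iterable[PairedAlignment]) -> List[PairedAlignment]:
    """Keep the first pair seen for each key and drop later copies."""
    seen: Set[Tuple[End, End]] = set()
    kept: List[PairedAlignment] = []
    for pair in pairs:
        key = pair_key(pair)
        if key not in seen:
            seen.add(key)
            kept.append(pair)
    return kept


if __name__ == "__main__":
    # Made-up example data: r2 duplicates r1 with the mates swapped.
    example = [
        PairedAlignment("r1", "ctgA", 100, "+", "ctgB", 5000, "-"),
        PairedAlignment("r2", "ctgB", 5000, "-", "ctgA", 100, "+"),
        PairedAlignment("r3", "ctgA", 100, "+", "ctgB", 5001, "-"),
    ]
    print([p.name for p in remove_duplicates(example)])
```

The sketch keeps whichever copy appears first; real duplicate-marking tools usually keep the copy with the best base or mapping quality.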

distilledchild commented 2 years ago

@tangerzhang Based on the source code on GitHub, duplicate removal is now commented out, right? So you mean that, regardless of whether the genome is diploid or polyploid, it is recommended to discard duplicated reads, right? But I am confused, because the author's note in the source code says that duplicates are no longer removed: "NOTE: As of August 24, 2013, I'm no longer removing PCR duplicates..."

tangerzhang commented 2 years ago

Hi @theshowmustgolangon, I apologize for my confusing answer. As far as I know, many sequencing companies now use a PCR-free approach to library construction, so in theory the sequencing data should contain few PCR duplicates, and those few should have limited influence on Hi-C scaffolding results. Yes, we can remove these PCR-duplicated reads, but it is probably not necessary, because recently sequenced libraries do not contain many PCR duplicates.