nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

HiC-Pro does not perform deduplication even when RMDUP=1 is set in the config file unless the 'merge-persample' option is used, but this is not run by default and the option to do so is not documented on the GitHub page #307

Closed gbonora closed 4 years ago

gbonora commented 4 years ago

Hi,

I recently realized that HiC-Pro (v2.11.1) does not deduplicate read pairs even when RMDUP=1 is set in the config file unless the 'merge-persample' option is also used. However, the 'merge-persample' analysis step is not run by default, and the option to run it is not documented in the 'How to use it ?' section on the GitHub page, although it is described in HiC-Pro's help (see attached slide).

I think it would be helpful to describe the 'merge-persample' analysis step option under the 'How to use it ?' section on your GitHub page and to make it clear that this analysis step is necessary for deduplication.

Thanks.

gb_20200123.pdf

nservant commented 4 years ago

Hi, Indeed, duplicate removal is performed at the merge-persample step of the pipeline. But you're right, there is a mistake in the help page. I will change that for the next version.

hwick commented 4 years ago

I thought I was having this same issue because the Valid_interaction_pairs number in the .Rstat file appears to include duplicate reads in the total. However, the .mergestat file reports both valid_interaction and valid_interaction_rmdup totals, which reflect the counts before and after duplicates are removed. If you run wc -l on the .allValidPairs file, the line count should match the rmdup total if duplicates were removed. I just ran with -s merge_persample to double check, and the resulting .allValidPairs file has the same number of reads as both the rmdup total and the original .allValidPairs file.
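
The check described above (compare the line count of .allValidPairs with the valid_interaction_rmdup total in .mergestat) could be scripted roughly as follows. This is only a sketch: it assumes the .mergestat file holds one whitespace-separated "name count" pair per line, and the helper names (read_mergestat, count_lines, check_dedup) and the file paths in the usage note are hypothetical, not part of HiC-Pro.

```python
from pathlib import Path


def read_mergestat(path):
    """Parse 'name count' lines into a dict (file layout is an assumption)."""
    stats = {}
    for line in Path(path).read_text().splitlines():
        parts = line.split()
        # Skip blank lines, comments, and anything without an integer count.
        if len(parts) >= 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def count_lines(path):
    """Count lines without loading the whole file (same idea as wc -l)."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)


def check_dedup(mergestat_path, all_valid_pairs_path):
    """Compare the .allValidPairs line count with the rmdup total."""
    stats = read_mergestat(mergestat_path)
    n_pairs = count_lines(all_valid_pairs_path)
    return {
        "valid_interaction": stats.get("valid_interaction"),
        "valid_interaction_rmdup": stats.get("valid_interaction_rmdup"),
        "allValidPairs_lines": n_pairs,
        # True when the file on disk matches the post-deduplication total.
        "deduplicated": n_pairs == stats.get("valid_interaction_rmdup"),
    }
```

For example, check_dedup("sample.mergestat", "sample.allValidPairs") (placeholder paths) returns the two totals alongside the line count. If "deduplicated" is True, the .allValidPairs file matches the post-deduplication total.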