nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

How to delete duplicate pairs when using hicpro at high resolution? #636

Open xingql983 opened 5 months ago

xingql983 commented 5 months ago

Due to issues with our library construction method, we can obtain matrices with a resolution finer than 500 bp. At the same time, the insert sizes of the reads we obtained are between 200 and 300 bp. This led me to wonder how HiC-Pro handles duplications caused by short fragments. I then checked the code used in the HiC-Pro process and found this segment that deals with duplicate pairs:

    sort -T ${TMP_DIR} -S 50% -k2,2V -k3,3n -k5,5V -k6,6n -m ${IN_DIR}/${RES_FILE_NAME}/*.validPairs | \
    awk -F"\t" 'BEGIN{c1=0;c2=0;s1=0;s2=0}(c1!=$2 || c2!=$5 || s1!=$3 || s2!=$6){print;c1=$2;c2=$5;s1=$3;s2=$6}' > ${DATA_DIR}/${RES_FILE_NAME}/${RES_FILE_NAME}.allValidPairs

My understanding is that this script deletes PCR duplicates based on the chromosomal positions to which the fragments in the validPairs records have aligned. Is that correct? Additionally, I am concerned about potential biases with short reads. For instance, if their sequences are not identical but are very close, with only a few base pairs difference, yet their recorded positions in the validPairs file are the same, could it be that fragments, which are not actual PCR duplicates, are mistakenly identified and deleted as such? I look forward to your response!

nservant commented 2 months ago

Your understanding of how HiC-Pro filters duplicates is correct. However, I'm not sure I get your point when you say: "if their sequences are not identical but are very close, with only a few base pairs difference, yet their recorded positions in the validPairs file are the same". If the two reads are mapped to different loci, their positions will differ in the valid pairs file, so they will not be considered duplicates.
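
To make the behaviour concrete, here is a minimal sketch that runs the same sort keys and awk filter on three toy records. The read names, coordinates, and the temporary file names are made up for illustration, and the `-T`, `-S`, and `-m` options from the quoted command are dropped because the toy input is a single unsorted file. It assumes a `sort` that supports `-V` (e.g. GNU coreutils).

```shell
# Toy validPairs records (tab-separated): name, chr1, pos1, strand1, chr2, pos2, strand2.
# All names and coordinates below are hypothetical.
tmp=$(mktemp -d)
printf 'readA\tchr1\t1000\t+\tchr1\t5000\t-\n'  > "$tmp/toy.validPairs"
printf 'readB\tchr1\t1000\t+\tchr1\t5000\t-\n' >> "$tmp/toy.validPairs"   # same coordinates as readA
printf 'readC\tchr1\t1003\t+\tchr1\t5000\t-\n' >> "$tmp/toy.validPairs"   # pos1 shifted by 3 bp

# Sort on chr1, pos1, chr2, pos2, then print a record only when at least one
# of those four fields differs from the previously printed record.
sort -k2,2V -k3,3n -k5,5V -k6,6n "$tmp/toy.validPairs" | \
awk -F"\t" 'BEGIN{c1=0;c2=0;s1=0;s2=0}(c1!=$2 || c2!=$5 || s1!=$3 || s2!=$6){print;c1=$2;c2=$5;s1=$3;s2=$6}' \
  > "$tmp/toy.allValidPairs"

# List the read names that survived the filter.
cut -f1 "$tmp/toy.allValidPairs"
```

In this toy run, `readB` is dropped because its four compared fields match `readA`, while `readC` is kept because its first position differs by 3 bp. Note that the filter compares only columns 2, 3, 5, and 6, so the read sequence and the strand columns play no part in the decision.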