nservant / HiC-Pro

HiC-Pro: An optimized and flexible pipeline for Hi-C data processing

How to delete duplicate pairs when using hicpro at high resolution? #636

Open xingql983 opened 2 weeks ago

xingql983 commented 2 weeks ago

Because of our library construction method, we can obtain contact matrices at resolutions finer than 500 bp (bin sizes below 500 bp). At the same time, the insert sizes of our reads are between 200 and 300 bp. This led me to wonder how HiC-Pro handles duplicates arising from such short fragments. I checked the code used in the HiC-Pro workflow and found this segment, which deals with duplicate pairs:

sort -T ${TMP_DIR} -S 50% -k2,2V -k3,3n -k5,5V -k6,6n -m ${IN_DIR}/${RES_FILE_NAME}/*.validPairs | \
awk -F"\t" 'BEGIN{c1=0;c2=0;s1=0;s2=0}(c1!=$2 || c2!=$5 || s1!=$3 || s2!=$6){print;c1=$2;c2=$5;s1=$3;s2=$6}' > ${DATA_DIR}/${RES_FILE_NAME}/${RES_FILE_NAME}.allValidPairs
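
Read literally, the awk filter above prints a line only when fields 2, 3, 5 and 6 differ from those of the last printed line, so it relies on the preceding sort to place identical coordinates next to each other. Below is a minimal Python sketch of that logic, not HiC-Pro code. It assumes tab-separated records where fields 2/3 and 5/6 hold the chromosome and position of each mate, as the sort keys suggest; the function name and the three example records are hypothetical.

```python
def dedup_valid_pairs(lines):
    """Keep a line unless its (chr1, pos1, chr2, pos2) matches the last kept line.

    Mirrors the awk condition on $2, $3, $5, $6. Input must already be sorted
    on those fields. Values are compared as plain strings here, which is a
    simplification of awk's comparison rules.
    """
    kept = []
    prev = ("0", "0", "0", "0")  # same starting state as awk's BEGIN block
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = (fields[1], fields[2], fields[4], fields[5])
        if key != prev:
            kept.append(line)
            prev = key
    return kept


# Hypothetical records: name, chr1, pos1, strand1, chr2, pos2, strand2
pairs = [
    "readA\tchr1\t1000\t+\tchr1\t5000\t-",
    "readB\tchr1\t1000\t+\tchr1\t5000\t-",  # same coordinates as readA
    "readC\tchr1\t1000\t+\tchr1\t5003\t-",  # second mate shifted by 3 bp
]
result = dedup_valid_pairs(pairs)
for line in result:
    print(line.split("\t")[0])
```

In this sketch readB is dropped and readC is kept: only the four coordinate fields enter the comparison, so read names, strands and the read sequences themselves play no role in deciding what counts as a duplicate.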

My understanding is that this script removes PCR duplicates based on the chromosomal positions to which the fragments in the validPairs records have aligned. Is that correct? I am also concerned about potential biases with short fragments. For instance, if two read pairs have sequences that are not identical but differ by only a few base pairs, yet their recorded positions in the validPairs file are the same, could pairs that are not true PCR duplicates be mistakenly identified as duplicates and deleted? I look forward to your response!