open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License

Dedup parameters #183

Closed: gdolsten closed this issue 1 year ago

gdolsten commented 1 year ago

Is there a reason why this dedup option defaults to 3?

"Pairs with both sides mapped within this distance (bp) from each other are considered duplicates. [dedup option] [default: 3]"

In particular, is there a reason why deduplication doesn't remove only perfectly identical read pairs?

Phlya commented 1 year ago

I am not sure it has ever been published, but one can observe a slight enrichment of pairs within ~3 bp of each other, which suggests that some steps of the library prep might add or remove 1-3 bp at the ends of fragments. I am not sure how widespread this is or how it varies between protocols. That said, it wasn't a huge effect, and the vast majority of duplicates are exact matches, so it's fine to set the distance to 0.
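For illustration only, the duplicate criterion described in the quoted help text can be sketched in Python as below. This is not pairtools' actual code: the `Pair` type and the `is_duplicate` function are hypothetical names, and the sketch assumes that two pairs must also share chromosomes and strands on both sides to count as duplicates.

```python
from typing import NamedTuple


class Pair(NamedTuple):
    """Hypothetical minimal record for one contact (not a pairtools type)."""
    chrom1: str
    pos1: int
    strand1: str
    chrom2: str
    pos2: int
    strand2: str


def is_duplicate(a: Pair, b: Pair, max_mismatch: int = 3) -> bool:
    """Return True if both sides of `a` and `b` map within `max_mismatch` bp.

    Assumption: chromosomes and strands must match exactly on both sides.
    """
    same_loci = (a.chrom1, a.strand1, a.chrom2, a.strand2) == (
        b.chrom1, b.strand1, b.chrom2, b.strand2
    )
    return (
        same_loci
        and abs(a.pos1 - b.pos1) <= max_mismatch
        and abs(a.pos2 - b.pos2) <= max_mismatch
    )


a = Pair("chr1", 1000, "+", "chr1", 5000, "-")
b = Pair("chr1", 1002, "+", "chr1", 4997, "-")  # shifted by 2 bp and 3 bp

print(is_duplicate(a, b, max_mismatch=3))  # within tolerance on both sides
print(is_duplicate(a, b, max_mismatch=0))  # exact matches only
```

With a tolerance of 0, only pairs with identical positions on both sides are treated as duplicates, which is the exact-match behaviour asked about above.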

gdolsten commented 1 year ago

OK, great, thanks. Pairtools still keeps one representative from each set of duplicates, though, correct?

Phlya commented 1 year ago

Of course. It keeps the first one it encounters.
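A rough Python sketch of the "keep the first one encountered" behaviour follows. It is a simplified illustration rather than the pairtools implementation: the function name `dedup_keep_first` and the tuple layout are hypothetical, the scan is quadratic and ignores sorting, and each new pair is compared only against the representatives already kept.

```python
def dedup_keep_first(pairs, max_mismatch=0):
    """Keep the first pair of each duplicate group, in input order.

    Each pair is a tuple (read_id, chrom1, pos1, strand1, chrom2, pos2, strand2).
    Assumption: duplicates share chromosomes and strands and lie within
    `max_mismatch` bp of a kept pair on both sides.
    """
    kept = []
    for pair in pairs:
        _, c1, p1, s1, c2, p2, s2 = pair
        is_dup = any(
            (c1, s1, c2, s2) == (k[1], k[3], k[4], k[6])
            and abs(p1 - k[2]) <= max_mismatch
            and abs(p2 - k[5]) <= max_mismatch
            for k in kept
        )
        if not is_dup:
            kept.append(pair)  # first occurrence becomes the representative
    return kept


pairs = [
    ("r1", "chr1", 100, "+", "chr1", 900, "-"),
    ("r2", "chr1", 100, "+", "chr1", 900, "-"),  # exact duplicate of r1
    ("r3", "chr1", 102, "+", "chr1", 901, "-"),  # near-duplicate of r1
    ("r4", "chr2", 100, "+", "chr1", 900, "-"),  # different chromosome
]

print([p[0] for p in dedup_keep_first(pairs, max_mismatch=0)])
print([p[0] for p in dedup_keep_first(pairs, max_mismatch=3)])
```

In this toy input, a tolerance of 0 drops only the exact copy `r2`, while a tolerance of 3 also drops the near-duplicate `r3`; in both cases `r1` survives because it comes first.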

Phlya commented 1 year ago

I assume this is solved now; feel free to reopen!