open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
99 stars 32 forks source link

Feature request - split parse from filters #217

Closed bskubi closed 6 months ago

bskubi commented 6 months ago

To facilitate reprocessing data, it would be helpful if the pairtools suite permitted filtering an existing .pairs file using the same filters available in the parse and parse2 commands, such as --max-insert-size or max-inter-align-gap.

One way to do this might be to permit parse/parse2 to accept .pairs as input. Another way might be to split the filtering features into a separate command, and have the parse/parse2 commands bundle sam/bam -> pairs parsing with the separate filtering command.

I have an alternative solution which would be to reverse-parse .pairs to a .sam format, then stream it into parse/parse2. That's pretty hackish, so please let me know if you're seriously considering implementing this feature. Thanks for making this suite of useful tools!

Phlya commented 6 months ago

Can't they be implemented using pairtools select? https://pairtools.readthedocs.io/en/latest/cli_tools.html#pairtools-select

Phlya commented 6 months ago

Well, I suppose information needed for --max-inter-align-gap is lost in the pairs format, so I guess it's not possible...

golobor commented 6 months ago

Hi Benjamin! Thank you for the suggestion! At the moment, I am not sure if implementing these options as a filter would be possible. The issue is that these flags (-max-insert-size, --max-molecule-size, and --max-inter-align-gap) control how alignments are interpreted and parsed into pairs.

Specifically, --max-insert-size and --max-molecule-size control how pairs are "rescued". "Rescue" is the merging of two inner-most alignments on the opposite sides of the read if they look like they come from a single DNA fragment. Depending of the value of these two flags, we may have either two alignments that will together form a pair, or, a single alignment that would form a pair with another adjacent alignment. These outcomes seem mutually exclusive to me, so at the moment I don't see how we could convert these options into columns and perform filtering. --max-inter-align-gap has the same issue - it controls if gaps between alignments should be converted into "empty" aligments, that would form their own pairs with adjacent alignments. Again, I do not see how we could perform filtering on this. I am very open to any suggestions, though!

bskubi commented 6 months ago

Thanks for the quick response! What you say makes sense. It sounds like it's best to think of parsing .sam/.bam to .pairs as more or less irreversible.