open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

adding stats to dedup #55

Closed sergpolly closed 6 years ago

sergpolly commented 7 years ago

In order to address https://github.com/mirnylab/distiller-nf/issues/80 we agreed to add stats to the dedup.

And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)

Techincal Qs:

What do you @golobor , @nvictus guys think ?

golobor commented 7 years ago

awesome suggestions!

1) the suggested change in the stats interface is great, since it makes the API much more transparent - it's nice to show explicitly which characteristics of the alignment we actually use when we calculate statistics. It will simplify re-using stats in other contexts.

2) indeed, using cols_buffer instead of line_buffer to save one splitting operation is nice! The issue, however, is that we do not add unmapped lines to cols_buffer, so we'd have to add it in a separate command.

3) finally, the question is where to calculate the duplicate statistics. The two options are: a) keep it the same way it is right now - calculate dup/dedup statistics in dedup and insert the final data into the stats object. pro: it's a minimal change from what we have right now, cons: it spreads calculation of statistics across different modules b) rewrite stats to calculate dup/dedup numbers inside it, by checking the pair_type flag (we tag duplicates with a DD flag). pros: it nicely localizes statistics calculation in a single module, cons: we won't be able to only output dup/dedup statistics, and will have to calculate the full statistics every time we do dedup. Also, right now tagging of duplicate reads with DD is optional and is performed only when we output duplicates into a separate file. With the proposed scheme, we'd have to tag them everytime, which, probably is not a big deal. Not sure what the optimal solution is. @nvictus ?..

sergpolly commented 6 years ago

@golobor So I've finally got to something about this issue, but I started small and addressed only (1)-item from your list. I'm still digesting (2) and (3)

With (3b) do you mean we should get rid of https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and "hide" everything inside stats, by enabling add_pair to keep track of unmapped/dups ? In this case, cols_buffer is not sufficient as it does not contain "bad" pairs ... Maybe in this case we can use parsed cols list, since we'll be feeding all pairs to add_pairs regardless if it's "normal" or not : https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L278

(3a) in that case, imply keeping https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and instead using add_pair to keep track of "normal" pairs only, for which cols_buffer seems to be sufficient? But we'd have to accumulate that buffer not just for if mark_dups:, but rather always ...

sergpolly commented 6 years ago

I personally favor (3b) - code would be more consistent, kind of. Stats belong to one place, kind of - would be easier to address https://github.com/mirnylab/pairsamtools/issues/54 by relying on a rigid structure of stats object, by rigid I mean with the unmapped and dups files predefined @nvictus , @golobor what do you think?

golobor commented 6 years ago

re: 3(b), yes, your understanding is right, calculate all statistics via add_stats. I too now think that we should with (3b), b/c it keeps the code tidy(er).