adding stats to dedup - Githubissues

sergpolly commented 7 years ago

In order to address https://github.com/mirnylab/distiller-nf/issues/80 we agreed to add stats to the dedup.

And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like out_stat.add_pair(algn2, algn1, pair_type)

Techincal Qs:

should we parse line_buffer[i] right next to where we're writing it to outstream?
should we instead activate cols_buffer[i] (used for mark duplicates) for stats and skip parsing line_buffer[i]?
stats-related: should we change stats-API, and instead of out_stat.add_pair(algn2, algn1, pair_type), say out_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type), where algn={'chrom':c,'pos':p,'strand':s}? Such an API change would simplify stats module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78 It was written the way it is right now, in order to please and simplify parse-code, i.e. avoid unpacking align dictionaries there.

What do you @golobor , @nvictus guys think ?

golobor commented 7 years ago

awesome suggestions!

1) the suggested change in the stats interface is great, since it makes the API much more transparent - it's nice to show explicitly which characteristics of the alignment we actually use when we calculate statistics. It will simplify re-using stats in other contexts.

2) indeed, using cols_buffer instead of line_buffer to save one splitting operation is nice! The issue, however, is that we do not add unmapped lines to cols_buffer, so we'd have to add it in a separate command.

3) finally, the question is where to calculate the duplicate statistics. The two options are: a) keep it the same way it is right now - calculate dup/dedup statistics in dedup and insert the final data into the stats object. pro: it's a minimal change from what we have right now, cons: it spreads calculation of statistics across different modules b) rewrite stats to calculate dup/dedup numbers inside it, by checking the pair_type flag (we tag duplicates with a DD flag). pros: it nicely localizes statistics calculation in a single module, cons: we won't be able to only output dup/dedup statistics, and will have to calculate the full statistics every time we do dedup. Also, right now tagging of duplicate reads with DD is optional and is performed only when we output duplicates into a separate file. With the proposed scheme, we'd have to tag them everytime, which, probably is not a big deal. Not sure what the optimal solution is. @nvictus ?..

sergpolly commented 6 years ago

@golobor So I've finally got to something about this issue, but I started small and addressed only (1)-item from your list. I'm still digesting (2) and (3)

With (3b) do you mean we should get rid of https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and "hide" everything inside stats, by enabling add_pair to keep track of unmapped/dups ? In this case, cols_buffer is not sufficient as it does not contain "bad" pairs ... Maybe in this case we can use parsed cols list, since we'll be feeding all pairs to add_pairs regardless if it's "normal" or not : https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L278

(3a) in that case, imply keeping https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318

and instead using add_pair to keep track of "normal" pairs only, for which cols_buffer seems to be sufficient? But we'd have to accumulate that buffer not just for if mark_dups:, but rather always ...

sergpolly commented 6 years ago

I personally favor (3b) - code would be more consistent, kind of. Stats belong to one place, kind of - would be easier to address https://github.com/mirnylab/pairsamtools/issues/54 by relying on a rigid structure of stats object, by rigid I mean with the unmapped and dups files predefined @nvictus , @golobor what do you think?

golobor commented 6 years ago

re: 3(b), yes, your understanding is right, calculate all statistics via add_stats. I too now think that we should with (3b), b/c it keeps the code tidy(er).

open2c / pairtools

adding stats to dedup #55