Closed sergpolly closed 6 years ago
awesome suggestions!
1) the suggested change in the stats interface is great, since it makes the API much more transparent - it's nice to show explicitly which characteristics of the alignment we actually use when we calculate statistics. It will simplify re-using stats in other contexts.
2) indeed, using cols_buffer instead of line_buffer to save one splitting operation is nice! The issue, however, is that we do not add unmapped lines to cols_buffer, so we'd have to add it in a separate command.
3) finally, the question is where to calculate the duplicate statistics. The two options are: a) keep it the same way it is right now - calculate dup/dedup statistics in dedup and insert the final data into the stats object. pro: it's a minimal change from what we have right now, cons: it spreads calculation of statistics across different modules b) rewrite stats to calculate dup/dedup numbers inside it, by checking the pair_type flag (we tag duplicates with a DD flag). pros: it nicely localizes statistics calculation in a single module, cons: we won't be able to only output dup/dedup statistics, and will have to calculate the full statistics every time we do dedup. Also, right now tagging of duplicate reads with DD is optional and is performed only when we output duplicates into a separate file. With the proposed scheme, we'd have to tag them everytime, which, probably is not a big deal. Not sure what the optimal solution is. @nvictus ?..
@golobor So I've finally got to something about this issue, but I started small and addressed only (1)-item from your list. I'm still digesting (2) and (3)
With (3b) do you mean we should get rid of https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318
and "hide" everything inside stats
, by enabling add_pair
to keep track of unmapped/dups ?
In this case, cols_buffer
is not sufficient as it does not contain "bad" pairs ...
Maybe in this case we can use parsed cols
list, since we'll be feeding all pairs to add_pairs
regardless if it's "normal" or not :
https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L278
(3a) in that case, imply keeping https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L289 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L316 https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L318
and instead using add_pair
to keep track of "normal" pairs only, for which cols_buffer
seems to be sufficient? But we'd have to accumulate that buffer not just for if mark_dups:
, but rather always ...
I personally favor (3b) - code would be more consistent, kind of. Stats belong to one place, kind of - would be easier to address https://github.com/mirnylab/pairsamtools/issues/54
by relying on a rigid structure of stats object, by rigid I mean with the unmapped
and dups
files predefined @nvictus , @golobor what do you think?
re: 3(b), yes, your understanding is right, calculate all statistics via add_stats
.
I too now think that we should with (3b), b/c it keeps the code tidy(er).
In order to address https://github.com/mirnylab/distiller-nf/issues/80 we agreed to add stats to the
dedup
.And this: https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_dedup.py#L315 looks like a proper place to add something like
out_stat.add_pair(algn2, algn1, pair_type)
Techincal Qs:
line_buffer[i]
right next to where we're writing it tooutstream
?cols_buffer[i]
(used for mark duplicates) for stats and skip parsingline_buffer[i]
?stats
-API, and instead ofout_stat.add_pair(algn2, algn1, pair_type)
, sayout_stat.add_pair(c2, p2, s2, c1, p1, s1, pair_type)
, wherealgn={'chrom':c,'pos':p,'strand':s}
? Such an API change would simplifystats
module itself, here https://github.com/mirnylab/pairsamtools/blob/e725dbbd037f169a5def3891783f7b2cf3922463/pairsamtools/pairsam_stats.py#L78 It was written the way it is right now, in order to please and simplifyparse
-code, i.e. avoid unpackingalign
dictionaries there.What do you @golobor , @nvictus guys think ?