Closed sergpolly closed 2 years ago
pairtools select + pairtools stats combination covers this use case, see example distiller version that implements filters: https://github.com/open2c/distiller-nf/blob/filter_stats/distiller.nf
I actually think it shouldn't be hard to implement, and the output would be less confusing than separate select + stats, which produces fewer total reads for mapq30, for example, and then requires stitching two separate files to get proper stats in one place... Either hardcode splitting by different mapq values, or for arbitrary filtering criteria (thinking the distiller filters!). Shall we try to squeeze it in this release?
Yes, if you see an easy way to do it, than there are no restrictions!
The first problem that I anticipate is that we have a hard-coded formatting of stats dictionary into either tsv or yaml. With addition of variable output depending on mapq we'll probably have to invent something more general than what we have right now: https://github.com/open2c/pairtools/blob/40dd81c989a2fce9674ee27e96a08f1995e1d659/pairtools/lib/stats.py#L538
The second problem is that we probably want not only mapq
filter but filters on other columns as well. My idea that pairtools stats
can take a set of columns and conditions how to split non-catgorical columns into groups (maybe a groupby expression?). If there is a pythonic way to explain how to group the contacts, that will be general enough to perform any sophisticated type of filtering.
Formatting for saving is indeed annoying, and I actually think we should organize the original dictionary as we want to save it, and at least with YAML we could just dump it as it is...
Re different filters: I would just use query()
to subset pairs, it should be able to use exactly the same expression as the pairtools select
argument I think. I'm not sure how to best pass it in the CLI to give names to the filters as well...
Some of the fields in stats dictionary are numpy arrays that have to be formatted before posting into YAML. Just a warning if you decide to change it.
Re different filters: query
is not the same as groupby
. I guess by different stats for mapq you meant that we want separate stats for mapq<30
and mapq>30
, right?
If you need only query
, then it's good, you may re-use the pairtools select
engine: https://github.com/open2c/pairtools/blob/40dd81c989a2fce9674ee27e96a08f1995e1d659/pairtools/cli/select.py#L260-L270
This is now implemented with arbitrary filters in pairtools stats
!
collect mapq stats in our stats if possible - would be useful/helpful for #78 - as there seems to be no easy drop-in mapping quality summary collector ...
maybe check samtools module though - https://multiqc.info/docs/#samtools