open2c / pairtools

Extract 3D contacts (.pairs) from sequencing alignments
MIT License
104 stars 32 forks source link

collect mapq stats in the pairs-stats if possible #80

Closed sergpolly closed 2 years ago

sergpolly commented 4 years ago

collect mapq stats in our stats if possible - would be useful/helpful for #78 - as there seems to be no easy drop-in mapping quality summary collector ...

maybe check samtools module though - https://multiqc.info/docs/#samtools

agalitsyna commented 2 years ago

pairtools select + pairtools stats combination covers this use case, see example distiller version that implements filters: https://github.com/open2c/distiller-nf/blob/filter_stats/distiller.nf

Phlya commented 2 years ago

I actually think it shouldn't be hard to implement, and the output would be less confusing than separate select + stats, which produces fewer total reads for mapq30, for example, and then requires stitching two separate files to get proper stats in one place... Either hardcode splitting by different mapq values, or for arbitrary filtering criteria (thinking the distiller filters!). Shall we try to squeeze it in this release?

agalitsyna commented 2 years ago

Yes, if you see an easy way to do it, than there are no restrictions!

The first problem that I anticipate is that we have a hard-coded formatting of stats dictionary into either tsv or yaml. With addition of variable output depending on mapq we'll probably have to invent something more general than what we have right now: https://github.com/open2c/pairtools/blob/40dd81c989a2fce9674ee27e96a08f1995e1d659/pairtools/lib/stats.py#L538

The second problem is that we probably want not only mapq filter but filters on other columns as well. My idea that pairtools stats can take a set of columns and conditions how to split non-catgorical columns into groups (maybe a groupby expression?). If there is a pythonic way to explain how to group the contacts, that will be general enough to perform any sophisticated type of filtering.

Phlya commented 2 years ago

Formatting for saving is indeed annoying, and I actually think we should organize the original dictionary as we want to save it, and at least with YAML we could just dump it as it is...

Re different filters: I would just use query() to subset pairs, it should be able to use exactly the same expression as the pairtools select argument I think. I'm not sure how to best pass it in the CLI to give names to the filters as well...

agalitsyna commented 2 years ago

Some of the fields in stats dictionary are numpy arrays that have to be formatted before posting into YAML. Just a warning if you decide to change it.

Re different filters: query is not the same as groupby. I guess by different stats for mapq you meant that we want separate stats for mapq<30 and mapq>30, right? If you need only query, then it's good, you may re-use the pairtools select engine: https://github.com/open2c/pairtools/blob/40dd81c989a2fce9674ee27e96a08f1995e1d659/pairtools/cli/select.py#L260-L270

Phlya commented 2 years ago

This is now implemented with arbitrary filters in pairtools stats!