pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

[Feature request] option to ignore soft-clipped parts in get_aligned_pairs() #1292

Open LaraFuhrmann opened 2 months ago

LaraFuhrmann commented 2 months ago

Dear developers,

thanks a lot for this great package, we are using it extensively for our scripts.

We are wondering if it would be possible to add the option to ignore or "clip-out" the soft-clipped parts of the reads in the function get_aligned_pairs?

Or is there an easy other way to do that without checking the cigar strings?

Thanks a lot, Lara

DrYak commented 2 months ago

I've added PR #1293 to cover this need, and expanded the tests to cover this new code path.

jmarshall commented 2 months ago

I don't believe there is a good way to do this at the moment, other than by looking at the CIGAR strings by hand as you suggest.

Thank you for your PR @DrYak but I would like to consider a more general approach instead: an option (e.g. with_cigar) for get_aligned_pairs() that would add another entry to the returned tuples that would tell you the CIGAR operator that each particular position tuple corresponds to. Probably giving just the enum CIGAR_OPS value, not the length as well.

So @LaraFuhrmann would ignore the tuples that say CSOFT_CLIP. And the facility may be useful for other use cases.

DrYak commented 2 months ago

@jmarshall that's indeed a more elegant solution covering more uses case.

I had at some point though about such an approach but went for the above PR to keep the changeset small.

I'll loko into your suggestion, but I fear more tests will need to be written :sweat_smile:

DrYak commented 2 months ago

I had a bit of time this evening after work, and wrote another quick PR.

consider a more general approach instead: an option (e.g. with_cigar) for get_aligned_pairs() that would add another entry to the returned tuples that would tell you the CIGAR operator that each particular position tuple corresponds to.

Now available in PR #1294 Tests updated (but much more work than the previous PR :sweat_smile: )

DrYak commented 1 month ago

Hi, @jmarshall! Did you had time to check if my proposal of with_cigar (as you mentioned above) looks good to you? (And are you happy with the tests?) Thanks!

DrYak commented 3 weeks ago

Hi @jmarshall! Just a gentle reminder to ask if you think you could find some time to have a look at PR #1294 ? Even if you end-up not merging that but developing it on your own, it would greatly help Lara and me to know if this is the direction PySam might go in the future, so we could plan our software accordingly. Thank you very much for your patience.