pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
776 stars 274 forks source link

Support for new modified base tag #1123

Open marcus1487 opened 2 years ago

marcus1487 commented 2 years ago

The modified base tags (MM/ML) were updated not too long ago and this new tag format is breaking the current implementation. The update adds a . or ? to the end of the MM tag description (see section 1.7 in sam tags spec). To copy the text here the spec, the new tag is formatted as MM:Z:([ACGTUN][-+]([a-z]+|[0-9]+)[.?]?(,[0-9]+)*;)*. Note the additional [.?] in this as opposed to the original version. It seems that this should result in an update to this expected pattern in the code.

marcus1487 commented 2 years ago

Is there any update on support for this feature. The interface in pysam to support the modified base tag is very useful, but the lack of this support makes it very hard to use going forward.

starsyi commented 11 months ago

Is there any update on support for this feature?

marcus1487 commented 11 months ago

The lack of support for this is causing invalid results for many users attempting to parse modified base tags with pysam. Specifically with the implicit . mode highly confident canonincal calls are dropped from the modified_bases output and there is no indication that these have been dropped. This is resulting in very high observed methylation rates when this is not the case in the sample. And there is no alternative for users but to parse these tags themselves. Without support for this we will have to drop pysam usage for modified base analysis. I appreciate that this is a somewhat complex issue (e.g. what probability should be returned to the user when they have been omitted/included only implicitly?), but a fix of some sort would be much better than the current state of the API.

jmarshall commented 11 months ago

The “expected pattern” you pointed to in your initial comment is in HTSlib code. This code was updated to allow the new [.?] flags in htslib 1.16, which was imported in pysam 0.20.0 released in October 2022. So since then pysam should be accepting such syntax.

The question then would be whether the pysam AlignedSegment.modified_bases functionality exposes those flags in a useful way. As someone using this functionality, you would be best placed to say what the desired functionality for this API would be. Such details would be a useful contribution.

Perhaps you or your employer are offering to sponsor the development of improved pysam basemod functionality. If that is not the case, I am having difficulty interpreting parts of your comment — notably Without support for this we will have to drop pysam usage for modified base analysis — as anything other than the sort of unkind vaguely threatening off-topic comment that causes maintainer demotivation and burnout and will eventually lead to you being banned from this repository.

marcus1487 commented 11 months ago

I apologizes if this was interpreted as a threat in any way as this was not the intent. I truly appreciate all the hard work that goes into maintaining community tools and do not wish to diminish this effort in any way. We would certainly be happy to look into contributing to the development of this feature.

To the issue at hand, I understand that the tag with these specifiers is parsed and does not error given these new format tags. The issue is the returned modified base probabilities. From the SAM specification "When this flag is not present, or it is ‘.’, these bases should be assumed to have low probability of modification." The current implementation in htslib and pysam returns no entry for these skipped positions when the specification states that these positions should be assumed to have low probability of modification. The question now becomes how to implement this assumption. I would personally be happy for these probabilities to be returned as 0 (or equivalent to the 0 bin value where applicable). It would be good to discuss issues that other users or maintainers might have with this choice. Would adding this feature require a new attribute/array of the return value indicating where probabilities were inferred in the returned array? Would a dedicated choice aside from the 0 probability value make more sense? Once we've made these design decisions we can look into how best to look into implement this feature.

cjw85 commented 11 months ago

See previous htslib PR here: https://github.com/samtools/htslib/issues/1418#issuecomment-1090196413

We (the nebulous conglomerate of @jts, @jkbonfield and I) left the matter at the API not being completely broken, but did not have time to come back and make the API useful in the sense of conveying more information about implied calls.

jmarshall commented 11 months ago

In that case, that may mean that establishing conventions in this area is more an htslib issue than a pysam issue.

hts_base_mod_state is an opaque type, so no-one will be inspecting state->implicit directly. It is however reported by bam_mods_query_type() and bam_mods_queryi(), which pysam could use.