pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

Does not parse MM tag properly with when '.' ambiguity character is present #1269

Closed ddubocan closed 1 month ago

ddubocan commented 6 months ago

Reads with modifications were basecalled using dorado v5.3.

Here is an example of the MM/ML tag from a read:

MM:Z:A+a.,1,0,0,0,0,0,0,1,0,1,3,0,2,0,3,0,1,3,1,26,3,0,1,0,19,5,0,1,4,0,1,0,0,1,0,0,0,1,6,14,3,2,0,0,0,5,39,3,10,1,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,4,0,0,0,0,0,0,0,0,1,0,6,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0;C+h.,12,9,15,3,1,1,4,0,0,3,0,5,0,1,8,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0;C+m.,12,9,15,3,1,1,4,0,0,3,0,5,0,1,8,0,0,0,0,1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0; ML:B:C,28,39,88,198,248,178,37,12,97,21,22,33,207,114,14,14,34,21,12,15,17,12,12,19,37,25,143,129,13,25,13,15,14,233,30,29,41,38,34,14,249,86,215,242,255,255,15,46,13,28,152,38,64,67,30,15,18,117,99,66,64,255,152,71,244,227,171,75,172,189,13,15,255,198,242,253,25,20,14,44,72,168,21,42,171,231,192,67,232,52,47,78,17,27,48,69,66,55,37,35,15,145,175,47,37,63,35,29,60,112,85,105,70,176,181,135,151,92,215,255,59,39,255,250,248,72,252,129,239,255,255,250,231,236,47,26,64,171,215,252,255,57,5,2,12,5,10,2,13,26,13,4,5,3,5,1,22,12,13,0,1,29,12,0,31,46,138,126,15,93,61,25,20,28,37,35,29,94,25,230,238,123,14,15,0,14,12,18,13,20,12,40,31,17,12,13,14,14,26,20,16,19,14,254,73,13,255,8,19,20,27,17,161,32,30,24,34,218,42,25,52,21,3,16,18,21,24,255

When I try to extract basemods with read.modified_bases I get this error: [E::bam_parse_basemod] Insufficient number of entries in ML tag

I can fix this error by running:

samtools view -h bam | sed -e 's/A+a\./A+a/g' -e 's/C+m\./C+m/g' -e 's/C+h\./C+h/g' | samtools view -b > fixed_bam

So I am assuming that under-the-hood as of right now pysam is not parsing the ambiguity character in the MM tag properly.

jmarshall commented 3 months ago

I am unable to reproduce this with current pysam. Please specify what version of pysam you are using and provide a complete SAM record (including FLAG and SEQ in particular) demonstrating the problem.

jmarshall commented 1 month ago

HTSlib 1.16 implemented support for these . and ? ambiguity suffixes. Pysam 0.20.0 onwards wrap HTSlib 1.16 or later; pysam 0.19.0 and 0.19.1 wrapped HTSlib 1.15 and 1.15.1, which included basic MM tag support but not support for these ambiguity suffixes. (Furthermore pysam 0.22.0 and later never produce an error message with the text you reported, so you must have been using a version of pysam prior to that.)

When I test with files containing these ambiguity suffixes, pysam 0.20.0 and later produce appropriate results while 0.19.0 and 0.19.1 produce the error message you saw.

Hence I believe you were using either pysam 0.19.0 or pysam 0.19.1, and can resolve your issue by upgrading to a version of pysam released in October 2022 or later that actually supports these ambiguity suffixes.