pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks

Base modifications test failing #1291

Closed (SoapGentoo closed this issue 3 months ago)

SoapGentoo commented 3 months ago

I'm trying to package pysam 0.22.1 with htslib 1.20 (so we can get rid of 1.18 in our repo) and have run into one base modifications failure:

______________________________________________________________________________ TestBaseModifications.testChebi _______________________________________________________________________________

self = <AlignedSegment_test.TestBaseModifications testMethod=testChebi>

    def testChebi(self):
        """reference bases should always be the same nucleotide
        """
        filename = os.path.join(BAM_DATADIR, "MM-chebi.bam")
        expect = {
            ("C", 0, "m"): [(6, 102), (17, 128), (20, 153), (31, 179), (34, 204)],
            ("N", 0, "n"): [(15, 212)],
            ("C", 0, 76792): [(19, 161), (34, 187)],
        }

        with pysam.AlignmentFile(filename, check_sq=False) as inf:
            r = next(iter(inf))
>           self.assertDictEqual(r.modified_bases, expect)
E           AssertionError: None is not an instance of <class 'dict'> : First argument is not a dictionary

expect     = {('C', 0, 'm'): [(6, 102), (17, 128), (20, 153), (31, 179), (34, 204)],
 ('C', 0, 76792): [(19, 161), (34, 187)],
 ('N', 0, 'n'): [(15, 212)]}
filename   = '/var/tmp/portage/sci-biology/pysam-0.22.1/work/pysam-0.22.1-python3.12/tests/pysam_data/MM-chebi.bam'
inf        = <pysam.libcalignmentfile.AlignmentFile object at 0x7faa179c41f0>
r          = <pysam.libcalignedsegment.AlignedSegment object at 0x7faa169f4640>
self       = <AlignedSegment_test.TestBaseModifications testMethod=testChebi>

tests/AlignedSegment_test.py:1077: AssertionError

jmarshall commented 3 months ago

If you avoid capturing the output, you will see that this is preceded by an error message from htslib:

[E::bam_parse_basemod2] *: Too many entries in ML tag

This data file contains invalid data, which is diagnosed by improvements in bam_parse_basemod2() in HTSlib 1.20. Updating these base modification data files to the versions in htslib and hts-specs corrects the problem.

Thanks for the report. If you want a minimal patch to apply to correct this test case, it would be

diff --git a/tests/pysam_data/MM-chebi.sam b/tests/pysam_data/MM-chebi.sam
index 62920ec..28774d5 100644
--- a/tests/pysam_data/MM-chebi.sam
+++ b/tests/pysam_data/MM-chebi.sam
@@ -1,2 +1,2 @@
 @CO    Separate m, h and N modifications
-*  0   *   0   0   *   *   0   0   AGCTCTCCAGAGTCGNACGCCATYCGCGCGCCACCA    *   Mm:Z:C+m,2,2,1,4,1;C+76792,6,7;N+n,15;  Ml:B:C,102,128,153,179,204,161,187,212,169
+*  0   *   0   0   *   *   0   0   AGCTCTCCAGAGTCGNACGCCATYCGCGCGCCACCA    *   Mm:Z:C+m,2,2,1,4,1;C+76792,6,7;N+n,15;  Ml:B:C,102,128,153,179,204,161,187,212
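
To see why the old record trips the "Too many entries in ML tag" check, it helps to count both tags: the Mm value in the diff above declares 5 + 2 + 1 = 8 modification calls, while the old Ml array carries 9 values (the patch drops the trailing 169). The following is a minimal sketch of that count in plain Python. It is not how HTSlib implements the check; the function names are hypothetical, and the parsing assumes the simple `base`, strand, code(s) layout seen in this record, with one probability per listed position per modification code.

```python
import re

# Hypothetical helpers for illustration only; not part of pysam or HTSlib.

def expected_ml_count(mm_value: str) -> int:
    """Number of probabilities the Mm/MM tag value implies."""
    total = 0
    for group in mm_value.split(";"):
        if not group:
            continue  # the trailing ';' leaves an empty final group
        head, *positions = group.split(",")
        # head looks like "C+m", "C+76792" or "N+n", optionally ending in '.' or '?'
        match = re.fullmatch(r"[A-Za-z][+-]([a-z]+|\d+)[.?]?", head)
        if match is None:
            raise ValueError(f"unrecognised modification header: {head!r}")
        codes = match.group(1)
        # Assumption: a numeric ChEBI id is one code; each letter is one code.
        n_codes = 1 if codes.isdigit() else len(codes)
        total += n_codes * len(positions)
    return total


def actual_ml_count(ml_value: str) -> int:
    """Number of values in a 'B' array such as 'C,102,128,...'."""
    # The first field is the array element type, not a probability.
    return len(ml_value.split(",")) - 1


mm = "C+m,2,2,1,4,1;C+76792,6,7;N+n,15;"
old_ml = "C,102,128,153,179,204,161,187,212,169"
new_ml = "C,102,128,153,179,204,161,187,212"

print(expected_ml_count(mm), actual_ml_count(old_ml), actual_ml_count(new_ml))
```

With the values from the diff, the old record has one more Ml entry than Mm declares, and the patched record matches.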