pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
773 stars 274 forks source link

fetch gives wrong results on MD tag when read CRAM files with multiple_iterators=True #1290

Open fanxinping opened 3 months ago

fanxinping commented 3 months ago

When we covnert SAM/BAM to CRAM, we use store_nm=1 and store_md=1 to store NM/MD tags in the CRAM file. When reading CRAM with pysam, decode_md=0 is added to disable automatic calculation of NM/MD tags, allowing it to retrieve the NM/MD tag values stored in the CRAM file. However, for the fetch method, setting multiple_iterators to True or False will yield different NM/MD tags.

Below is a simple example of a SAM file(test.sam). READ1, READ2 and READ4 have MD tags, but READ3 and READ5 not.

@HD     VN:1.5  SO:coordinate                                                                                                      
@SQ     SN:chrM LN:16571
READ1   99      chrM    1       60      21S80M  =       177     277     CTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGC   FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFF:FFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:80 MC:Z:101M       AS:i:80 XS:i:71 XA:Z:chr17,+22020706,101M,6;
READ2   163     chrM    1       60      1S100M  =       91      191     GGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG   FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF   NM:i:0  MD:Z:100        MC:Z:101M       AS:i:100        XS:i:61
READ3   99      chrM    1       60      2S84M   =       1       84      TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF  MC:Z:2S84M      AS:i:84 XS:i:56
READ4   81      chrM    1       50      15S84M  =       16373   16290   TAAGACATCCCGATGGATCACAGGTCTATCACCCTATTACCCACTCACGGGAGCTCTCCATGCATTTGGTATTGTCGTCTGGGGGGTGTGCACGCGATA     FF:F,F,FF,:F:FF,F,:,FF,:F:FF,FFFFF:FFFF,F:FF,,FFFF,FF:FFFFFF,F:F,F:F::F,F,F:FFFFFFFFF,F:,FF,FFFFFF,     NM:i:2  MD:Z:24A33T25   MC:Z:101M       AS:i:74 XS:i:54
READ5   147     chrM    1       60      2S84M   =       1       -84     TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  :FFFFFFFFFFFFFFFFFFFFFF:FFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFF  MC:Z:2S84M      AS:i:84 XS:i:56

We can use samtools view to convert this SAM file to CRAM file and store NM/MD tag verbatim. Please note that tabs might be converted to spaces if you directly copy the example above.

samtools view -C -T ucsc.hg19.asta --output-fmt-option store_md=1 --output-fmt-option store_nm=1 -o test.cram test.sam
samtools index test.cram

Then, we use fetch to get reads

  1. without multiple_iterators=True
    import pysam
    f = pysam.AlignmentFile('test.cram', 'rc', format_options=[b"decode_md=0"])
    for read in f.fetch("chrM", 1, 2):
    print(read)
    f.close()

    and we get:

    READ1   99  #0  1   60  21S80M  #0  177 277 CTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGC   array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('NM', 0), ('MD', '80'), ('MC', '101M'), ('AS', 80), ('XS', 71), ('XA', 'chr17,+22020706,101M,6;')]
    READ2   163 #0  1   60  1S100M  #0  91  191 GGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG   array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('NM', 0), ('MD', '100'), ('MC', '101M'), ('AS', 100), ('XS', 61)]
    READ3   99  #0  1   60  2S84M   #0  1   84  TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('MC', '2S84M'), ('AS', 84), ('XS', 56)]
    READ4   81  #0  1   50  15S84M  #0  16373   16290   TAAGACATCCCGATGGATCACAGGTCTATCACCCTATTACCCACTCACGGGAGCTCTCCATGCATTTGGTATTGTCGTCTGGGGGGTGTGCACGCGATA array('B', [37, 37, 25, 37, 11, 37, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 11, 25, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 11, 37, 25, 37, 37, 11, 11, 37, 37, 37, 37, 11, 37, 37, 25, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 11, 37, 25, 37, 25, 25, 37, 11, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 11, 37, 37, 11, 37, 37, 37, 37, 37, 37, 11])    [('NM', 2), ('MD', '24A33T25'), ('MC', '101M'), ('AS', 74), ('XS', 54)]
    READ5   147 #0  1   60  2S84M   #0  1   -84 TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  array('B', [25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('MC', '2S84M'), ('AS', 84), ('XS', 56)]

    READ3 and READ5 don't have MD tags, which is expected.

  2. with multiple_iterators=True
    import pysam
    f = pysam.AlignmentFile('test.cram', 'rc', format_options=[b"decode_md=0"])
    for read in f.fetch("chrM", 1, 2, multiple_iterators=True):
    print(read)
    f.close()

    and we get:

    READ1   99  #0  1   60  21S80M  #0  177 277 CTTAAATAAGACATCACGATGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGC   array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('NM', 0), ('MD', '80'), ('MC', '101M'), ('AS', 80), ('XS', 71), ('XA', 'chr17,+22020706,101M,6;')]
    READ2   163 #0  1   60  1S100M  #0  91  191 GGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATAGCATTGCGAGACGCTG   array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('NM', 0), ('MD', '100'), ('MC', '101M'), ('AS', 100), ('XS', 61)]
    READ3   99  #0  1   60  2S84M   #0  1   84  TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  array('B', [37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('MC', '2S84M'), ('AS', 84), ('XS', 56), ('MD', '84'), ('NM', 0)]
    READ4   81  #0  1   50  15S84M  #0  16373   16290   TAAGACATCCCGATGGATCACAGGTCTATCACCCTATTACCCACTCACGGGAGCTCTCCATGCATTTGGTATTGTCGTCTGGGGGGTGTGCACGCGATA array('B', [37, 37, 25, 37, 11, 37, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 11, 25, 11, 37, 37, 11, 25, 37, 25, 37, 37, 11, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 11, 37, 25, 37, 37, 11, 11, 37, 37, 37, 37, 11, 37, 37, 25, 37, 37, 37, 37, 37, 37, 11, 37, 25, 37, 11, 37, 25, 37, 25, 25, 37, 11, 37, 11, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 11, 37, 25, 11, 37, 37, 11, 37, 37, 37, 37, 37, 37, 11])    [('NM', 2), ('MD', '24A33T25'), ('MC', '101M'), ('AS', 74), ('XS', 54)]
    READ5   147 #0  1   60  2S84M   #0  1   -84 TGGATCACAGGTCTATCACCCTATTAACCACTCACGGGAGCTCTCCATGCATTTGGTATTTTCGTCTGGGGGGTGTGCACGCGATA  array('B', [25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 25, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37])    [('MC', '2S84M'), ('AS', 84), ('XS', 56), ('MD', '84'), ('NM', 0)]

    All reads have MD tags, which is unexpected.

I checked the source code of pysam, and found that the issue arises when multiple_iterators=True. It opens a new CRAM file object, but it doesn't strictly adhere to the original opening options, for example, it doesn't use the format_options. Due to the absence of setting decode_md=0, all reads' NM/MD tags will be automatically calculated, resulting in all reads containing NM/MD tags. https://github.com/pysam-developers/pysam/blob/43c10664834cf3914000a744e6057826c6a6fa65/pysam/libcalignmentfile.pyx#L2013

Thanks!