pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks source link

Segmentation fault when calling `get_aligned_pairs(with_seq = True)` #1226

Closed isaacvock closed 12 months ago

isaacvock commented 12 months ago

Hello,

When working with a particular bam file using pysam, I came across a strange error that only occurs when calling r.get_aligned_pairs(with_seq = True) on certain reads in the bam file. For those reads, a segmentation fault is thrown and Python crashes. I have attached a sam file that includes 4 reads, 1 of which causes the segmentation fault error when running get_aligned_pairs, though only if the with_seq option is True. I originally came across this problem with pysam version 0.16.0, but was able to reproduce it with version 0.21.0 as well. I am using Python version 3.8.6 and have reproduced the error on the windows subsystem for linux as well as a shared cluster running Red Hat Enterprise Linux version 8.6. I have also confirmed that the error reproduces on the WSL with pysam version 0.21.0 and Python version 3.10.2.

Sam file for reproducible example (gzipped to make Github happy):

problem_child.sam.gz

Reproducible example:

import pysam

# Load sam file
problem = pysam.AlignmentFile('problem_child.sam', 'r')

### Read that gives no problems

r = next(problem)

# Runs without error
r.get_aligned_pairs()

# Runs without error
r.get_aligned_pairs(with_seq = True)

### Read that throws Segmentation fault
r = next(problem)

# Runs without error
r.get_aligned_pairs()

# Errors with Segmentation fault
r.get_aligned_pairs(with_seq = True)

I am happy to provide any additional information. I apologize if I have missed an existing issue report about this same problem, I checked but did not see any.

UPDATE: After combing through the bam file and finding more problematic reads, it seems like the problem often arises for a read with a CIGAR string of 31S9M or 9M31S. Of the 21 problematic reads I was able to chase down, 31S9M or 9M31S was the CIGAR string of 20 of them. There was one instance where a read with the CIGAR string 4S8M28S also was problematic (so a similar fraction of softclipped and matched reads). After a bit more digging though, I confirmed that not all reads with a CIGAR string of 31S9M or 9M31S throw a segfault error with .get_aligned_pairs(with_seq = True), so still pretty idiosyncratic...

Best, Isaac

jmarshall commented 12 months ago

Thanks for the comprehensive bug report.

It happens that the MD tag's value in the failing record in your test file is 9 — one character in length — and in this record it is written as MD:A:9, a single-character field. Normally, and according to the spec, these are string fields written as MD:Z:… and the pysam code expects only this field type.

So certainly pysam should raise an exception instead of crashing when it sees MD:A:…. As an extension, I'm tempted to have it instead accept it and interpret it as a string in this case, in which case your test file would work as expected.

But the real question is how this strictly speaking invalid field got into your SAM file.

I've looked at the STAR source code and it is unlikely to write an MD field like this. Samtools would not. Are there other programs in your pipeline not reflected in your test file's @PG headers that may have caused this?

There is a behaviourally related pysam bug (#899) that I mean to address sometime, but AIUI it would change the type in the other direction. So that is probably not the culprit.

isaacvock commented 12 months ago

Thank you for getting back to me so quickly with a detailed answer. The one other program that is not reflected in the @PG header is a custom HTSeq script that calls features the read overlaps (this is what the GF, EF, and XF tags in the provided sam file refer to). I checked the bam files produced by STAR and they in fact do have the proper MD:Z:... . The culprit in the custom htseq script is a function (borrowed from the actual HTSeq repo) that uses pysam to create a new sam file that gets passed to the mutation counting script where the problems cropped up:

    def write_to_samout(r, assignment, samoutfile, template=None):
        '''
        assignment: is a list of strings
        '''
        if samoutfile is None:
            return
        if not pe_mode:
            r = (r,)
        tags = ['GF','EF','XF','XG','XH','XI','XJ','XK','XL','XN','XO','XP','XQ','XR','XS','XT','XU']
        for read in r:
            if read is not None:
                for p in range(0, len(assignment)):
                    read.optional_fields.append((tags[p], assignment[p]))
                if samout_format in ('SAM', 'sam'):
                    samoutfile.write(read.get_sam_line() + "\n")
                else:
                    samoutfile.write(read.to_pysam_AlignedSegment(template))

I have actually run into problems with this script improperly writing tags before. The full script used to call features and create a new sam file is here.

So in summary, i think the problem might be related to that issue you linked!

jmarshall commented 12 months ago

Many of the functions used in this snippet are HTSeq functions rather than pysam functions. In particular, get_sam_line() uses HTSeq's raw_optional_fields(), which doesn't distinguish between solo characters and single-character strings, using type A for both. I believe this HTSeq limitation is the cause of the MD:A:… fields in your SAM file. (The same function does not handle B arrays and defaults them to H, which I think is the root cause of your previous jI and jM problem.)

It would be relatively difficult to deploy fixes to both packages, so I have made get_aligned_pairs() understand both MD:Z:… and MD:A:… so that it can work around this HTSeq problem.