Closed isaacvock closed 12 months ago
Thanks for the comprehensive bug report.
It happens that the MD tag's value in the failing record in your test file is 9 — one character in length — and in this record it is written as MD:A:9
, a single-character field. Normally, and according to the spec, these are string fields written as MD:Z:…
and the pysam code expects only this field type.
So certainly pysam should raise an exception instead of crashing when it sees MD:A:…
. As an extension, I'm tempted to have it instead accept it and interpret it as a string in this case, in which case your test file would work as expected.
But the real question is how this strictly speaking invalid field got into your SAM file.
I've looked at the STAR source code and it is unlikely to write an MD field like this. Samtools would not. Are there other programs in your pipeline not reflected in your test file's @PG
headers that may have caused this?
There is a behaviourally related pysam bug (#899) that I mean to address sometime, but AIUI it would change the type in the other direction. So that is probably not the culprit.
Thank you for getting back to me so quickly with a detailed answer. The one other program that is not reflected in the @PG
header is a custom HTSeq script that calls features the read overlaps (this is what the GF
, EF
, and XF
tags in the provided sam file refer to). I checked the bam files produced by STAR and they in fact do have the proper MD:Z:...
. The culprit in the custom htseq script is a function (borrowed from the actual HTSeq repo) that uses pysam to create a new sam file that gets passed to the mutation counting script where the problems cropped up:
def write_to_samout(r, assignment, samoutfile, template=None):
'''
assignment: is a list of strings
'''
if samoutfile is None:
return
if not pe_mode:
r = (r,)
tags = ['GF','EF','XF','XG','XH','XI','XJ','XK','XL','XN','XO','XP','XQ','XR','XS','XT','XU']
for read in r:
if read is not None:
for p in range(0, len(assignment)):
read.optional_fields.append((tags[p], assignment[p]))
if samout_format in ('SAM', 'sam'):
samoutfile.write(read.get_sam_line() + "\n")
else:
samoutfile.write(read.to_pysam_AlignedSegment(template))
I have actually run into problems with this script improperly writing tags before. The full script used to call features and create a new sam file is here.
So in summary, i think the problem might be related to that issue you linked!
Many of the functions used in this snippet are HTSeq functions rather than pysam functions. In particular, get_sam_line()
uses HTSeq's raw_optional_fields()
, which doesn't distinguish between solo characters and single-character strings, using type A
for both. I believe this HTSeq limitation is the cause of the MD:A:…
fields in your SAM file. (The same function does not handle B
arrays and defaults them to H
, which I think is the root cause of your previous jI
and jM
problem.)
It would be relatively difficult to deploy fixes to both packages, so I have made get_aligned_pairs()
understand both MD:Z:…
and MD:A:…
so that it can work around this HTSeq problem.
Hello,
When working with a particular bam file using pysam, I came across a strange error that only occurs when calling
r.get_aligned_pairs(with_seq = True)
on certain reads in the bam file. For those reads, a segmentation fault is thrown and Python crashes. I have attached a sam file that includes 4 reads, 1 of which causes the segmentation fault error when runningget_aligned_pairs
, though only if thewith_seq
option is True. I originally came across this problem with pysam version 0.16.0, but was able to reproduce it with version 0.21.0 as well. I am using Python version 3.8.6 and have reproduced the error on the windows subsystem for linux as well as a shared cluster running Red Hat Enterprise Linux version 8.6. I have also confirmed that the error reproduces on the WSL with pysam version 0.21.0 and Python version 3.10.2.Sam file for reproducible example (gzipped to make Github happy):
problem_child.sam.gz
Reproducible example:
I am happy to provide any additional information. I apologize if I have missed an existing issue report about this same problem, I checked but did not see any.
UPDATE: After combing through the bam file and finding more problematic reads, it seems like the problem often arises for a read with a CIGAR string of
31S9M
or9M31S
. Of the 21 problematic reads I was able to chase down,31S9M
or9M31S
was the CIGAR string of 20 of them. There was one instance where a read with the CIGAR string4S8M28S
also was problematic (so a similar fraction of softclipped and matched reads). After a bit more digging though, I confirmed that not all reads with a CIGAR string of31S9M
or9M31S
throw a segfault error with.get_aligned_pairs(with_seq = True)
, so still pretty idiosyncratic...Best, Isaac