pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
782 stars 274 forks source link

When SAM line query_qualities="*"; AlignedSegment query_qualities -> Long incorrect array #1196

Closed dvklopfenstein closed 1 year ago

dvklopfenstein commented 1 year ago

Thank you for the fantastic pysam tools. We use this package all the time with great success.

ISSUE

We are currently seeing incorrect query_qualities values in a pysam AlignedSegment in some cases.

EXPECTED BEHAVIOR:

When a SAM line contains query_qualities='*', indicating that the query_qualities are not available, the pysam AlignedSegment generated from the SAM line correctly sets its query_qualities to None:

SAM line # field name SAM line pysam AlignedSegment
SAM line 1 query_qualities * None
SAM line 2 query_qualities * None
SAM line 3 query_qualities * None

ACTUAL (ERRONEOUS) BEHAVIOR:

But if minimap2 returns an odd SAM line (CIGAR=28N, when SEQ is ~700 basepairs), pysam's AlignedSegment invents an incorrect query_qualities array.

When the SAM line query_qualities is "*", the AlignedSegment's query_qualities value should be set to None, even if minimap2 is generating a questionable aligned segment line.

ATTENTION: The query_qualities array values in the AlignedSegment seem randomly generated and change if I comment out one or more of the passing SAM lines (lines 1-3) in the attached self-contained test.

SAM line # field name SAM line pysam AlignedSegment
SAM line 4 query_qualities * array('B', [176, 147, 139, 177, 195, 127, 0, 0, 224, 87, 119, 177, 195, 127, 0, 0, 101, 78, 15, 59, 56, 165, 22, 5, 176, 148, 139, 177, 195, 127, 0, 0, 112, 88, 119, 177, 195, 127, 0, 0, 50, 15, 238, 178, 249, 20, 97, 194, 48, 130, 139, 177, 195, 127, 0, 0, 0, 89, 119, 177, 195, 127, 0, 0, 47, 215, 188, 51, 91, 133, 44, 6, 48, 148, 139, 177, 195, 127, 0, 0, 144, 89, 119, 177, 195, 127, 0, 0, 214, 252, 24, 219, 162, 235, 176, 102, 48, 172, 139, 177, 195, 127, 0, 0, 32, 90, 119, 177, 195, 127, 0, 0, 98, 98, 247, 235, 128, 206, 17, 209, 48, 239, 135, 177, 195, 127, 0, 0, 32, 90, 119, 177, 195, 127, 0, 0, 71, 27, 145, 191, 78, 117, 75, 81, 112, 207, 137, 177, 195, 127, 0, 0, 176, 90, 119, 177, 195, 127, 0, 0, 63, 62, 31, 201, 211, 102, 195, 208, 112, 84, 142, 177, 195, 127, 0, 0, 64, 91, 119, 177, 195, 127, 0, 0, 180, 123, 159, 8, 178, 76, 122, 20, 48, 85, 142, 177, 195, 127, 0, 0, 208, 91, 119, 177, 195, 127, 0, 0, 207, 25, 153, 175, 183, 236, 206, 169, 176, 84, 142, 177, 195, 127, 0, 0, 96, 92, 119, 177, 195, 127, 0, 0, 222, 141, 94, 91, 96, 15, 134, 146, 176, 132, 139, 177, 195, 127, 0, 0, 96, 92, 119, 177, 195, 127, 0, 0, 199, 231, 99, 66, 174, 42, 8, 145, 112, 148, 117, 177, 195, 127, 0, 0, 240, 92, 119, 177, 195, 127, 0, 0, 179, 27, 28, 218, 169, 123, 82, 4, 240, 84, 142, 177, 195, 127, 0, 0, 128, 93, 119, 177, 195, 127, 0, 0, 188, 62, 90, 39, 84, 65, 211, 201, 112, 88, 142, 177, 195, 127, 0, 0, 16, 94, 119, 177, 195, 127, 0, 0, 166, 225, 110, 133, 99, 124, 123, 40, 112, 178, 139, 177, 195, 127, 0, 0, 160, 94, 119, 177, 195, 127, 0, 0, 252, 135, 48, 44, 58, 102, 222, 141, 48, 88, 142, 177, 195, 127, 0, 0, 48, 95, 119, 177, 195, 127, 0, 0, 237, 139, 252, 210, 231, 215, 186, 135, 176, 88, 142, 177, 195, 127, 0, 0, 192, 95, 119, 177, 195, 127, 0, 0, 41, 254, 179, 84, 56, 170, 29, 51, 240, 150, 117, 177, 195, 127, 0, 0, 80, 96, 119, 177, 195, 127, 0, 0, 3, 48, 98, 209, 200, 236, 44, 69, 240, 144, 139, 177, 195, 127, 0, 0, 224, 96, 119, 177, 195, 127, 0, 0, 196, 167, 141, 186, 238, 104, 14, 124, 176, 141, 139, 177, 195, 127, 0, 0, 112, 97, 119, 177, 195, 127, 0, 0, 178, 110, 234, 87, 137, 84, 95, 79, 240, 141, 139, 177, 195, 127, 0, 0, 0, 98, 119, 177, 195, 127, 0, 0, 183, 54, 174, 89, 52, 93, 5, 218, 112, 158, 139, 177, 195, 127, 0, 0, 0, 7, 119, 177, 195, 127, 0, 0, 145, 42, 155, 24, 97, 179, 110, 222, 112, 77, 150, 177, 195, 127, 0, 0, 64, 7, 119, 177, 195, 127, 0, 0, 18, 54, 201, 125, 124, 209, 140, 119, 64, 206, 139, 177, 195, 127, 0, 0, 96, 119, 105, 177, 195, 127, 0, 0, 177, 139, 6, 16, 101, 242, 16, 21, 176, 22, 127, 177, 195, 127, 0, 0, 192, 14, 119, 177, 195, 127, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

Test

Run this small self-contained test to see the error:

Conclusion

When the SAM line query_qualities is "*", the AlignedSegment's query_qualities value should be set to None, even if minimap2 generates a questionable aligned segment line.

Thank you for taking your time to investigate this issue. Thank you for such a well-done tool suite.

cc: @judowill

jmarshall commented 1 year ago

Thanks for the detailed bug report. I've fixed the problem but possibly not in a way that you're going to like…

When you run your script, you may notice this message go by:

[E::sam_parse1] CIGAR and query sequence are of different length

The “odd SAM line” with a short CIGAR length but long SEQ is invalid enough that samtools refuses to read it:

$ samtools view bork.sam
…
[E::sam_parse1] CIGAR and query sequence are of different length
[W::sam_read1_sam] Parse error at line 6
samtools view: error reading file "bork.sam"

(I assume you have reported this invalid output as a minimap2 issue.)

Pysam uses the same parsing code, and there was a small glitch in that it ignored the error code and carried on with a partially initialised record instead of propagating the error by raising an exception.

So we just went ahead and fixed the glitch.