pysam-developers / pysam

Pysam is a Python package for reading, manipulating, and writing genomics data such as SAM/BAM/CRAM and VCF/BCF files. It's a lightweight wrapper of the HTSlib API, the same one that powers samtools, bcftools, and tabix.
https://pysam.readthedocs.io/en/latest/
MIT License
774 stars 274 forks source link

pysam crashes when quality only has a single value of "*" #1211

Open ymcki opened 1 year ago

ymcki commented 1 year ago

I have a unmapped bam with many entries of basecalls with only one base and a quality of "*".

pysam 0.21.0 then crashes whenever it reaches a line like that.

Traceback (most recent call last):
  File "./q.py", line 8, in <module>
    sum_bq += sum(read.query_qualities)
TypeError: 'NoneType' object is not iterable

The simple progam that allows you replicate the problem is:

import pysam

bam = sys.argv[1]

sum_bq = 0
bam_file = pysam.AlignmentFile(bam, "r", check_sq=False, threads=4)
for read in bam_file:
    sum_bq += sum(read.query_qualities)

This is the content of the ubam file presented in sam format I used to reproduce the error

@HD VN:1.6  SO:unknown
@PG ID:basecaller   PN:dorado   VN:0.2.1+c70423e    CL:dorado duplex -t 96 /models/dna_r10.4.1_e8.2_400bps_sup@v4.1.0 /pod5/ --pairs pairs_from_bam/pair_ids_filtered.txt
9642cde2-f8c3-4741-86cc-0fcbe29776d7;7c1c2b91-f9ff-4d5d-87c5-83c29b74c749   4   *   0   0   *   *   0   1   C   *   qs:i:9
88f36782-d9dc-45fa-92e7-347f1493b65b;ee2c206e-41bb-4c2a-9575-5170fe7f3ec0   4   *   0   0   *   *   0   1   C   *   qs:i:9
0f2f76b2-beb8-4625-8f75-a8065a72cbe6;51d2ca59-ba74-4f82-82d3-09490bbf8b21   4   *   0   0   *   *   0   1   T   *   qs:i:9
38b6482b-0c91-42d0-92b1-6d4298d0eba5;7ade0e9f-c25c-43be-99de-d7bf59e660a9   4   *   0   0   *   *   0   1   T   *   qs:i:9
d7dc1c40-d297-4c2e-94cf-015d80dc7aab;d38c2d9f-e3bd-4c23-9cc7-5c9e6b704528   4   *   0   0   *   *   0   1   C   *   qs:i:9
8d7d8c03-c449-4bf0-af44-36ae3271d6fe;2a851990-847f-42b6-9ad0-f2b809868599   4   *   0   0   *   *   0   1   C   *   qs:i:9
55e10091-8299-446b-a31a-ca3c5f57088a;47ab3a62-383c-478d-bc0e-030be2da5786   4   *   0   0   *   *   0   1   T   *   qs:i:9

My guess is that pysam treated single "" as this field is empty instead of single base quality of

jmarshall commented 1 year ago

In SAM, in general a QUAL field of * indicates that base qualities are not available. Historically no-one has particularly cared about single-base reads and it is unspecified whether in that case * indicates unavailable or a base quality of 9. That ambiguity is samtools/hts-specs#715, on which you may wish to express an opinion.

However there is no such ambiguity in BAM. If the “ubam file” you are actually feeding to this script is a BAM file, then getting None here indicates that the record really does have QUAL absent. (However the qs:i:9 tag, if it is indeed an average base quality score, suggests that this data originated from a SAM file that intended QUAL = * to mean base quality 9…)

Pysam is not crashing here. Your script is crashing when read.query_qualities returns None, which your script is not dealing with. This property is None when the QUAL field is absent, and your script probably needs to deal with this possibility.

Are these single-base reads important in your analysis, or are they a handful of degenerate reads that could be filtered out without adverse effect? Are there other single-base reads present in your data with other characters in their QUAL fields?

ymcki commented 1 year ago

Thanks for your reply. I don't know about there is an undefined spec regarding this situation. Intuitively, I would think it is a way to specify a single base read with quality 9.

I encountered this situation while using ONT's dorado basecaller. Anyway, I will relay this ambiguity to the dorado basecaller team and see what they want to do with it.