samtools / htslib

C library for high-throughput sequencing data formats

sam_parse1 reports "invalid QUAL character" when parsing third-generation sequencing data #1813

Closed zhaobu closed 2 weeks ago

zhaobu commented 1 month ago

After aligning Nanopore (third-generation) sequencing data with minimap2, I tried to convert the alignment result, held as a kstring_t string, into a bam1_t using the sam_parse1 function, but it failed with the error "invalid QUAL character".

whitwham commented 1 month ago

What is the bad QUAL character?

zhaobu commented 1 month ago

test.fastq.txt

I have read the implementation of the COPY_MINUS_N macro used in sam_parse1 and found that it checks that quality values fall within the half-open interval [33,128). I am not sure my understanding is correct, but my Nanopore samples trigger this error even though the quality values on the lines that fail all appear to be within this range.

In the small sample above, the two lines starting with @4b677263-10cb-4fe1-b498-9c2a36445419 and @aa2a455b-7a48-4ddd-9dbf-21008109a7f5 will cause this error.
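For illustration, the range check described in this comment can be sketched in Python. This is only a sketch of the check as the comment describes it, not htslib's implementation: the function name find_invalid_qual is hypothetical, and the bounds 33 and 128 are taken from the comment's reading of the code rather than confirmed against the htslib source.

```python
def find_invalid_qual(qual, lower=33, upper=128):
    """Return (position, character, code point) for every character
    whose code point lies outside the half-open interval [lower, upper).

    The default bounds are the ones quoted in the comment above; they are
    an assumption about what the parser accepts, not a confirmed fact.
    """
    return [
        (pos, ch, ord(ch))
        for pos, ch in enumerate(qual)
        if not lower <= ord(ch) < upper
    ]


if __name__ == "__main__":
    # A quality string made only of printable Phred+33 characters passes.
    print(find_invalid_qual('"#$%Ib'))
    # An embedded space (code point 32) falls below the lower bound.
    print(find_invalid_qual("II II"))
```

Running a check like this on the QUAL column of the SAM text that is actually passed to sam_parse1, rather than on the input FASTQ, would show which character, if any, falls outside the range.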

jkbonfield commented 1 month ago

There's nothing outside of the legal quality values in that fastq. They range from " (qual 1) to b (qual 65). We'd need to see the minimap2 output to be sure, but I suspect it's a problem with minimap2 generating invalid data and not htslib parsing it.

You can verify this with samtools view test.fastq.txt, which ingests the fastq file and converts it to unmapped SAM. The samtools fastq command can then convert it back again with the quality strings intact.
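The quality range quoted above (" as qual 1, b as qual 65) follows from the Phred+33 encoding, where the quality value is the character's ASCII code minus 33. A minimal Python sketch of how such a range could be computed from a FASTQ file is shown below. It assumes the common four-lines-per-record layout with no wrapped sequence or quality lines; the function name qual_range and the sample record are made up for illustration and are not taken from test.fastq.txt.

```python
def qual_range(fastq_lines, offset=33):
    """Return (min_char, min_qual, max_char, max_qual) over all quality
    lines, or None if no quality characters were seen.

    Assumes a four-lines-per-record FASTQ (header, sequence, '+', quality),
    so the quality string is every fourth line.
    """
    lo = hi = None
    for index, line in enumerate(fastq_lines):
        if index % 4 != 3:  # only the 4th line of each record holds QUAL
            continue
        for ch in line.rstrip("\r\n"):
            if lo is None or ch < lo:
                lo = ch
            if hi is None or ch > hi:
                hi = ch
    if lo is None:
        return None
    return lo, ord(lo) - offset, hi, ord(hi) - offset


if __name__ == "__main__":
    # Hypothetical record, for illustration only.
    record = ["@read1\n", "ACGT\n", "+\n", '"I5b\n']
    print(qual_range(record))
```

To scan a real file, an open file handle can be passed in place of the list, since iterating over it yields lines in the same way.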

jkbonfield commented 2 weeks ago

Closing as no response.