philres / ngmlr

NGMLR is a long-read mapper designed to align PacBio or Oxford Nanopore (standard and ultra-long) to a reference genome with a focus on reads that span structural variations
MIT License
284 stars 41 forks source link

incomplete aux field and negative mapq? #82

Closed kautto closed 4 years ago

kautto commented 4 years ago

I'm getting an odd error with some HiFi data I was trying to align. When I pipe the output straight from ngmlr to samtools sort, it crashes with something along the lines of:

[E::sam_parse1] incomplete aux field
samtools sort: truncated file. Aborting

If I write out the SAM output first and try to samtools view it, it's instead

[W::sam_read1] Parse error at line 3870
[main_samview] truncated file.

Looking at the actual offending read in question, it's clear something went awfully wrong, since there's a negative mapping quality:

m64034_200304_183416/148374403/ccs  2064    chr13   65543216    -2147483648 17925S54M1D23M  *   0   0   [long sequence string] [long quality string]    AS:i:142    NM:i:2  XI:f:0.9744 XS:i:0  XE:i:142    XR:i:77 MD:Z:54^T1T21   SV:i:2  SA:Z:chr13,21240689,-,6S5022M1D1956M1D32M1D834M1I5928M6I3152M1I791M1D156M1I39M77S,60,39;    QS:i:17925  QE:i:18002  CV:f:0.427730   

It looks like an integer overflow issue, but I have no idea why it's suddenly happening to some reads. I'm running ngmlr 0.2.7 and samtools 1.10

There's nothing particularly weird about the data, it's human data aligned to hs38DH and I used --bam-fix (although that doesn't seem to apply in this case).

Without having to write some kind of intermediate parser to strip out bad alignments, is there an easy fix to this?

fritzsedlazeck commented 4 years ago

Thanks @kautto for reaching out. Just to check the obvious. These reads have been already collapsed with the pacbio ccs program?

Other than that I would need some examples to be able to see if I can help you.. Thanks Fritz

kautto commented 4 years ago

Hi Fritz,

yeah, they're actual HiFi/CCS reads, run through SMRTLink 8.

As a bit more detail, the reads are a subset of reads extracted from an already aligned BAM, i.e.

  1. HiFi/CCS reads were aligned to hs38
  2. A subset of reads were extracted for further analysis (samtools view followed by samtools bam2fq)
  3. When realigning the reads to the same reference, a small number of them have a problem like that.

Step 2 isn't actually changing anything - I checked to make sure the FASTQ records stayed the same.

I just noticed that step 1 was done with ngmlr 0.2.7 and htslib/samtools 1.9, and step 3 (where it fails) is with htslib/samtools 1.10.

If I take the failing reads and attempt realignment in a conda environment with htslib 1.9, it works again - but as soon as I switch to 1.10 it fails. So it looks like some kind of incompatiblity with the newest version of htslib?

It only affects a small subset of reads, 36 out of about 280k, all which are supplementary (flag 2048 or 2064). I'll email you a FASTQ with some example reads so you can try to reproduce it.

Best, Esko

kautto commented 4 years ago

In case others run into this issue, it seems to be related to the new samtools/htslib version (1.10). Reverting back to 1.9 works fine, so that's my workaround for now.