philres / ngmlr

NGMLR is a long-read mapper designed to align PacBio or Oxford Nanopore (standard and ultra-long) to a reference genome with a focus on reads that span structural variations
MIT License
284 stars 41 forks source link

MAPQ of -2147483648 attributed to some reads in samfile (minimum value for an int32) #83

Open OmarOakheart opened 3 years ago

OmarOakheart commented 3 years ago

I'm using the following version of ngmlr:

ngmlr 0.2.7 (build: Jul 3 2020 03:31:03, start: 2020-07-19.14:10:32)

I ran the following command:

ngmlr -t 8 -x ont -r $reference -q $file.fastq.gz -o $file.sam

And one (probably more) of the lines outputs with a MAPQ of -2147483648

CCN_ACA_BMB_26 2048 chromosome15 1074809 -2147483648 2561S12M2D21M2I90M1D15M1I2M2D7M1D5M1I13M1S

I didn't paste the rest to avoid clutter but it's a 2731 bp read so nothing crazy in that respect (I checked in the original fastq file)

To provide more context, I think this is an issue that used to go unnoticed for me, but I now use samtools v1.10, which seems to now check for MapQ and output an error if it's <0

if (b->core.qual < settings->min_mapQ || ((b->core.flag & settings->flag_on) != settings->flag_on) || (b->core.flag & settings->flag_off))
        return 1;

(From: https://github.com/samtools/samtools/blob/a6a160bba842d6afc3da061a8aa4f199fd7b729b/sam_view.c)

Still, I don't know what would cause a MAPQ of -1 in the first place

This issue is also found in https://github.com/philres/ngmlr/issues/75 (looks like no one noticed the negative MAPQ but it's there. I'm not sure if there's a way to merge issues or so, I'm not proficient with github yet.

My hunch is that the following happens:

The mapping for the offending read is an extremely low quality, so I think there are issues with one of the signs in the these lines:

uint32_t mapq = -4.343 * log(1 - (double)abs(a->score1 - a->score2)/(double)a->score1);
mapq = (uint32_t) (mapq + 4.99);

I gather than mapq should be a positive integer, therefore I assume that

log(1 - (double)abs(a->score1 - a->score2)/(double)a->score1) <0

Therefore

1 - (double)abs(a->score1 - a->score2)/(double)a->score1 <1

Therefore

(double)abs(a->score1 - a->score2)/(double)a->score1 >0

Of course, what seems to be happening is the opposite, so the bug is that sometimes

(double)abs(a->score1 - a->score2)/(double)a->score1 <0

This is where my understanding reaches its limit (and maybe the bug is elsewhere but this seems like a good lead).

Either sometimes score1<score2, or sometimes score1<0

I'm not sure I understand what score1 and score2 represent so I don't know if either one is a plausible scenario. I hope all of this helps.

rhysnewell commented 3 years ago

I'm having the exact same issue, my guess would be there seems to be some sort of underflow occurring somewhere but I've not looked through the code to try and pinpoint it.