philres / ngmlr

NGMLR is a long-read mapper designed to align PacBio or Oxford Nanopore reads (standard and ultra-long) to a reference genome, with a focus on reads that span structural variations.
MIT License

Mapping to repeats leads to deletions with low allele frequency #76

Open flashton2003 opened 4 years ago

flashton2003 commented 4 years ago

Hello,

I'm analysing some Cryptococcus neoformans (a haploid fungus) PacBio genome data. I noticed something strange when I was looking at some deletions which had low allele frequency. When only part of a repeated region was deleted, sometimes NGMLR was not consistent with how it split the read. Here is a clear example.

[Image: Screenshot 2020-02-04 at 15 42 02]

There is a TTCTTCCCCC motif repeated four times in the reference genome. Most of the reads which map there only support there being one TTCTT part of the motif left (probably CCCCCTTCTTCCCCC), but the reads are mapped to different 'ends' of the 4-fold repeat in the reference genome. This means that the allele frequency is not as high as it should be, because each end of the deletion is only supported by around half the reads.
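To make the ambiguity concrete, here is a minimal sketch (not NGMLR's algorithm) that enumerates every position at which a 30 bp deletion can be placed in a four-copy TTCTTCCCCC repeat and still produce the same read sequence. Only the motif and copy number come from the description above; the flanking sequences, `apply_deletion` and `equivalent_placements` are hypothetical names and values chosen for illustration.

```python
# Only the motif and its copy number come from the post above.
# The flanking sequences are made up for illustration.
MOTIF = "TTCTTCCCCC"
LEFT, RIGHT = "GATTACAGAT", "AGGCATCGTA"
REF = LEFT + MOTIF * 4 + RIGHT


def apply_deletion(ref, start, length):
    """Return ref with ref[start:start + length] removed."""
    return ref[:start] + ref[start + length:]


def equivalent_placements(ref, start, length):
    """List every start position whose deletion gives the same sequence
    as deleting ref[start:start + length]."""
    target = apply_deletion(ref, start, length)
    return [
        s
        for s in range(len(ref) - length + 1)
        if apply_deletion(ref, s, length) == target
    ]


# Delete three of the four repeat units, starting at the first unit.
deletion_length = 3 * len(MOTIF)
placements = equivalent_placements(REF, len(LEFT), deletion_length)
print(placements)
```

With these toy flanks, every start position from 10 to 20 yields an identical read, so an aligner has no sequence-based reason to prefer one end of the repeat over the other.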

When I looked at the variants Sniffles called, quite a lot of my deletions with low allele frequencies were in repeat regions.

I just wondered if there was a way to place these reads in repeat regions more consistently, as this would lead to more variants passing an allele frequency threshold of 80%.
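One possible post-processing workaround, sketched here under stated assumptions, is to left-align each deletion call and pool the read support of calls that turn out to be the same event. This is not something NGMLR or Sniffles is confirmed to do. The flanking sequences, the read counts, the coverage, and the function names `left_align_deletion` and `pooled_allele_frequencies` are all hypothetical.

```python
from collections import Counter

MOTIF = "TTCTTCCCCC"
# Flanking sequences are made up; only the motif and copy number are from the post.
REF = "GATTACAGAT" + MOTIF * 4 + "AGGCATCGTA"


def left_align_deletion(ref, start, length):
    """Shift a deletion of ref[start:start + length] as far left as
    possible without changing the resulting sequence."""
    # Moving the deletion one base left is neutral only if the base that
    # enters the deleted window equals the base that leaves it.
    while start > 0 and ref[start - 1] == ref[start + length - 1]:
        start -= 1
    return start


def pooled_allele_frequencies(ref, calls, coverage):
    """Sum supporting reads of calls that left-align to the same deletion."""
    support = Counter()
    for start, length, reads in calls:
        key = (left_align_deletion(ref, start, length), length)
        support[key] += reads
    return {key: reads / coverage for key, reads in support.items()}


# Hypothetical calls: (start, length, supporting reads), one at each end of the repeat.
calls = [(10, 30, 12), (20, 30, 11)]
coverage = 25  # hypothetical read depth at the locus

pooled = pooled_allele_frequencies(REF, calls, coverage)
print(pooled)
```

In this toy case the two calls have allele frequencies of 0.48 and 0.44 on their own, but both left-align to the same deletion, whose pooled frequency of 0.92 would pass an 80% threshold.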

Best,

Phil Ashton

fritzsedlazeck commented 4 years ago

Dear Phil, thanks for reaching out. Yes, this is a problem. Most of the time, some randomness is needed in the alignment backtracking procedure so that artifacts do not accumulate. In these repeat regions, however, that randomness works against you.

Can you tell me whether you have tried the newer version of Sniffles and still get a low frequency in such a region? I tried to improve this recently. Thanks, Fritz

flashton2003 commented 4 years ago

Hi Fritz,

I thought you might have come across this issue; it seems quite common in my data. Perhaps these repeat regions are susceptible to indels?

I'm using v1.0.11, which I think is the most up to date version?

Best,

Phil

fritzsedlazeck commented 4 years ago

Hi Phil, it's a common problem; I am investigating STR regions especially. Go to the Sniffles GitHub and try v1.14, which improved a lot in genotyping (GT) and in estimating the frequency. Cheers, Fritz

fritzsedlazeck commented 4 years ago

Oh, my bad, 1.11 is the newest. Sorry, I'm jet-lagged in Brussels at the moment...

flashton2003 commented 4 years ago

Ah, no worries.

Any thoughts on alternative filtering criteria, other than AF, that might help us include some of these variants?

fritzsedlazeck commented 4 years ago

I will need to think about it. I have been up since yesterday.