Open mrvollger opened 3 years ago
My current hacked workaround in case it is useful to anyone else is to mask all aligned sequences in the reference and query and then map again and combine the PAFs.
out="GRCh38_to_T2T.CHM13.v1.0_mm2_v2.22.paf"
hg38="../assemblies/hg38.chr_only.fa"
t2t="../assemblies/chm13.draft_v1.0.fasta"
minimap2 \
-cx asm20 --eqx --secondary=no \
-s 10000 -t 100 -K 8G \
$t2t \
$hg38\
> $out
minimap2 \
-cx asm20 --eqx --secondary=no \
-s 10000 -t 100 -K 8G \
<(seqtk seq \
-M <(cut -f 6,8,9 $out | bedtools sort -i -) \
-n "N" $t2t \
) \
<(seqtk seq \
-M <(cut -f 1,3,4 $out | bedtools sort -i -) \
-n "N" $hg38 \
) \
> missing.$out
I can confirm this is a bug, at least not an intended behavior. v2.19+ is missing this inversion because v2.19+ can chain through this region (older versions can't). Although minimap2 has a procedure to rescue inversions, it is somehow not effective over this ~1Mb inversion. I am looking at this. It may take time to fix.
@fenderglass your #806 is probably caused by the same issue. Sorry for the late response.
@fenderglass Do you still have the examples shown in #806? It seems simpler than Mitchell's chr16 case and might be easier for debugging. Thanks!
I believe I have alleviated the issue with the current HEAD – a complete fix would require a chaining algorithm that can go through inversions, which is hard. The -cxasm20
output now is (all >500kb blocks sorted by the starting coordiante):
chr16 90338345 10572 1238837 + chr16 96330374 3490 1254476 1221348 1256088
chr16 90338345 1247503 15115698 + chr16 96330374 1263138 15116748 13794683 13906654
chr16 90338345 15116377 21583053 + chr16 96330374 15117336 21517799 6357011 6454428
chr16 90338345 21397592 22699430 - chr16 96330374 21517870 22819179 1300163 1302533 <- was missing
chr16 90338345 22702890 28471878 + chr16 96330374 22980600 28752493 5751049 5784045
chr16 90338345 28643193 29226811 + chr16 96330374 28923849 29508118 581809 585077
chr16 90338345 29227349 32828272 + chr16 96330374 29508658 33242028 3593735 3738021
chr16 90338345 32645246 33214595 - chr16 96330374 33070249 33639393 568804 569466 <- was missing
chr16 90338345 33491370 34014729 + chr16 96330374 33668104 34181919 508904 525482
chr16 90338345 34912151 36260386 - chr16 96330374 37857994 39216720 1347464 1358890
chr16 90338345 46381024 72351794 + chr16 96330374 52177860 78168088 25921269 26018581
chr16 90338345 72351998 82130592 + chr16 96330374 78168396 88194329 9756449 10034708
chr16 90338345 82130902 85155339 + chr16 96330374 88194391 91221413 3016503 3029537
chr16 90338345 85156195 90228345 + chr16 96330374 91222304 96316463 5046514 5111729
@fenderglass I also tried the new HEAD on your data and manually checked one locus on contig 142. I think that case is also fixed.
Still need to do more test before cutting a release.
Hi Heng!
Thanks for the fix! It seems to be working in my hands as well. However, one thing I noticed is that a lot of inversions do not have clean breakpoints and overlap with the continuous alignments. This is probably a totally separate issue, but it would be great if at some point these breakpoints could be refined.
Cheers, Mitchell
a lot of inversions do not have clean breakpoints and overlap with the continuous alignments.
I see this often as well. It seems that flanking sequences around large inversions tend to have homology, which causes the overlap. It is difficult to resolve such overlaps with the current minimap2 algorithm.
Hi Heng,
Thanks for looking into it.
Continuing https://github.com/lh3/minimap2/issues/823
The current HEAD still requires some more improvements. The 10kbp inversion is still not identified while the 5kbp inversion is only aligned for 1141bps.
Best Manish
Could you send me the data? Thanks.
Here it is:
ref.txt seq_inv5000.fa.txt seq_inv10000.fa.txt
My command: /software/TMP/minimap2_HEAD/minimap2 -cx asm5 ref.txt seq_inv10000.fa.txt
Thanks. Long inv and short inv are handled differently. 10k is right on the threshold between long and short, which leads to a corner case that was not considered. The 10k example is fixed on github HEAD.
The 5k one is not fixed. Actually all versions of minimap2 have this issue. It is triggered more frequently as a side-effect of finding long indels. I will think about a solution later.
PS: I knew the 5k issue would happen. It is great that I have an example that can trigger the issue.
Thanks for fixing this. I tested on some more simulated inversion and it worked fine. The 5k issue didn't happen in any other case, so cant share more examples.
Another related question is that minimap2 does not report small (100bp) inverted alignments. Is it possible to adjust some parameters to increase the sensitivity of minimap2 in identifying more of these smaller inversions?
Sample inversion: seq_inv100.fa.txt
Minimap2 is not intended to call 100bp inversions. You can force minimap2 to align small inversions with
minimap2 -cxasm20 -z200,50 -s50
but I am not sure this would be an improvement.
Thanks for sharing the parameters. I understand the limitations in finding small genomic rearrangements.
@mrvollger Although v2.23 fixed the chr16 inversion, I noticed minimap2 missed another inversion on chr8 chr3. The new v2.24 release is using a more robust solution and has this fixed.
Thanks @lh3! Out of curiosity was the issue in the beta def region ~6-14Mbp?
Sorry, I meant to say chr3, towards the end of chr3. I haven't checked that chr8 inversion yet.
Got it, thanks!
@lh3 I wrote a utility that tries to find an optimal split when the same query sequence is mapped twice over multiple alignments. Do you have any examples of this that I might do some further testing on? I have been testing on NOTCH2NL but additional tests would be great.
Hi Heng,
I am looking at a 1 Mbp region on chr16 that doesn't seem to align when comparing GRCh38 and T2T, but there is a lot of homology in the region. When I do a whole genome alignment I get about 1 Mbp of missing alignment around ~
chr16:21,000,000-22,000,000
and more specifically I get these alignments (including the flanking regions):With this command (version 2.22):
And here is an image of the alignment (T2T on the bottom)
However if I take just the region that is mostly unaligned and align it then I get a nearly complete alignment in the reverse complement, suggesting a missed inversion I think.
Command:
I have also dropped the two paf files and references at this link: https://eichlerlab.gs.washington.edu/help/mvollger/share/mm2-issue-chr16/
I am not sure the cause of this, but it seems like a large event to miss, and hopefully I this is not just some mistake on my part.
(And credit to Ariel Gershman for finding this issue, if it is a real issue and not some mistake on my part)
Thanks in advance! Mitchell