lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.81k stars 412 forks source link

Missing 1 Mbp of alignment on Chromosome 16 between T2T-CHM13 v1.0 and GRCh38 #816

Open mrvollger opened 3 years ago

mrvollger commented 3 years ago

Hi Heng,

I am looking at a 1 Mbp region on chr16 that doesn't seem to align when comparing GRCh38 and T2T, but there is a lot of homology in the region. When I do a whole genome alignment I get about 1 Mbp of missing alignment around ~chr16:21,000,000-22,000,000 and more specifically I get these alignments (including the flanking regions):

chr16   90338345        15116377        21305673        +       chr16   96330493        15117341        21337915
chr16   90338345        21430746        21466480        +       chr16   96330493        21709065        21744757
chr16   90338345        22472332        28471794        +       chr16   96330493        22750307        28752414

With this command (version 2.22):

minimap2 \
            -cx asm20 --eqx --secondary=no \
            -s 10000 -t 100 -K 8G \
            ../assemblies/chm13.draft_v1.0.fasta ../assemblies/hg38.chr_only.fa 

And here is an image of the alignment (T2T on the bottom) image

However if I take just the region that is mostly unaligned and align it then I get a nearly complete alignment in the reverse complement, suggesting a missed inversion I think.

chr16:21305673-22472332 1166660 160939  1166660 -       chr16:21337915-22750307 1412393 406974  1412393 1005059
chr16:21305673-22472332 1166660 91920   196680  +       chr16:21337915-22750307 1412393 337923  442372  103880  105150

Command:

minimap2 \
  -cx asm20 --eqx --secondary=no \
  -s 10000 -t 100 -K 8G \
  <(samtools faidx ../assemblies/chm13.draft_v1.0.fasta chr16:21337915-22750307 ) \
  <(samtools faidx ../assemblies/hg38.chr_only.fa chr16:21305673-22472332 )  \

I have also dropped the two paf files and references at this link: https://eichlerlab.gs.washington.edu/help/mvollger/share/mm2-issue-chr16/

I am not sure the cause of this, but it seems like a large event to miss, and hopefully I this is not just some mistake on my part.

(And credit to Ariel Gershman for finding this issue, if it is a real issue and not some mistake on my part)

Thanks in advance! Mitchell

mrvollger commented 3 years ago

My current hacked workaround in case it is useful to anyone else is to mask all aligned sequences in the reference and query and then map again and combine the PAFs.

out="GRCh38_to_T2T.CHM13.v1.0_mm2_v2.22.paf"
hg38="../assemblies/hg38.chr_only.fa"
t2t="../assemblies/chm13.draft_v1.0.fasta"

minimap2 \
  -cx asm20 --eqx --secondary=no \
  -s 10000 -t 100 -K 8G \
  $t2t \
  $hg38\
  > $out 

minimap2 \
  -cx asm20 --eqx --secondary=no \
  -s 10000 -t 100 -K 8G \
  <(seqtk seq \
      -M <(cut -f 6,8,9 $out | bedtools sort -i -) \
      -n "N" $t2t \
    ) \
  <(seqtk seq \
      -M <(cut -f 1,3,4 $out | bedtools sort -i -) \
      -n "N" $hg38 \
    ) \
  > missing.$out 

image

lh3 commented 3 years ago

I can confirm this is a bug, at least not an intended behavior. v2.19+ is missing this inversion because v2.19+ can chain through this region (older versions can't). Although minimap2 has a procedure to rescue inversions, it is somehow not effective over this ~1Mb inversion. I am looking at this. It may take time to fix.

@fenderglass your #806 is probably caused by the same issue. Sorry for the late response.

lh3 commented 3 years ago

@fenderglass Do you still have the examples shown in #806? It seems simpler than Mitchell's chr16 case and might be easier for debugging. Thanks!

lh3 commented 3 years ago

I believe I have alleviated the issue with the current HEAD – a complete fix would require a chaining algorithm that can go through inversions, which is hard. The -cxasm20 output now is (all >500kb blocks sorted by the starting coordiante):

chr16 90338345 10572    1238837  + chr16 96330374 3490     1254476  1221348  1256088
chr16 90338345 1247503  15115698 + chr16 96330374 1263138  15116748 13794683 13906654
chr16 90338345 15116377 21583053 + chr16 96330374 15117336 21517799 6357011  6454428
chr16 90338345 21397592 22699430 - chr16 96330374 21517870 22819179 1300163  1302533  <- was missing
chr16 90338345 22702890 28471878 + chr16 96330374 22980600 28752493 5751049  5784045
chr16 90338345 28643193 29226811 + chr16 96330374 28923849 29508118 581809   585077
chr16 90338345 29227349 32828272 + chr16 96330374 29508658 33242028 3593735  3738021
chr16 90338345 32645246 33214595 - chr16 96330374 33070249 33639393 568804   569466   <- was missing
chr16 90338345 33491370 34014729 + chr16 96330374 33668104 34181919 508904   525482
chr16 90338345 34912151 36260386 - chr16 96330374 37857994 39216720 1347464  1358890
chr16 90338345 46381024 72351794 + chr16 96330374 52177860 78168088 25921269 26018581
chr16 90338345 72351998 82130592 + chr16 96330374 78168396 88194329 9756449  10034708
chr16 90338345 82130902 85155339 + chr16 96330374 88194391 91221413 3016503  3029537
chr16 90338345 85156195 90228345 + chr16 96330374 91222304 96316463 5046514  5111729

@fenderglass I also tried the new HEAD on your data and manually checked one locus on contig 142. I think that case is also fixed.

Still need to do more test before cutting a release.

mrvollger commented 3 years ago

Hi Heng!

Thanks for the fix! It seems to be working in my hands as well. image However, one thing I noticed is that a lot of inversions do not have clean breakpoints and overlap with the continuous alignments. This is probably a totally separate issue, but it would be great if at some point these breakpoints could be refined.

Cheers, Mitchell

lh3 commented 3 years ago

a lot of inversions do not have clean breakpoints and overlap with the continuous alignments.

I see this often as well. It seems that flanking sequences around large inversions tend to have homology, which causes the overlap. It is difficult to resolve such overlaps with the current minimap2 algorithm.

mnshgl0110 commented 3 years ago

Hi Heng,

Thanks for looking into it.

Continuing https://github.com/lh3/minimap2/issues/823

The current HEAD still requires some more improvements. The 10kbp inversion is still not identified while the 5kbp inversion is only aligned for 1141bps.

Best Manish

lh3 commented 3 years ago

Could you send me the data? Thanks.

mnshgl0110 commented 3 years ago

Here it is:

ref.txt seq_inv5000.fa.txt seq_inv10000.fa.txt

My command: /software/TMP/minimap2_HEAD/minimap2 -cx asm5 ref.txt seq_inv10000.fa.txt

lh3 commented 3 years ago

Thanks. Long inv and short inv are handled differently. 10k is right on the threshold between long and short, which leads to a corner case that was not considered. The 10k example is fixed on github HEAD.

The 5k one is not fixed. Actually all versions of minimap2 have this issue. It is triggered more frequently as a side-effect of finding long indels. I will think about a solution later.

PS: I knew the 5k issue would happen. It is great that I have an example that can trigger the issue.

mnshgl0110 commented 3 years ago

Thanks for fixing this. I tested on some more simulated inversion and it worked fine. The 5k issue didn't happen in any other case, so cant share more examples.

Another related question is that minimap2 does not report small (100bp) inverted alignments. Is it possible to adjust some parameters to increase the sensitivity of minimap2 in identifying more of these smaller inversions?

Sample inversion: seq_inv100.fa.txt

lh3 commented 3 years ago

Minimap2 is not intended to call 100bp inversions. You can force minimap2 to align small inversions with

minimap2 -cxasm20 -z200,50 -s50

but I am not sure this would be an improvement.

mnshgl0110 commented 3 years ago

Thanks for sharing the parameters. I understand the limitations in finding small genomic rearrangements.

lh3 commented 2 years ago

@mrvollger Although v2.23 fixed the chr16 inversion, I noticed minimap2 missed another inversion on chr8 chr3. The new v2.24 release is using a more robust solution and has this fixed.

mrvollger commented 2 years ago

Thanks @lh3! Out of curiosity was the issue in the beta def region ~6-14Mbp?

lh3 commented 2 years ago

Sorry, I meant to say chr3, towards the end of chr3. I haven't checked that chr8 inversion yet.

mrvollger commented 2 years ago

Got it, thanks!

mrvollger commented 2 years ago

@lh3 I wrote a utility that tries to find an optimal split when the same query sequence is mapped twice over multiple alignments. Do you have any examples of this that I might do some further testing on? I have been testing on NOTCH2NL but additional tests would be great.