lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.81k stars 412 forks source link

Order of operations in CIGAR #1118

Open baraaorabi opened 1 year ago

baraaorabi commented 1 year ago

I found that Minimap2 sometimes generates an unusual order of CIGAR opertations; with some Ns, followed by some Is then followed by more Ns instead of the more expected order of having all the Ns combined together.

Here is an example that produces this result:

$ minimap2 -c -x splice -t 32 data/refs/homo_sapiens.dna.fa <(echo -e ">1\nCATGGACTCATCCAGGATACATCAGGAAC
TCAATGGCAAAAAAATCCAATTTAAAAATGAACAAAATACCTAAATAGACACCTCTCAAAGAAGAAACACAAGGCCAGGCGCAGTGGCTCATGCCTATAATCCCAGCACTTTGGGAGGCTGAGGCAGGAGGATTGCCTGAGCTCGGGAGTTCGAGACC
AGCCTGGGCAACACGGTGAAACCCCGTCTCTACTAAAAATACAAAAATTAGCTGGGCATGGTGGCGCACGCGTGTACGCCAGCTACTGGAAGGCTGAGACAGGAGAATCACTGGAACCTGGGAGGCGGAGGTTTCAGTAAGCCAAGATTGCACCACTG
CACTCCAGCCTGGGTGACAAAGCAAGACTCCATCTCAAAAAAAAAAAAAAAAAAAAAAAAAGCTAGATTCACAACATCGATGTTCCGGAGATCGGAAGAGCGTCGTGTAGAGGTTAAACACCCAAGCAGACGCCGCAATATCAGCACCAACAGAAAAC
AGACGACTACAAACGGAATCGAGTCACGGTAGGCGATAATCATAAACACCAGCTCCTAGCAACTGAACGAAGCACACTGATGACAAGAAAGTTGTCGGTGTCTTTGTGACTTGCCTGTCGCTCTATCTTCGGCGTCTGCTTGGGTGTTTAACCTCTGC
CACGACGCTCTTCCGATCTCAGCAGCAGATGCTTCTATTCTCTCGGTTTTTTTTTTTTTTTTTTTTTTTTTTACATTTCAAAATATTTAACAAAGTCAAACTTTCTCACCATGGTTTCAGTTTAGTGGAAGCATTTACTAAAGTACAAAAAGCCTCAG
AAAACGTGATGGGCAATATCTGGGCCCCAAGTTACCAGAAAGGGCACCAGCCAATATAGCACTGGCAGAGGTTTTCATGGGATGTCGCTTGTTTGATGAGCAGCTCAACTTGCGTTGGAACATTCAAAGTGTCATCATGAGAGAAGTCCCGACCAGTG
AGCTTATCTCTGAACCCTGTTAATAATCTGATAGCTTTTCTTCCTGGCGTGTACTCTGCGTTGATACCACCAAGCTAGGTTAAACACCCAAGCAGACGCC")
[M::mm_idx_gen::89.011*1.72] collected minimizers
[M::mm_idx_gen::99.459*3.62] sorted minimizers
[M::main::99.459*3.62] loaded/built the index for 194 target sequence(s)
[M::mm_mapopt_update::106.313*3.45] mid_occ = 765
[M::mm_idx_stat] kmer size: 15; skip: 5; is_hpc: 0; #seq: 194
[M::mm_idx_stat::109.605*3.38] distinct minimizers: 167225302 (35.46% are singletons); average occurrences: 6.030; average spacing: 3.074; total length: 3099750718
1       1077    101     1015    +       1       248956422       10975125        11109332        438     920     39      NM:i:482        ms:i:332        AS:i:182      nn:i:0  ts:A:-  tp:A:P  cm:i:50 s1:i:219        s2:i:101        de:f:0.0894     rl:i:320        cg:Z:61M2I99M2D1M1I9M1D24M183N437I131864N72M1I11M2D14M1I32M680N106M1003N15M1I18M1D9M\

You can see that after the first exon, 101S61M2I99M2D1M1I9M1D24M, there are 183N skip, followed by 437I insertion, followed by another skip of 131864N.

baraaorabi commented 1 year ago

P.S.: This might be related to #502?

lh3 commented 1 year ago

Yes, related but not the same. We may move 437I around in this case, though there will still be a CIGAR like xxxIyyyN, which is hard to resolve.

baraaorabi commented 1 year ago

Can it not be resolved by linear time postprocessing? Something like anochoring the (mis)matches in the CIGAR string, and then sorting-and-merging any other CIGAR operations between (mis)match anchors?

lh3 commented 1 year ago

No, can't.