lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.8k stars 412 forks source link

Invalid Overlap while running canu correction on RNASeq files with minimap2 #30

Closed gringer closed 7 years ago

gringer commented 7 years ago

I'm getting 'INVALID OVERLAP' messages from canu when running read correction on RNASeq files with minimap2. Here's one of the outputs:

$ cat  mmap.000015.out 
Running job 15 based on command line options.
Fetch blocks/000006.fasta
Fetch blocks/000007.fasta
[M::main::1.697*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.170*1.00] mid_occ = 119
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.481*1.00] distinct minimizers: 21803167 (90.81% are singletons); average occurrences: 1.405; average spacing: 6.096
[M::worker_pipeline::151.206*11.13] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000005.mmi ./blocks/000005.fasta
[M::main] Real time: 151.244 sec; CPU: 1682.280 sec
[M::main::1.601*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.029*1.00] mid_occ = 119
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.343*1.00] distinct minimizers: 21803167 (90.81% are singletons); average occurrences: 1.405; average spacing: 6.096
[M::worker_pipeline::236.382*11.22] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000005.mmi queries/000015/000006.fasta
[M::main] Real time: 236.419 sec; CPU: 2652.084 sec
[M::main::1.630*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.084*1.00] mid_occ = 119
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.399*1.00] distinct minimizers: 21803167 (90.81% are singletons); average occurrences: 1.405; average spacing: 6.096
[M::worker_pipeline::278.415*11.23] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000005.mmi queries/000015/000007.fasta
[M::main] Real time: 278.465 sec; CPU: 3126.120 sec
INVALID OVERLAP   775356 (len   2736)   768001 (len   1596) hangs   1205   1300 - 2093699   4806 flip 1

The offending minimap2 line for this overlap step seems to be the following (headers from canu source code, minimap/mmapConvert.C have been added):

aiid    alen     bgn     end bori         biid  blen     bgn     end #match   minimizers   alnlen     cm:i:errori
775356  2736    1205    1436    -       768001  1596    4806    5049    192          259        0          tp:A:I  cm:i:0  s1:i:0  s2:i:0  NM:i:67 ms:i:96 AS:i:96 nn:i:0  cg:Z:19M1D4M1D2M1D12M1I4M2D16M1I5M2D3M1I25M6I3M1I7M1D6M3D9M1I2M1I15M3I5M1I5M1D2M2D5M1D4M2D25M1D7M1D3M2D10M2D2M2D2M1D6M2D7M

I can't easily work out if this is a minimap2 issue or a canu issue. It seems a bit odd that the start and end of a match region exceed the sequence length (i.e. blen 1596, bgn 4806; end 5049), but that could just be an unexpected file format change between minimap and minimap2.

Metadata

Canu v1.6 command:

~/install/canu/canunu-1.6/Linux-amd64/bin/canu -correct -nanopore-corrected fastq_runid_*.fastq -p Olivier_GL261_cDNA_2017-Aug-04 -d Olivier_GL261_cDNA_2017-Aug-04 corOutCoverage=all minReadLength=100 minOverlapLength=50 genomeSize=500M overlapper=minimap

The minimap2 executable was copied into the canu directory:

$ ~/install/canu/canu-1.6/Linux-amd64/bin/minimap2  -V
2.2-r424-dirty

Running on Debian Linux: Linux elegans 4.9.0-3-amd64 #1 SMP Debian 4.9.25-1 (2017-05-02) x86_64 GNU/Linux

gringer commented 7 years ago

The display code from format.c:

   195  void mm_write_paf(kstring_t *s, const mm_idx_t *mi, const mm_bseq1_t *t, const mm_reg1_t *r, void *km, int opt_flag)
   196  {
   197          s->l = 0;
   198          mm_sprintf_lite(s, "%s\t%d\t%d\t%d\t%c\t", t->name, t->l_seq, r->qs, r->qe, "+-"[r->rev]);
   199          if (mi->seq[r->rid].name) mm_sprintf_lite(s, "%s", mi->seq[r->rid].name);
   200          else mm_sprintf_lite(s, "%d", r->rid);
   201          mm_sprintf_lite(s, "\t%d\t%d\t%d", mi->seq[r->rid].len, r->rs, r->re);
   202          if (r->p) mm_sprintf_lite(s, "\t%d\t%d", r->p->blen - r->p->n_ambi - r->p->n_diff, r->p->blen);
   203          else mm_sprintf_lite(s, "\t%d\t%d", r->fuzzy_mlen, r->fuzzy_blen);
   204          mm_sprintf_lite(s, "\t%d", r->mapq);
   205          write_tags(s, r);
   206          if (r->p && (opt_flag & MM_F_OUT_CG)) {
   207                  uint32_t k;
   208                  mm_sprintf_lite(s, "\tcg:Z:");
   209                  for (k = 0; k < r->p->n_cigar; ++k)
   210                          mm_sprintf_lite(s, "%d%c", r->p->cigar[k]>>4, "MIDN"[r->p->cigar[k]&0xf]);
   211          }
   212          if (r->p && (opt_flag & MM_F_OUT_CS))
   213                  write_cs(km, s, mi, t, r);
   214  }

It seems a bit odd that it's mi->seq[r->rid].len instead of r->l_seq, when r->qs and r->qe are used.

lh3 commented 7 years ago

This does seem like a minimap2 bug. However, I need your help to move further. Could you add the following lines to mm_write_paf():

if (!(r->rs < r->re && r->re <= mi->seq[r-rid].len)) {
  fprintf(stderr, "%s <=> %s\n", t->name, mi->seq[r->rid].name);
  abort();
}

This essentially raises an assertion failure and prints out violating sequence pairs if coordinates are incorrect. By the way, there is no mm_reg1_t::l_seq. mi->seq[r->rid].len gets the reference sequence length. That part of code is correct.

Thanks!

gringer commented 7 years ago

Output from canu (from a different failed run):

$ cat mmap.000005.out 
Running job 5 based on command line options.
Fetch blocks/000003.fasta
Fetch blocks/000004.fasta
[M::main::2.008*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.603*1.00] mid_occ = 126
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.945*1.00] distinct minimizers: 21028426 (90.26% are singletons); average occurrences: 1.438; average spacing: 6.098
370985 <=> 192001
Aborted
[M::main::1.864*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.369*1.00] mid_occ = 126
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.716*1.00] distinct minimizers: 21028426 (90.26% are singletons); average occurrences: 1.438; average spacing: 6.098
[M::worker_pipeline::6.438*5.90] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000002.mmi queries/000005/000003.fasta
[M::main] Real time: 6.461 sec; CPU: 38.024 sec
[M::main::1.520*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.037*1.00] mid_occ = 126
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.412*1.00] distinct minimizers: 21028426 (90.26% are singletons); average occurrences: 1.438; average spacing: 6.098
[M::worker_pipeline::6.041*6.39] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000002.mmi queries/000005/000004.fasta
[M::main] Real time: 6.087 sec; CPU: 38.676 sec
mv: cannot stat './results/000005.mmap.counts': No such file or directory

Edit: More interesting is that this modification of minimap2 allows canu to continue because the offending lines don't appear in the output (even though the output is incomplete...).

lh3 commented 7 years ago

Could you extract the sequence pair "370985" and "192001", or could you send me the two blocks? I have run minimap2 on some pacbio reads, but could not reproduce the issue on my dataset. Without the sequences that trigger the bug, it is quite difficult to fix it. Thanks.

gringer commented 7 years ago

This is weird. It's not failing when I run with just those two sequences, but it is for the whole block (both indexed and non-indexed). Anyway, here are the two sequences:

>370985
TGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTAATTCGTTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTTGGTGGTGTGTGTGTGTGG
AGTGTGTGTGTGTGTGGAGTGTGTGTGTGTGTTGGTGTGTGGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGAGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGAGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTTGGTGATGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGAG
TGTGTGTGTGTGTGTGGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGT
GTGTGGTGATCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGATCATTGATGATCAATTGATGATTGATGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTTTGGTGATGTTGTTGTTGGTATTGTTATTGTTATGTTGTTGTTGTTGTTGTGTTATT
GTTGGTATTATTATTGTTATTCAGTATTGTTATTGTGTGTATTATTCATTGTTGATGTGC
TATTATTGTTATTGTTATTGTTATTGTTGTTATTATTTATTGTTCCGCTTCCTTCCTTTG
TACTTGTTATTCTTATTCCATTTCTTCCCTTCTTCCAGCCCCATGCTGCCACCTTATTGT
AATACCCATTAACCAACCTTCGCCTTCGCCCAACTACTGCCTTCCTTCGCCTTAACCTTC
CTTAGCCTTAGCCAGCCTTCCAACCTTCTTAACTGGCCAGCCTTAACCCAGCCCACGCCC
ACCTTCTTCCTTCTTCGCTAACCTTCGCCTTGTGCCTTAACCTTTAACCTTGTCCAGCCA
GCCAGCCAGCCAGCCTTCTTCTTCCTTCCTTCTTCCGGCCTTCGCTATTGGCCTTGTGTG
TTATTGTGTATTTATTGTATGTGTATTATTGTGCCTTCTTCTTATGTGTATTATTGTATT
ATTGTATTGTATGTTGTATTATTGTGTGTATTGGCCTTCTTCTTCTTCTTCTTTACAACC
CCACCCTTCTTCCCTTTATTGTACCCACACCCTTCCCATGCCACCTTTCTTATTGTGTTA
ACCTTGTATTCCTTGTGTGTTATTGTGTTATTCTTATTGTGGCCATTATTGTGTTATTAT
TTATTATTTTATTGTATTCTATTGTGTGTATTGTGTGTGTGTGTGTGTGTATTTGTTATT
GCTTGACTGTATTGTGTGCCATTGTGTGTGACCTTACCTTATGTTCCAGTACTAGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGCTGTGTGTGTGTGTGTGTGTGTGTGTGTATTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTTGTGTTATTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTAGTATTGTGTGTGTGTGTGTGTGTGTGTGTGTTATTATTGTTATTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTTATTGTGTGTTATTATTATTGTGTGTGTATTG
TGTATTATTATTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTTATTGGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTTGGTGGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGGTGTG
TGTGTGTGTGTGTGTGTGTGTGTCAAGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTG
TGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGAAGTGTGTGTGTG
TGATCAGTGTGTCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGGTGGGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTTG
>192001
TTGTTGTACTTCGTTCGGTTACATATTGCTGGCGTCTGCTTATCGAGCAAAAGGGGTTCC
AAGTCAGAGGTTCCTTTTCTGTTGGACAACGATGTTGCTGCCATTACGGCGGACCGGCTT
GGTGGGATGGCGCTGCAGGTCGGTTCGCTTCTGCGCCGTCTCCAGAGCGTCTTTCCGTTC
CCCGCGCGGAACGCGCTCGTGAACGCGCCACACGCCAGGGGCCATGCGACGAGGGCGGCC
GAAGGCTCTACATTGAGGCAACATCGGATGGCACTTCAAACGTCTTGGAAACTTGCTGAA
ATGATGTACCAAGGGTCCCAGCACGATGGTCTACACACATTCCAGACCCTTTCTTTCCTT
GAGCCGTCTGGAGGTGCAGCTGGAGCCCATCAGGAGACTCCCTGCAGGCAGAGAAGTCTG
TGCAGGTCCCAGAGAGGTCTGTGTACAGTGACAGGTATATCTTTGCAAAAGAATCTATTT
GAAAATGGCTCCCTCAGTGACATCAGGTGGCATCTATCAGGACTGGCACTCCTTTCTCCT
GCAGGGAGTTCCGCAAACCAGCTTTGTTACACGGCTTCATCTATCTCCAGGCTTCACCCA
GGTTTGTGGAGAGACTCTACCAGAGATGAAGAAGAAAAAGGACATTGAGCTGGCCTATCT
TCAACAGCCCCATAGTCAGCACGAAGACTGGTTTTATTAACAAGACTACCAAGCTCCGCT
GAAGCTCTGCAGCACGTGCCAGTGCTGGTGCTGGATCCACTGAGGCTTCCACTCTGAAAA
TGCCCGCCAGGCAGGGGAGAGCTCATGGGACAGGTGAACACCTTTATGAGGAACTTAACA
ATAAGGACGTGGCTGTGGTCTGGACTCTGACTTTCTCAGATCTGCGACTGCTCTTATGCA
CATCCCCTTTGGCCCCTCAGCCGCCCCCACACCTGGGTGTGTGCTGTCTTTATTACTGCA
TTGTTTTGTTTTTTAATCTCAAGCACCTTGAAATAAGAAAGCTGACTTTGCTATTTTTCT
AAAAAGATAGAGCGACAGGCAGAGGAACCTCTCTTGGAACCTCTCGGTTAAACACCCAAG
CAGACGCCAGCAATACGTAACGC

The entire indexed block can be found temporarily in my Dropbox code_scratch folder here. This index causes the issue when paired with the above highly-repetitive sequence.

lh3 commented 7 years ago

I have fixed the bug via 95eb1de. Thanks a lot for your example. It would be almost impossible without proper test cases.

By the way, usually I wouldn't recommend to perform alignment (option -c) in the overlapping mode. First, this is much slower. Second, it introduces false overlaps. On pacbio C. elegans reads, doing alignment leads to a few more misassemblies.

I am closing this issue. If you the see same problem again, feel free to reopen it.

gringer commented 7 years ago

Looking at the fix, does this mean that all reverse mappings were given the wrong ID? I'm just wondering if I should remap my sequences with the fixed minimap2.

lh3 commented 7 years ago

It is inversion, not reverse.