Closed gringer closed 7 years ago
The display code from format.c
:
195 void mm_write_paf(kstring_t *s, const mm_idx_t *mi, const mm_bseq1_t *t, const mm_reg1_t *r, void *km, int opt_flag)
196 {
197 s->l = 0;
198 mm_sprintf_lite(s, "%s\t%d\t%d\t%d\t%c\t", t->name, t->l_seq, r->qs, r->qe, "+-"[r->rev]);
199 if (mi->seq[r->rid].name) mm_sprintf_lite(s, "%s", mi->seq[r->rid].name);
200 else mm_sprintf_lite(s, "%d", r->rid);
201 mm_sprintf_lite(s, "\t%d\t%d\t%d", mi->seq[r->rid].len, r->rs, r->re);
202 if (r->p) mm_sprintf_lite(s, "\t%d\t%d", r->p->blen - r->p->n_ambi - r->p->n_diff, r->p->blen);
203 else mm_sprintf_lite(s, "\t%d\t%d", r->fuzzy_mlen, r->fuzzy_blen);
204 mm_sprintf_lite(s, "\t%d", r->mapq);
205 write_tags(s, r);
206 if (r->p && (opt_flag & MM_F_OUT_CG)) {
207 uint32_t k;
208 mm_sprintf_lite(s, "\tcg:Z:");
209 for (k = 0; k < r->p->n_cigar; ++k)
210 mm_sprintf_lite(s, "%d%c", r->p->cigar[k]>>4, "MIDN"[r->p->cigar[k]&0xf]);
211 }
212 if (r->p && (opt_flag & MM_F_OUT_CS))
213 write_cs(km, s, mi, t, r);
214 }
It seems a bit odd that it's mi->seq[r->rid].len
instead of r->l_seq
, when r->qs
and r->qe
are used.
This does seem like a minimap2 bug. However, I need your help to move further. Could you add the following lines to mm_write_paf()
:
if (!(r->rs < r->re && r->re <= mi->seq[r-rid].len)) {
fprintf(stderr, "%s <=> %s\n", t->name, mi->seq[r->rid].name);
abort();
}
This essentially raises an assertion failure and prints out violating sequence pairs if coordinates are incorrect. By the way, there is no mm_reg1_t::l_seq
. mi->seq[r->rid].len
gets the reference sequence length. That part of code is correct.
Thanks!
Output from canu (from a different failed run):
$ cat mmap.000005.out
Running job 5 based on command line options.
Fetch blocks/000003.fasta
Fetch blocks/000004.fasta
[M::main::2.008*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.603*1.00] mid_occ = 126
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.945*1.00] distinct minimizers: 21028426 (90.26% are singletons); average occurrences: 1.438; average spacing: 6.098
370985 <=> 192001
Aborted
[M::main::1.864*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.369*1.00] mid_occ = 126
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.716*1.00] distinct minimizers: 21028426 (90.26% are singletons); average occurrences: 1.438; average spacing: 6.098
[M::worker_pipeline::6.438*5.90] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000002.mmi queries/000005/000003.fasta
[M::main] Real time: 6.461 sec; CPU: 38.024 sec
[M::main::1.520*1.00] loaded/built the index for 192000 target sequence(s)
[M::mm_mapopt_update::2.037*1.00] mid_occ = 126
[M::mm_idx_stat] kmer size: 17; skip: 11; is_HPC: 0; #seq: 192000
[M::mm_idx_stat::2.412*1.00] distinct minimizers: 21028426 (90.26% are singletons); average occurrences: 1.438; average spacing: 6.098
[M::worker_pipeline::6.041*6.39] mapped 192000 sequences
[M::main] Version: 2.2-r424-dirty
[M::main] CMD: /home/gringer/install/canu/canu-1.6/Linux-amd64/bin/minimap2 -x ava-ont -c -k17 -w11 -t 12 ./blocks/000002.mmi queries/000005/000004.fasta
[M::main] Real time: 6.087 sec; CPU: 38.676 sec
mv: cannot stat './results/000005.mmap.counts': No such file or directory
Edit: More interesting is that this modification of minimap2 allows canu to continue because the offending lines don't appear in the output (even though the output is incomplete...).
Could you extract the sequence pair "370985" and "192001", or could you send me the two blocks? I have run minimap2 on some pacbio reads, but could not reproduce the issue on my dataset. Without the sequences that trigger the bug, it is quite difficult to fix it. Thanks.
This is weird. It's not failing when I run with just those two sequences, but it is for the whole block (both indexed and non-indexed). Anyway, here are the two sequences:
>370985
TGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTAATTCGTTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTTGGTGGTGTGTGTGTGTGG
AGTGTGTGTGTGTGTGGAGTGTGTGTGTGTGTTGGTGTGTGGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTCGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGGAGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGAGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTTGGTGATGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGAG
TGTGTGTGTGTGTGTGGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGT
GTGTGGTGATCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGATCATTGATGATCAATTGATGATTGATGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTTTGGTGATGTTGTTGTTGGTATTGTTATTGTTATGTTGTTGTTGTTGTTGTGTTATT
GTTGGTATTATTATTGTTATTCAGTATTGTTATTGTGTGTATTATTCATTGTTGATGTGC
TATTATTGTTATTGTTATTGTTATTGTTGTTATTATTTATTGTTCCGCTTCCTTCCTTTG
TACTTGTTATTCTTATTCCATTTCTTCCCTTCTTCCAGCCCCATGCTGCCACCTTATTGT
AATACCCATTAACCAACCTTCGCCTTCGCCCAACTACTGCCTTCCTTCGCCTTAACCTTC
CTTAGCCTTAGCCAGCCTTCCAACCTTCTTAACTGGCCAGCCTTAACCCAGCCCACGCCC
ACCTTCTTCCTTCTTCGCTAACCTTCGCCTTGTGCCTTAACCTTTAACCTTGTCCAGCCA
GCCAGCCAGCCAGCCTTCTTCTTCCTTCCTTCTTCCGGCCTTCGCTATTGGCCTTGTGTG
TTATTGTGTATTTATTGTATGTGTATTATTGTGCCTTCTTCTTATGTGTATTATTGTATT
ATTGTATTGTATGTTGTATTATTGTGTGTATTGGCCTTCTTCTTCTTCTTCTTTACAACC
CCACCCTTCTTCCCTTTATTGTACCCACACCCTTCCCATGCCACCTTTCTTATTGTGTTA
ACCTTGTATTCCTTGTGTGTTATTGTGTTATTCTTATTGTGGCCATTATTGTGTTATTAT
TTATTATTTTATTGTATTCTATTGTGTGTATTGTGTGTGTGTGTGTGTGTATTTGTTATT
GCTTGACTGTATTGTGTGCCATTGTGTGTGACCTTACCTTATGTTCCAGTACTAGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGCTGTGTGTGTGTGTGTGTGTGTGTGTGTATTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTTGTGTTATTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTAGTATTGTGTGTGTGTGTGTGTGTGTGTGTGTTATTATTGTTATTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTTATTGTGTGTTATTATTATTGTGTGTGTATTG
TGTATTATTATTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTTATTGGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTTGTTGGTGGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGGTGTG
TGTGTGTGTGTGTGTGTGTGTGTCAAGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTG
TGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGAAGTGTGTGTGTG
TGATCAGTGTGTCAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGGTGGGAGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTG
TGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGTGT
GTGTGTGTGTTG
>192001
TTGTTGTACTTCGTTCGGTTACATATTGCTGGCGTCTGCTTATCGAGCAAAAGGGGTTCC
AAGTCAGAGGTTCCTTTTCTGTTGGACAACGATGTTGCTGCCATTACGGCGGACCGGCTT
GGTGGGATGGCGCTGCAGGTCGGTTCGCTTCTGCGCCGTCTCCAGAGCGTCTTTCCGTTC
CCCGCGCGGAACGCGCTCGTGAACGCGCCACACGCCAGGGGCCATGCGACGAGGGCGGCC
GAAGGCTCTACATTGAGGCAACATCGGATGGCACTTCAAACGTCTTGGAAACTTGCTGAA
ATGATGTACCAAGGGTCCCAGCACGATGGTCTACACACATTCCAGACCCTTTCTTTCCTT
GAGCCGTCTGGAGGTGCAGCTGGAGCCCATCAGGAGACTCCCTGCAGGCAGAGAAGTCTG
TGCAGGTCCCAGAGAGGTCTGTGTACAGTGACAGGTATATCTTTGCAAAAGAATCTATTT
GAAAATGGCTCCCTCAGTGACATCAGGTGGCATCTATCAGGACTGGCACTCCTTTCTCCT
GCAGGGAGTTCCGCAAACCAGCTTTGTTACACGGCTTCATCTATCTCCAGGCTTCACCCA
GGTTTGTGGAGAGACTCTACCAGAGATGAAGAAGAAAAAGGACATTGAGCTGGCCTATCT
TCAACAGCCCCATAGTCAGCACGAAGACTGGTTTTATTAACAAGACTACCAAGCTCCGCT
GAAGCTCTGCAGCACGTGCCAGTGCTGGTGCTGGATCCACTGAGGCTTCCACTCTGAAAA
TGCCCGCCAGGCAGGGGAGAGCTCATGGGACAGGTGAACACCTTTATGAGGAACTTAACA
ATAAGGACGTGGCTGTGGTCTGGACTCTGACTTTCTCAGATCTGCGACTGCTCTTATGCA
CATCCCCTTTGGCCCCTCAGCCGCCCCCACACCTGGGTGTGTGCTGTCTTTATTACTGCA
TTGTTTTGTTTTTTAATCTCAAGCACCTTGAAATAAGAAAGCTGACTTTGCTATTTTTCT
AAAAAGATAGAGCGACAGGCAGAGGAACCTCTCTTGGAACCTCTCGGTTAAACACCCAAG
CAGACGCCAGCAATACGTAACGC
The entire indexed block can be found temporarily in my Dropbox code_scratch folder here. This index causes the issue when paired with the above highly-repetitive sequence.
I have fixed the bug via 95eb1de. Thanks a lot for your example. It would be almost impossible without proper test cases.
By the way, usually I wouldn't recommend to perform alignment (option -c) in the overlapping mode. First, this is much slower. Second, it introduces false overlaps. On pacbio C. elegans reads, doing alignment leads to a few more misassemblies.
I am closing this issue. If you the see same problem again, feel free to reopen it.
Looking at the fix, does this mean that all reverse mappings were given the wrong ID? I'm just wondering if I should remap my sequences with the fixed minimap2.
It is inversion, not reverse.
I'm getting 'INVALID OVERLAP' messages from canu when running read correction on RNASeq files with minimap2. Here's one of the outputs:
The offending minimap2 line for this overlap step seems to be the following (headers from canu source code,
minimap/mmapConvert.C
have been added):I can't easily work out if this is a minimap2 issue or a canu issue. It seems a bit odd that the start and end of a match region exceed the sequence length (i.e. blen 1596, bgn 4806; end 5049), but that could just be an unexpected file format change between minimap and minimap2.
Metadata
Canu v1.6 command:
The
minimap2
executable was copied into the canu directory:Running on Debian Linux: Linux elegans 4.9.0-3-amd64 #1 SMP Debian 4.9.25-1 (2017-05-02) x86_64 GNU/Linux