lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2
Other
1.77k stars 404 forks source link

Forcing continuous alignments #700

Open rlorigro opened 3 years ago

rlorigro commented 3 years ago

Hi,

I am using minimap2 to map long ONT reads (up to 400Kbp), and I am finding that the supplementary chaining is producing large rearrangements in the read sequence, flipping strands, and jumping across regions. I have tried using the -m parameter to limit chaining, and several other parameters, but none of them seem to force the primary alignment to be linear and unidirectional.

I should emphasize that the reference I am aligning to is the same individual that the reads are derived from, so I expect there to be no SVs. Here is one example:

Read       Length Start  Stop      Region RefLength  Start      Stop
2626312    448022 112    922    +  chr2   242696747  130488450  130489279
2626312    448022 7047   7861   +  chr2   242696747  130489870  130490701
2626312    448022 41024  447988 +  chr2   242696747  130533922  130950797
2626312    448022 3105   3985   +  chr2   242696747  130885322  130886221
2626312    448022 12041  40959  -  chr2   242696747  131672947  131702229
2626312    448022 5227   5854   -  chr2   242696747  131710435  131711099
2626312    448022 2039   2944   -  chr2   242696747  131707727  131708642

Is there a parameter (or combination of parameters) that would help with this? I think what is needed is to allow the primary alignment to extend further.

Thanks

rlorigro commented 3 years ago

I should also add that this is a repetitive region, and that I have been using winnowmap to accommodate for that.

lh3 commented 3 years ago

What is the winnowmap2 output?

rlorigro commented 3 years ago

The example I posted was using winnowmap already. In that case, I aligned HG002 reads to CHM13 v1.0, but in future runs I aligned HG002 reads to the HiFiasm HG002 assembly (~190x CCS coverage from here) and observed a similar problem, in other reads. More detailed information:

Command:

meryl count k=15 output merylDB /home/ryan/data/hifi/human/hg002/HG002.mat.fa
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt

winnowmap \
-W /home/ryan/data/hifi/human/hg002/repetitive_k15.txt \
-x map-ont \
-t 7 \
--secondary=no \
/home/ryan/data/hifi/human/hg002/HG002.mat.fa \
/home/ryan/data/nanopore/human/test/overlap_guppy_360_tangle_subset_1/HG002-Guppy-3.6.0-UL-run3-subset1.fasta \
> HG002-Guppy-3.6.0-UL-run3-subset1_VS_HG002_mat_primary.paf

stderr:

[M::mm_idx_gen::0.035*0.04] reading downweighted kmers
[M::mm_idx_gen::0.065*0.24] collected downweighted kmers, no. of kmers read=66775
[M::mm_idx_gen::0.065*0.24] saved the kmers in a bloom filter: hash functions=2 and size=960072 
[M::mm_idx_gen::98.885*1.08] collected minimizers
[M::mm_idx_gen::100.291*1.16] sorted minimizers
[M::main::100.291*1.16] loaded/built the index for 546 target sequence(s)
[M::main::100.291*1.16] running winnowmap in SV-aware mode
[M::main::100.291*1.16] stage1-specific parameters minP:1000, incP:4.00, maxP:16000, sample:1000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::100.291*1.16] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 546
[M::mm_idx_stat::100.479*1.16] distinct minimizers: 23589668 (41.44% are singletons); average occurrences: 5.194; average spacing: 25.013
[M::worker_pipeline::111.965*1.78] mapped 2273 sequences
[M::main] Version: 2.0, pthreads=7, omp_threads=3
[M::main] CMD: winnowmap -W /home/ryan/data/hifi/human/hg002/repetitive_k15.txt -x map-ont -t 7 --secondary=no /home/ryan/data/hifi/human/hg002/HG002.mat.fa /home/ryan/data/nanopore/human/test/overlap_guppy_360_tangle_subset_1/HG002-Guppy-3.6.0-UL-run3-subset1.fasta
[M::main] Real time: 112.009 sec; CPU: 199.570 sec; Peak RSS: 3.884 GB

And some weird alignments I observed:

Jumping between contigs (not bridging contig termini):

736840  41280   2057    39942   +   h2tg000102l 36105583    464829  502652  7336    38234   60  tp:A:P  cm:i:541    s1:i:7250   s2:i:802    rl:i:0
736840  41280   75  968 +   h2tg000005l 111658231   629986  630886  327 910 60  tp:A:P  cm:i:27 s1:i:324    s2:i:0  rl:i:0

Reversing strand on the same contig:

207028  74088   33102   73900   +   h2tg000102l 36105583    3378    44916   7102    41787   60  tp:A:P  cm:i:510    s1:i:6921   s2:i:0  rl:i:0
207028  74088   70  36902   -   h2tg000102l 36105583    43512   80920   5749    37648   60  tp:A:P  cm:i:415    s1:i:5587   s2:i:194    rl:i:0

Jumping intra + inter contig

2027405 47862   34  37932   -   h2tg000005l 111658231   106444  145003  6510    38787   60  tp:A:P  cm:i:481    s1:i:6327   s2:i:968    rl:i:0
2027405 47862   39262   39765   -   h2tg000005l 111658231   712065  712577  90  519 56  tp:A:P  cm:i:6  s1:i:85 s2:i:0  rl:i:0
2027405 47862   41040   47808   -   h2tg000102l 36105583    146693  153675  614 7023    60  tp:A:P  cm:i:43 s1:i:561    s2:i:0  rl:i:0
rlorigro commented 3 years ago

There could still be SVs because the reads are not binned, and this is the maternal haplotype from HiFiasm, but in general I would just like to forbid chaining with large gaps and favor extending the primary alignment.

lh3 commented 3 years ago

If you can give me just these reads, I can have a look. Nonetheless, I suspect those are the best you can get.

I should emphasize that the reference I am aligning to is the same individual that the reads are derived from, so I expect there to be no SVs.

Assembly can be wrong. Reads can be chimeric both locally and globally.

rlorigro commented 3 years ago

OK, here are the 3 reads from above: https://rlorigro-public-files.s3-us-west-1.amazonaws.com/reads/guppy_360/misaligned_reads_guppy_3.6.0_HG002.fasta

There are many more examples like these so I would be surprised if they are all chimeras. These reads are also dumped from Shasta, which is usually done after some filtering for chimerism. I think the issue is that there are repeats in this region, and the primary/supplementary segments are being split amongst the repeats.

Also I noticed that winnowmap is running with "sv-aware mode", but I couldn't find more info about this. Does this affect chaining and supplementaries? Maybe a question for @cjain7

lh3 commented 3 years ago

I guess --sv-aware implements the algorithm in the winnowmap2 preprint.

rlorigro commented 3 years ago

Ok i will take a closer look at the paper. I just tried rerunning with --sv-off and it appears to have reduced these rearrangements.

rlorigro commented 3 years ago

With sv-off I do still get some strange ones though, unfortunately:

2497450 60236   146 29307   -   h2tg000005l 111658231   848222  878809  3055    30728   46  tp:A:P  cm:i:219    s1:i:2764   s2:i:2331   rl:i:636
2497450 60236   12923   60193   -   h2tg000102l 36105583    312140  361719  8282    49837   60  tp:A:P  cm:i:616    s1:i:7919   s2:i:543    rl:i:636

After looking some more at the paper, I am wondering if disabling sv-aware mode will defeat the purpose of using winnowmap over minimap.

ekg commented 3 years ago

@rlorigro wfmash -m -N -p 90 will generate contiguous approximate mappings of nanopore reads. You can remove -m to get a base-level alignment obtained by global pairwise alignment of each mapping result. wfmash lacks mapping quality, but it could provide you a baseline here. We're just beginning tests on non-assembly inputs, so please let me know if you try it and it works for you.

rlorigro commented 3 years ago

Thanks @ekg, this is interesting. As it turns out, I've been digging into these issues more and have found that @lh3 was right about many of them being chimeric. A majority of the chimeras were palindromes, where they reverse direction about halfway through the sequence, and a smaller portion of them were conventional chimeras, where a chunk of the read appears to have been a concatenation of totally unrelated sequence. For the palindromes, it seems that nanopore has issues with pulling in the complementary strand directly after the first strand finishes translocating, and the signal segmenter doesn't catch the transition, so both strands of the signal are sent to the basecaller as a single read.

There are occasionally still some unexplained artifacts, where supplementaries are almost entirely contained inside other alignments, but fortunately the mapQ score for these is very low or 0 in the cases I have observed so far.

So overall I would say the mappings were not issues with winnowmap (if SV mode is turned off), though I think it should still be possible to force more strict chaining.

rlorigro commented 3 years ago

To be more specific about the remaining non-chimeric artifacts I mentioned, here are 2 examples:

14617   59497   29  59436   +   h2tg000052l 87420292    61492027    61545226    18078   59881   60  tp:A:P  cm:i:1359   s1:i:17961  s2:i:255    rl:i:1688
14617   59497   38972   44839   +   h2tg000043l 49469036    33482998    33488884    1103    5906    2   tp:A:P  cm:i:83 s1:i:1097   s2:i:1085   rl:i:1688
14617   59497   36265   41733   -   h2tg000036l 139977256   27376259    27381830    75  5589    14  tp:A:P  cm:i:5  s1:i:50 s2:i:0  rl:i:1688
1181401 72122   105 72059   -   h2tg000052l 87420292    61507216    61573955    10694   73316   60  tp:A:P  cm:i:775    s1:i:10400  s2:i:40 rl:i:1116
1181401 72122   43191   48674   -   h2tg000052l 87420292    43644377    43649923    488 5563    0   tp:A:P  cm:i:33 s1:i:472    s2:i:472    rl:i:1116
1181401 72122   40607   46177   -   h2tg000077l 81620956    67987710    67993283    60  5590    13  tp:A:P  cm:i:4  s1:i:51 s2:i:0  rl:i:1116

They aren't overlapping by more than 50% in the reference (as specified by the -M parameter), but in the query coordinate space, some of these mappings are contained entirely inside other mappings. The final supplementary alignment of read 1181401 has a surprisingly high mapQ of 13 despite this.