Open rlorigro opened 3 years ago
I should also add that this is a repetitive region, and that I have been using winnowmap to accommodate for that.
What is the winnowmap2 output?
The example I posted was using winnowmap already. In that case, I aligned HG002 reads to CHM13 v1.0, but in future runs I aligned HG002 reads to the HiFiasm HG002 assembly (~190x CCS coverage from here) and observed a similar problem, in other reads. More detailed information:
Command:
meryl count k=15 output merylDB /home/ryan/data/hifi/human/hg002/HG002.mat.fa
meryl print greater-than distinct=0.9998 merylDB > repetitive_k15.txt
winnowmap \
-W /home/ryan/data/hifi/human/hg002/repetitive_k15.txt \
-x map-ont \
-t 7 \
--secondary=no \
/home/ryan/data/hifi/human/hg002/HG002.mat.fa \
/home/ryan/data/nanopore/human/test/overlap_guppy_360_tangle_subset_1/HG002-Guppy-3.6.0-UL-run3-subset1.fasta \
> HG002-Guppy-3.6.0-UL-run3-subset1_VS_HG002_mat_primary.paf
stderr:
[M::mm_idx_gen::0.035*0.04] reading downweighted kmers
[M::mm_idx_gen::0.065*0.24] collected downweighted kmers, no. of kmers read=66775
[M::mm_idx_gen::0.065*0.24] saved the kmers in a bloom filter: hash functions=2 and size=960072
[M::mm_idx_gen::98.885*1.08] collected minimizers
[M::mm_idx_gen::100.291*1.16] sorted minimizers
[M::main::100.291*1.16] loaded/built the index for 546 target sequence(s)
[M::main::100.291*1.16] running winnowmap in SV-aware mode
[M::main::100.291*1.16] stage1-specific parameters minP:1000, incP:4.00, maxP:16000, sample:1000, min-qlen:10000, min-qcov:0.5, min-mapq:5, mid-occ:5000
[M::main::100.291*1.16] stage2-specific parameters s2_bw:2000, s2_zdropinv:25
[M::mm_idx_stat] kmer size: 15; skip: 50; is_hpc: 0; #seq: 546
[M::mm_idx_stat::100.479*1.16] distinct minimizers: 23589668 (41.44% are singletons); average occurrences: 5.194; average spacing: 25.013
[M::worker_pipeline::111.965*1.78] mapped 2273 sequences
[M::main] Version: 2.0, pthreads=7, omp_threads=3
[M::main] CMD: winnowmap -W /home/ryan/data/hifi/human/hg002/repetitive_k15.txt -x map-ont -t 7 --secondary=no /home/ryan/data/hifi/human/hg002/HG002.mat.fa /home/ryan/data/nanopore/human/test/overlap_guppy_360_tangle_subset_1/HG002-Guppy-3.6.0-UL-run3-subset1.fasta
[M::main] Real time: 112.009 sec; CPU: 199.570 sec; Peak RSS: 3.884 GB
And some weird alignments I observed:
Jumping between contigs (not bridging contig termini):
736840 41280 2057 39942 + h2tg000102l 36105583 464829 502652 7336 38234 60 tp:A:P cm:i:541 s1:i:7250 s2:i:802 rl:i:0
736840 41280 75 968 + h2tg000005l 111658231 629986 630886 327 910 60 tp:A:P cm:i:27 s1:i:324 s2:i:0 rl:i:0
Reversing strand on the same contig:
207028 74088 33102 73900 + h2tg000102l 36105583 3378 44916 7102 41787 60 tp:A:P cm:i:510 s1:i:6921 s2:i:0 rl:i:0
207028 74088 70 36902 - h2tg000102l 36105583 43512 80920 5749 37648 60 tp:A:P cm:i:415 s1:i:5587 s2:i:194 rl:i:0
Jumping intra + inter contig
2027405 47862 34 37932 - h2tg000005l 111658231 106444 145003 6510 38787 60 tp:A:P cm:i:481 s1:i:6327 s2:i:968 rl:i:0
2027405 47862 39262 39765 - h2tg000005l 111658231 712065 712577 90 519 56 tp:A:P cm:i:6 s1:i:85 s2:i:0 rl:i:0
2027405 47862 41040 47808 - h2tg000102l 36105583 146693 153675 614 7023 60 tp:A:P cm:i:43 s1:i:561 s2:i:0 rl:i:0
There could still be SVs because the reads are not binned, and this is the maternal haplotype from HiFiasm, but in general I would just like to forbid chaining with large gaps and favor extending the primary alignment.
If you can give me just these reads, I can have a look. Nonetheless, I suspect those are the best you can get.
I should emphasize that the reference I am aligning to is the same individual that the reads are derived from, so I expect there to be no SVs.
Assembly can be wrong. Reads can be chimeric both locally and globally.
OK, here are the 3 reads from above: https://rlorigro-public-files.s3-us-west-1.amazonaws.com/reads/guppy_360/misaligned_reads_guppy_3.6.0_HG002.fasta
There are many more examples like these so I would be surprised if they are all chimeras. These reads are also dumped from Shasta, which is usually done after some filtering for chimerism. I think the issue is that there are repeats in this region, and the primary/supplementary segments are being split amongst the repeats.
Also I noticed that winnowmap is running with "sv-aware mode", but I couldn't find more info about this. Does this affect chaining and supplementaries? Maybe a question for @cjain7
I guess --sv-aware implements the algorithm in the winnowmap2 preprint.
Ok i will take a closer look at the paper. I just tried rerunning with --sv-off
and it appears to have reduced these rearrangements.
With sv-off
I do still get some strange ones though, unfortunately:
2497450 60236 146 29307 - h2tg000005l 111658231 848222 878809 3055 30728 46 tp:A:P cm:i:219 s1:i:2764 s2:i:2331 rl:i:636
2497450 60236 12923 60193 - h2tg000102l 36105583 312140 361719 8282 49837 60 tp:A:P cm:i:616 s1:i:7919 s2:i:543 rl:i:636
After looking some more at the paper, I am wondering if disabling sv-aware mode will defeat the purpose of using winnowmap over minimap.
@rlorigro wfmash -m -N -p 90
will generate contiguous approximate mappings of nanopore reads. You can remove -m
to get a base-level alignment obtained by global pairwise alignment of each mapping result. wfmash
lacks mapping quality, but it could provide you a baseline here. We're just beginning tests on non-assembly inputs, so please let me know if you try it and it works for you.
Thanks @ekg, this is interesting. As it turns out, I've been digging into these issues more and have found that @lh3 was right about many of them being chimeric. A majority of the chimeras were palindromes, where they reverse direction about halfway through the sequence, and a smaller portion of them were conventional chimeras, where a chunk of the read appears to have been a concatenation of totally unrelated sequence. For the palindromes, it seems that nanopore has issues with pulling in the complementary strand directly after the first strand finishes translocating, and the signal segmenter doesn't catch the transition, so both strands of the signal are sent to the basecaller as a single read.
There are occasionally still some unexplained artifacts, where supplementaries are almost entirely contained inside other alignments, but fortunately the mapQ score for these is very low or 0 in the cases I have observed so far.
So overall I would say the mappings were not issues with winnowmap (if SV mode is turned off), though I think it should still be possible to force more strict chaining.
To be more specific about the remaining non-chimeric artifacts I mentioned, here are 2 examples:
14617 59497 29 59436 + h2tg000052l 87420292 61492027 61545226 18078 59881 60 tp:A:P cm:i:1359 s1:i:17961 s2:i:255 rl:i:1688
14617 59497 38972 44839 + h2tg000043l 49469036 33482998 33488884 1103 5906 2 tp:A:P cm:i:83 s1:i:1097 s2:i:1085 rl:i:1688
14617 59497 36265 41733 - h2tg000036l 139977256 27376259 27381830 75 5589 14 tp:A:P cm:i:5 s1:i:50 s2:i:0 rl:i:1688
1181401 72122 105 72059 - h2tg000052l 87420292 61507216 61573955 10694 73316 60 tp:A:P cm:i:775 s1:i:10400 s2:i:40 rl:i:1116
1181401 72122 43191 48674 - h2tg000052l 87420292 43644377 43649923 488 5563 0 tp:A:P cm:i:33 s1:i:472 s2:i:472 rl:i:1116
1181401 72122 40607 46177 - h2tg000077l 81620956 67987710 67993283 60 5590 13 tp:A:P cm:i:4 s1:i:51 s2:i:0 rl:i:1116
They aren't overlapping by more than 50% in the reference (as specified by the -M parameter), but in the query coordinate space, some of these mappings are contained entirely inside other mappings. The final supplementary alignment of read 1181401
has a surprisingly high mapQ of 13 despite this.
Hi,
I am using minimap2 to map long ONT reads (up to 400Kbp), and I am finding that the supplementary chaining is producing large rearrangements in the read sequence, flipping strands, and jumping across regions. I have tried using the
-m
parameter to limit chaining, and several other parameters, but none of them seem to force the primary alignment to be linear and unidirectional.I should emphasize that the reference I am aligning to is the same individual that the reads are derived from, so I expect there to be no SVs. Here is one example:
Is there a parameter (or combination of parameters) that would help with this? I think what is needed is to allow the primary alignment to extend further.
Thanks