marbl / Winnowmap

Long read / genome alignment software
Other
235 stars 22 forks source link

Mapping simulated T2T reads #11

Closed lh3 closed 2 years ago

lh3 commented 3 years ago

TL;DR: on simulated reads, winnowmap seems to produce more mapping errors than minimap2.

I simulated reads with pbsim2:

src/pbsim --depth 1 --hmm_model data/R94.model --length-mean 20000 --accuracy-mean 0.95 --length-min 5000 CHM13v1.fa
paftools.js pbsim2fq CHM13v1.fa.fai *.maf | pigz -p8 > pbsim-CHM13-R94.fa.gz

This supposedly simulates reads similar to nanopore of the current generation. The second command line converts the pbsim2 output to the FASTA format needed by paftools.js mapeval. I then mapped these reads to CHM13v1 with the following mappers/settings:

winnowmap -W CHM13-rep-k15.txt -t4 -cxmap-ont CHM13v1.fa pbsim-CHM13-R94.fa.gz > pbsim-wm-ont.paf
winnowmap -W CHM13-rep-k15.txt -t4 -cxmap-pb  CHM13v1.fa pbsim-CHM13-R94.fa.gz > pbsim-wm-pb.paf
minimap2 -t4 -cxmap-ont CHM13v1.fa pbsim-CHM13-R94.fa.gz > pbsim-mm-ont.paf

Here are the mapeval output for pbsim-wm-ont.paf:

Q       60      149867  32      0.000213523     149867
...
Q       5       49      4       0.000536197     151064
...
Q       1       838     55      0.000947051     152051
Q       0       203     98      0.001589449     152254

for pbsim-wm-pb.paf:

Q       60      149929  89      0.000593614     149929
...
Q       5       32      4       0.001474439     151244
...
Q       1       735     93      0.002215968     152078
Q       0       176     83      0.002758548     152254

and for pbsim-mm-ont.paf:

Q       60      149294  0       0.000000000     149294
Q       4       472     1       0.000013274     150675
Q       2       266     1       0.000019875     150941
Q       1       1165    36      0.000256400     152106
Q       0       216     107     0.000958496     152322

Notably, minimap2 maps slightly more reads and has higher accuracy at the Q60, Q5, Q1 and Q0 mapping quality thresholds.

I guess the large minimizer window size used by winnowmap is affecting its accuracy.

lh3 commented 3 years ago

I have also tried winnowmap2 and minimap2 on real CHM13 data. I mapped the first 1 million nanopore reads in rel7 to CHM13, counted read start positions in each 100kb window (only the best scoring hit for each read), and collected outlier windows. On this test, winnowmap2 and minimap2 are about the same in terms of the number of outlier windows, though their outliers are sometimes different. I have done a similar experiment with one run of HiFi reads. The result is similar: the number of outlier windows are comparable between winnowmap2 and minimap2.

yipukangda commented 3 years ago

@lh3 Hi, I have also found that the mapping sensitivity of winnowmap2 seems lower than minimap2, as this issue.

lh3 commented 3 years ago

@yipukangda I guess winnowmap is optimized for recent long reads which have decent accuracy. If you use older nanopore chemistry or older base callers, the high error rate may affect the mapping sensitivity.

cjain7 commented 3 years ago

@lh3 Thanks for sharing these results; I will look into it and check whether higher minimizer window size is the culprit. You're right that minimap2 does slightly better using map-ont mode in this particular simulation experiment.

BTW, the map-pb mode was made specific to hifi reads in Winnowmap; it has a separate map-pb-clr mode.

lh3 commented 3 years ago

Correction: I just noticed that winnowmap also outputs secondary alignment just at a much lower frequency. I have updated the results. Now minimap2 and winnowmap racon results are about the same, though htsbox analysis still favors minimap2 a little.

Here is a comparison for CHM13 HiFi data. I mapped HiFi reads to T2T v1.0 with

minimap2 -ax asm20 --secondary=no -t16 CHM13v1.fa reads.fq > mm.sam
winnowmap -ax map-pb --secondary=no -W CHM13-rep-k15.txt -t16 CHM13v1.fa reads.fq > wm.sam

Then I ran racon to generate consensus and mapped the consensus to T2T to call "variants":

racon -t16 reads.fq mm.sam CHM13v1.fa > wm.racon.fa
minimap2 -t 24 -cxasm5 -K1.9g --cs -r2k CHM13v1.fa wm.racon.fa > wm.racon.asm5.paf
sort -k6,6 -k8,8n wm.racon.asm5.paf | paftools.js call - > wm.racon.asm5.var 2> wm.racon.asm5.vst

Here is the .vst for minimap2:

3041564123 reference bases covered by exactly one contig
3453 substitutions; ts/tv = 0.798
1588 1bp deletions
25392 1bp insertions
461 2bp deletions
3463 2bp insertions
659 [3,50) deletions
1368 [3,50) insertions
29 [50,1000) deletions
38 [50,1000) insertions
0 >=1000 deletions
1 >=1000 insertions

and for winnowmap:

3041521256 reference bases covered by exactly one contig
3329 substitutions; ts/tv = 0.775
1659 1bp deletions
25475 1bp insertions
454 2bp deletions
3446 2bp insertions
626 [3,50) deletions
1347 [3,50) insertions
30 [50,1000) deletions
46 [50,1000) insertions
0 >=1000 deletions
1 >=1000 insertions

If we assume the T2T reference is perfect, minimap2 is no worse than winnowmap.

The process above involves racon and assembly-to-reference alignment. In case these two steps going wrong, I tried a simpler version of variant calling:

samtools sort -o wm.bam -@4 -m4g wm.sam
htsbox pileup -vcf CHM13v1.fa -s5 -q5 -Q30 wm.bam > wm.vcf  # baseQ>=30, mapQ>=5, at least 5 supporting reads

In this analysis, minimap2 also reports fewer "SNPs" and fewer overall "variants", and reports fewer "SNPs" with coverage >=70.

cjain7 commented 3 years ago

Thanks! These observations are useful to know.

The simulation settings we chose in in Winnowmap2 preprint were different from the above, we wanted to showcase better mappability in presence of non-reference alleles. For this, we had added SVs (using SURVIVOR) in reference before beginning the mapping process. More details are available in Winnowmap2 preprint. Precise commands and software versions used are available in its supplement.

lh3 commented 3 years ago

Thanks for the response. What I did is like sanity check. It doesn't address SVs. On your preprint, I think a basic assumption is that PSVs are fixed in the human population. When PSVs are polymorphic, you will have an event like below:

-TTCG-TTCG- (ancestral)
-TTCG-TTXG- (sample)
-ATCG-TTXG- (reference)

Here the SV X was introduced first and then the first A came into the population. There is not an SV between "sample" and "reference". For reads coming from the first copy in "sample", winnowmap2 will place them onto the second copy on the reference (if I am right). Note that the first A is polymorphic. The SURVIVOR simulation is not considering this case. It generates data that perfectly fit your assumption.

For recent duplications, polymorphic PSVs may happen often with a simple mutation process. For a bit ancient duplications, there can be non-allelic gene conversions that copy sequences between duplications, leading to polymorphic PSVs. I don't know how prevalent such events are, though. What matters is the accuracy on real data.

If the goal is SV calling at a higher sensitivity, a simpler strategy is to recompute alignment scores with a better concave gap cost and rank alignments with the new score. If you count an INDEL as one edit regardless of its length, you are more likely to correctly map a read involving an SV. Smith-Waterman may lead to a wrong alignment mostly because its affine-gap cost doesn't properly model the mutation process of SVs.

No matter how hard we try, there will always be ambiguity in read mapping within segupds. The ultimate solution is assembly such that we can put SVs in the context of long unique sequences.

lh3 commented 2 years ago

I am closing this thread as it is outdated. I have a new preprint on recent improvements in minimap2 if you are interested.