lh3 / minimap2

A versatile pairwise aligner for genomic and spliced nucleotide sequences
https://lh3.github.io/minimap2

Map non noisy ONT #1127

Closed Axze-rgb closed 7 months ago

Axze-rgb commented 11 months ago

Hello,

I have a question: according to Oxford Nanopore, their latest flow cells produce very accurate reads. Is "map-ont" still the best setting for mapping those reads? I am asking because the manual still refers to "long noisy reads". Thanks for minimap2 and for your time.

Axze-rgb commented 11 months ago

Sorry, I hadn't seen issue https://github.com/lh3/minimap2/issues/1030#issue-1622249145

So, do I understand correctly that Dorado is now accounted for in the map-ont settings?

Thanks for all the work you are doing. Alex

iiSeymour commented 11 months ago

@Axze-rgb dorado aligner has not yet changed any of the index settings, and when we do, we would like them upstreamed here.

lh3 commented 11 months ago

For now, use map-ont. You can try -x map-hifi -w10 (HiFi scoring and k-mer length with more seeds) for Q20 reads but you need to have a way to evaluate whether that gives better results.

I hope I can find some time in the next several months to improve minimap2 a little bit. Along this I will be testing alternative scoring for v14 data.
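The comparison suggested above could be set up along these lines. This is only a minimal sketch: the file names `ref.fa` and `reads.fq.gz` and the helper `minimap2_cmd` are hypothetical, and the script only prints the two command lines rather than running them. How to judge which output is better (for example, downstream variant-calling accuracy) is left open, as in the comment above.

```python
import shlex

def minimap2_cmd(preset, ref, reads, extra=()):
    """Build an argv list for `minimap2 -a` (SAM output) with a preset and extra options."""
    # The preset comes first so that the extra options visibly refine it.
    return ["minimap2", "-a", "-x", preset, *extra, ref, reads]

# Hypothetical input paths; replace with real files.
REF, READS = "ref.fa", "reads.fq.gz"

candidates = {
    "map-ont": minimap2_cmd("map-ont", REF, READS),
    "map-hifi-w10": minimap2_cmd("map-hifi", REF, READS, extra=["-w10"]),
}

for name, argv in candidates.items():
    # Print a shell-ready line; each setting writes to its own SAM file.
    print(f"{shlex.join(argv)} > {name}.sam")
```

Running both commands on the same reads gives two alignment files that can then be fed to whatever evaluation is available.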

lh3 commented 11 months ago

@iiSeymour When you find more appropriate parameters for aligning Q20 reads, I will be happy to add a new preset for that. This will also save me some time. Thanks!

Checunmily commented 8 months ago

> For now, use map-ont. You can try -x map-hifi -w10 (HiFi scoring and k-mer length with more seeds) for Q20 reads but you need to have a way to evaluate whether that gives better results.
>
> I hope I can find some time in the next several months to improve minimap2 a little bit. Along this I will be testing alternative scoring for v14 data.

Hello, I have recently been working with some R10 data. Are there any plans to improve minimap2 for ONT R10 data in the next few months, or any new suggestions for R10 data?

iiSeymour commented 7 months ago

@lh3 from our internal benchmarking we find speed and downstream accuracy are maximized with -x map-ont -k19 -w 19 -U50,500 -g10k.

Mon3trK commented 7 months ago

> For now, use map-ont. You can try -x map-hifi -w10 (HiFi scoring and k-mer length with more seeds) for Q20 reads but you need to have a way to evaluate whether that gives better results.
>
> I hope I can find some time in the next several months to improve minimap2 a little bit. Along this I will be testing alternative scoring for v14 data.

Hi @lh3, the accuracy of ONT sequencing has advanced a lot with duplex reads and the R10.4 pore. Is there any plan to provide different presets for R9 and R10 nanopore data? Different basecallers also have a significant impact on sequencing accuracy, so it seems inappropriate to lump everything together under -x map-ont.

lh3 commented 7 months ago

> from our internal benchmarking we find speed and downstream accuracy are maximized with -x map-ont -k19 -w 19 -U50,500 -g10k.

-x map-hifi is equivalent to -x map-ont -k19 -w 19 -U50,500 -g10k -A1 -B4 -O6,26 -E2,1 -s200. The main difference here is the scoring. How does scoring affect the downstream tools? If the map-hifi scoring also works, I can add an alias to map-hifi, something like lr:hq.
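To make the difference between the two settings explicit, the option strings quoted in this thread can be compared directly. This is only an illustrative sketch: the lists are copied from the comments above (with `-w 19` written as `-w19`), and the helper `extra_options` is a hypothetical name, not part of minimap2.

```python
# Options that, per the comment above, turn -x map-ont into -x map-hifi.
MAP_HIFI_OVER_MAP_ONT = ["-k19", "-w19", "-U50,500", "-g10k",
                         "-A1", "-B4", "-O6,26", "-E2,1", "-s200"]
# Options from the benchmarked setting reported earlier in the thread.
BENCHMARKED = ["-k19", "-w19", "-U50,500", "-g10k"]

def extra_options(full, base):
    """Return the options in `full` that are absent from `base`, keeping their order."""
    base_set = set(base)
    return [opt for opt in full if opt not in base_set]

only_in_map_hifi = extra_options(MAP_HIFI_OVER_MAP_ONT, BENCHMARKED)
print(" ".join(only_in_map_hifi))  # -A1 -B4 -O6,26 -E2,1 -s200
```

The options left over are the ones that distinguish map-hifi from the benchmarked setting, which is the part the question about downstream tools concerns.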

> also different basecallers have significant impact on sequencing accuracy

That is why it is more appropriate to choose a conservative setting that can give you good results on input of varying quality.

iiSeymour commented 7 months ago

> If the map-hifi scoring also works

Unfortunately not, the map-hifi scoring leads to both fewer mapped reads (~3%) and small regressions in SNP/INDEL calling. It's possible these regressions could be recovered from new models trained on updated scoring parameters but it seems -x map-ont -k19 -w 19 -U50,500 -g10k is the sweet spot.

lh3 commented 7 months ago

The next release will have a lr:hq preset for -k19 -w 19 -U50,500 -g10k.

bepoli commented 7 months ago

Thanks @lh3! I understand that the new preset lr:hq is not meant for spliced alignment. Should I use the existing preset splice:hq with highly accurate Nanopore cDNA reads (average quality >= 20)?

lh3 commented 7 months ago

Yes

lh3 commented 7 months ago

I will hijack the thread and ask a question here: are there public Q20 cDNA-seq data? Perhaps because the SQK-PCS114 kit is still at the early-access stage, most cDNA reads in papers were produced with R9 or older kits.

dolittle007 commented 7 months ago

Hi @lh3, I have PacBio HiFi Iso-Seq data. Should I use the existing preset splice:hq along with the new preset lr:hq, or can I just use -k19 -w 19 -U50,500 -g10k -xsplice -C5 -O6,24 -B4? Thanks a lot.

jelber2 commented 7 months ago

> The next release will have a lr:hq preset for -k19 -w 19 -U50,500 -g10k.

Shouldn't it be -x map-ont -k19 -w 19 -U50,500 -g10k, according to @iiSeymour?

FatYuanBao commented 6 months ago

@iiSeymour I noticed that the latest Minimap2-2.27 (r1193) includes the new lr:hq preset. I ran a small benchmark comparing this new preset with the old map-ont preset on a human R10.4.1 dataset basecalled with dorado 0.4.1 in HAC mode.

For -x map-ont:

19072496 + 0 mapped (99.93% : N/A)
12791592 + 0 primary mapped (99.90% : N/A)

For -x lr:hq:

18636130 + 0 mapped (99.79% : N/A)
12765068 + 0 primary mapped (99.69% : N/A)

It appears that there are fewer mapped reads (~0.14%) with the new lr:hq preset. Considering the relatively high coverage (>50X) of this data, this difference could be significant.
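For reference, the gap between the two runs can be worked out directly from the figures quoted above. The following is a minimal sketch: the regular expression and the helper `parse_flagstat` are illustrative names, written against the line format shown in this comment rather than any official parser. The counts are alignment records as reported in those lines, so the "mapped" line is not a count of distinct reads.

```python
import re

# Matches lines such as "19072496 + 0 mapped (99.93% : N/A)".
FLAGSTAT_LINE = re.compile(
    r"(\d+) \+ (\d+) (primary mapped|mapped) \((\d+(?:\.\d+)?)%"
)

def parse_flagstat(text):
    """Return {label: (qc_passed_count, percent)} for the mapped lines in `text`."""
    stats = {}
    for match in FLAGSTAT_LINE.finditer(text):
        passed, _failed, label, pct = match.groups()
        stats[label] = (int(passed), float(pct))
    return stats

# Figures copied from the comment above.
map_ont = parse_flagstat(
    "19072496 + 0 mapped (99.93% : N/A) 12791592 + 0 primary mapped (99.90% : N/A)"
)
lr_hq = parse_flagstat(
    "18636130 + 0 mapped (99.79% : N/A) 12765068 + 0 primary mapped (99.69% : N/A)"
)

for label in ("mapped", "primary mapped"):
    d_count = map_ont[label][0] - lr_hq[label][0]
    d_pct = map_ont[label][1] - lr_hq[label][1]
    print(f"{label}: {d_count} fewer records, "
          f"{d_pct:.2f} percentage points lower with lr:hq")
```

On these numbers the primary-mapped gap (26524 records) is much smaller than the gap in the overall "mapped" line (436366 records).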

lh3 commented 6 months ago

Read count-based metrics are often misleading. The difference mostly comes from short and low-quality reads, which, on the contrary, may interfere with analyses. PS: also, not all reads are supposed to get mapped to a reference genome.

dolittle007 commented 6 months ago

> The next release will have a lr:hq preset for -k19 -w 19 -U50,500 -g10k.
>
> Shouldn't it be -x map-ont -k19 -w 19 -U50,500 -g10k ? According to @iiSeymour

Thanks a lot. @jelber2 splice:hq works for RNA and lr:hq works for DNA.

preset lr:hq => -x map-ont -k19 -w 19 -U50,500 -g10k
preset splice:hq => -x splice -C5 -O6,24 -B4
preset splice => -x map-ont -k15 -w5 --splice -g2k -G200k -U10,1000000 -A1 -B2 -O2,32 -E1,0 -b0 -C9 -z200 -ub --junc-bonus=9 --cap-sw-mem=0 --splice-flank=yes

So parameters from lr:hq and splice:hq will cause conflicts.

camillaugolini-iit commented 2 weeks ago

Hello,

@lh3 and @iiSeymour, as far as I understand, splice:hq is the best option for R10 Nanopore cDNA reads. Would it also be optimal for the new RNA004? In other words, which setting would you use to optimally align reads from the new RNA pore to a genomic and a transcriptomic reference?

Thank you for your time

camillaugolini-iit commented 2 weeks ago

Also, if a --junc-bed file is provided, would it conflict with the splice:hq options?

dolittle007 commented 2 weeks ago

@camillaugolini-iit Using the --junc-bed option, minimap2 prioritizes splicing events based on the provided annotations. It will not cause any conflict with splice:hq options.
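As a rough illustration of combining the two, the command line could be assembled as below. The file names `genome.fa`, `cdna_reads.fq.gz` and `annotation.bed` and the helper `spliced_cmd` are hypothetical; the sketch only prints the command and does not run minimap2.

```python
import shlex

def spliced_cmd(ref, reads, junc_bed=None, preset="splice:hq"):
    """Build an argv list for spliced alignment, optionally with an annotation BED."""
    argv = ["minimap2", "-a", "-x", preset]
    if junc_bed is not None:
        # Known junctions are passed through the --junc-bed option.
        argv += ["--junc-bed", junc_bed]
    return argv + [ref, reads]

# Hypothetical input paths; replace with real files.
cmd = spliced_cmd("genome.fa", "cdna_reads.fq.gz", junc_bed="annotation.bed")
print(shlex.join(cmd))
```

Leaving `junc_bed` unset yields the plain splice:hq command, which makes it easy to align the same reads with and without annotation and compare the results.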