nanoporetech / dorado

Oxford Nanopore's Basecaller
https://nanoporetech.com/

Request for Guidance on Using Minimap2 for All-vs-All Read Alignment in Dorado Correct #1058

Closed sebeier closed 1 day ago

sebeier commented 1 month ago

First, I’d like to express my appreciation for this tool! I am currently using dorado (0.8.0) and have run into a problem in the first step of dorado correct, the all-vs-all read alignment. Its memory requirements appear to exceed what my computational environment can provide (our lab uses an SGE cluster whose nodes have at most 2 TB of memory each).

To work around this, I am considering computing the all-vs-all alignment using minimap2 as an alternative. However, I am uncertain about the exact parameters or settings that the inference part of dorado correct expects when consuming the resulting PAF file.

Could you clarify the following points:

  1. What specific settings or parameters should be passed to minimap2 to ensure compatibility with the inference part of dorado correct when generating the PAF file for all-vs-all alignment?

    Specifically: which preset or sequence type should I use (I assume -x ava-ont)? Which scoring settings? Are any other alignment-specific options needed (-c for the CIGAR string, primary alignments only)?

  2. Are there any post-processing steps required on the PAF file before feeding it into Dorado Correct?

I would greatly appreciate any guidance or recommendations on how to approach this, as I’d like to ensure the alignment data I provide remains compatible with dorado correct's downstream steps.

Thank you for your support and for maintaining this great tool!

diego-rt commented 1 month ago

I also would appreciate knowing how to run the alignments outside of dorado.

I tried aligning reads 'manually' using minimap2 with the HERRO-recommended parameters, but this leads to dorado correct failing:

minimap2 -K8g -cx ava-ont -k25 -w17 -e200 -r150 -m2500 -z200 -f 0.005 -t 16 --dual=yes ont_reads.fastq ont_reads.fastq > aln.paf

dorado correct -t 4 -v ont_reads.fastq --from-paf aln.paf > corrected_reads.fasta
INFO:    Using cached SIF image
[2024-10-02 16:59:42.393] [info] Running: "correct" "-t" "4" "-v" "ont_reads.fastq" "--from-paf" "aln.paf"
[2024-10-02 16:59:42.606] [debug] Aligner threads 4, corrector threads 4, writer threads 1
[2024-10-02 16:59:42.619] [info]  - downloading herro-v1 with httplib
[2024-10-02 16:59:42.993] [debug] furthest_skip_header = '', furthest_skip_id = -1
[2024-10-02 16:59:43.233] [debug] Usable memory for dev cuda:0: 15.2 GB
[2024-10-02 16:59:43.233] [info] Using batch size 12 on device cuda:0 in inference thread 0.
[2024-10-02 16:59:43.233] [debug] Usable memory for dev cuda:0: 15.2 GB
[2024-10-02 16:59:43.233] [info] Using batch size 12 on device cuda:0 in inference thread 1.
[2024-10-02 16:59:43.235] [debug] Starting process thread for cuda:0!
[2024-10-02 16:59:43.235] [debug] Starting process thread for cuda:0!
[2024-10-02 16:59:43.235] [debug] Looking for idx ont_reads.fastq.fai
[2024-10-02 16:59:43.238] [debug] Starting decode thread!
[2024-10-02 16:59:43.239] [debug] Starting decode thread!
[2024-10-02 16:59:43.240] [debug] Starting decode thread!
[2024-10-02 16:59:43.240] [debug] Starting decode thread!
[2024-10-02 16:59:43.241] [info] Starting
[2024-10-02 16:59:43.447] [debug] Loading model on cuda:0...
[2024-10-02 16:59:43.447] [debug] Loading model on cuda:0...
terminate called recursively
terminate called recursively
terminate called recursively

HalfPhoton commented 3 weeks ago

@sebeier @diego-rt The minimap2 presets used by dorado correct are set in CorrectionMapperNode.cpp#L279-L291:

    auto options = alignment::create_preset_options("ava-ont");
    auto& index_options = options.index_options->get();
    index_options.k = 25;
    index_options.w = 17;
    index_options.batch_size = index_size;
    auto& mapping_options = options.mapping_options->get();
    mapping_options.bw = 150;
    mapping_options.bw_long = 2000;
    mapping_options.min_chain_score = 4000;
    mapping_options.zdrop = 200;
    mapping_options.zdrop_inv = 200;
    mapping_options.occ_dist = 200;
    mapping_options.flag |= MM_F_EQX;

    // --cs short
    alignment::mm2::apply_cs_option(options, "short");

    // --dual yes
    alignment::mm2::apply_dual_option(options, "yes");

A critical setting is mapping_options.flag |= MM_F_EQX, which corresponds to --eqx on the minimap2 command line:

--eqx | Output =/X CIGAR operators for sequence match/mismatch.

This must be set for dorado correct to function.
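
For anyone reproducing this outside dorado, the preset above can be read as a minimap2 command line. The following is only a sketch: the field-to-flag mapping (bw/bw_long to -r, min_chain_score to -m, zdrop/zdrop_inv to -z, occ_dist to -e) is an assumption based on the minimap2 manual and is not confirmed in this thread, and the helper name mm2_correct_cmd is hypothetical. index_options.batch_size is left out because the value of index_size is not shown here. Compared with the command posted earlier in the thread, this one adds --eqx and --cs=short, raises -m from 2500 to 4000, adds a long-join bandwidth of 2000, and leaves -f at its default.

```shell
#!/bin/sh
# Sketch: print a minimap2 command line mirroring the dorado correct preset.
# Assumed mapping of preset fields to CLI flags (not confirmed by the maintainers):
#   k=25 -> -k25                   w=17 -> -w17
#   bw=150, bw_long=2000 -> -r150,2000
#   min_chain_score=4000 -> -m4000
#   zdrop=200, zdrop_inv=200 -> -z200,200
#   occ_dist=200 -> -e200
#   MM_F_EQX -> --eqx              cs short -> --cs=short     dual yes -> --dual=yes
# -c is added explicitly so that CIGAR strings are written to the PAF.
mm2_correct_cmd() {
    reads=$1      # reads file, used as both target and query (all-vs-all)
    threads=$2    # number of minimap2 threads
    printf 'minimap2 -c -x ava-ont -k25 -w17 -r150,2000 -m4000 -z200,200 -e200 --eqx --cs=short --dual=yes -t %s %s %s\n' \
        "$threads" "$reads" "$reads"
}

# Print the command for inspection only; nothing is executed here.
mm2_correct_cmd ont_reads.fastq 16
```

Once the printed command looks right for your data, run it with its output redirected to a PAF file and pass that file to dorado correct via --from-paf, as in the earlier comment.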

diego-rt commented 3 weeks ago

Ah fantastic! Thanks a lot @HalfPhoton !

Just a couple of last questions. Is there any relevant filtering or sorting of the resulting PAF file? From my tests it seems to usually be sorted by target read name and start coordinate (columns 6 and 8). Is there anything else?

Also, have you experimented yet with reducing the -f value, which filters out the most frequent minimizers? Happy to hear about any experience you might already have with this.

Thanks a ton!

HalfPhoton commented 3 weeks ago

@diego-rt, yes, there is a filter for self-alignments at CorrectionMapperNode.cpp#L42. Otherwise, everything should be handled by dorado.

To the best of my knowledge we haven't explored the -f flag (it's set at the default value for now). This would be interesting though and we will likely look into this in future. If you do your own exploration please let us know how you get on.
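
Since dorado applies the self-alignment filter itself, an externally produced PAF should not need it. For inspecting a PAF by hand, though, the idea can be sketched with awk. This assumes that a "self-alignment" means a record whose query name (PAF column 1) equals its target name (column 6); the exact condition dorado uses is not shown in this thread, and the function name drop_self_alignments is hypothetical.

```shell
#!/bin/sh
# Sketch: drop PAF records in which a read is aligned to itself.
# Assumption: self-alignment == query name (col 1) equals target name (col 6).
drop_self_alignments() {
    # Reads PAF on stdin, writes the remaining records to stdout.
    awk -F '\t' '$1 != $6'
}
```

For example, `drop_self_alignments < aln.paf | wc -l` compared with `wc -l < aln.paf` shows how many self-alignments the file contains.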

diego-rt commented 3 weeks ago

Perfect, thanks a lot! I'll be testing that flag in the next few days.

svc-jstone commented 3 weeks ago

Just to mention: although it is possible to use a third-party mapper for correction, we strongly recommend using the dorado correct mapper to ensure compatibility with the consensus stage (especially in case the internal parameterization or the algorithm changes in the future).
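
If an external mapper is used anyway, one cheap sanity check follows from the --eqx requirement stated earlier in the thread: every record should carry a CIGAR that uses = and X rather than M. The sketch below counts records that would violate this. It assumes the CIGAR is stored in the standard cg:Z: tag that minimap2 writes with -c; the function name check_paf_eqx is hypothetical, and passing this check does not guarantee that dorado correct will accept the file.

```shell
#!/bin/sh
# Sketch: summarise whether a PAF file looks like it was produced with --eqx.
# Counts records with no cg:Z: tag and records whose CIGAR still contains 'M'.
check_paf_eqx() {
    awk -F '\t' '
        {
            cigar = ""
            # Optional tags start at column 13 in PAF.
            for (i = 13; i <= NF; i++)
                if ($i ~ /^cg:Z:/) cigar = substr($i, 6)
            if (cigar == "") missing++
            else if (cigar ~ /M/) with_m++
        }
        END { printf "records=%d no_cigar=%d cigar_with_M=%d\n", NR, missing, with_m }
    '
}
```

Run it as `check_paf_eqx < aln.paf`; non-zero no_cigar or cigar_with_M counts suggest the alignment was made without -c or without --eqx.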

sebeier commented 1 day ago

Thanks, this worked perfectly!