lh3 / miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
MIT License
299 stars 68 forks source link

miniasm seems to not work with minimap2 -x ava-ont PAF files of latest minimap2 version? #32

Closed RaverJay closed 6 years ago

RaverJay commented 6 years ago

Using a .paf file generated with the latest minimap2 version by: minimap2 -x ava-ont ../reads.fq ../reads.fq > hcov_only_ava.paf

and trying to use miniasm on that: miniasm -f ../reads.fq hcov_only_ava.paf > hcov_only_asm.gfa

yields an empty (0 bytes) file.

Using minimap2 -> miniasm has worked for me before (about 1 month ago).

miniasm log: [M::main] ===> Step 1: reading read mappings <=== [M::ma_hit_read::27.5911.00] read 22177213 hits; stored 9112072 hits and 11204 sequences (32198655 bp) [M::main] ===> Step 2: 1-pass (crude) read selection <=== [M::ma_hit_sub::29.8381.00] 11042 query sequences remain after sub [M::ma_hit_cut::30.0281.00] 9104542 hits remain after cut [M::ma_hit_flt::30.3111.00] 8958079 hits remain after filtering; crude coverage after filtering: 521.79 [M::main] ===> Step 3: 2-pass (fine) read selection <=== [M::ma_hit_sub::31.1741.00] 10995 query sequences remain after sub [M::ma_hit_cut::31.3561.00] 8955787 hits remain after cut [M::ma_hit_contained::31.723*1.00] 23 sequences and 184 hits remain after containment removal [M::main] ===> Step 4: graph cleaning <=== [M::ma_sg_gen] read 74 arcs [M::main] ===> Step 4.1: transitive reduction <=== [M::asg_arc_del_trans] transitively reduced 32 arcs [M::asg_arc_del_multi] removed 0 multi-arcs [M::asg_arc_del_asymm] removed 6 asymmetric arcs [M::main] ===> Step 4.2: initial tip cutting and bubble popping <=== [M::asg_cut_tip] cut 12 tips [M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips [M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <=== [M::asg_arc_del_short] removed 0 short overlaps [M::asg_arc_del_short] removed 0 short overlaps [M::asg_arc_del_short] removed 0 short overlaps [M::main] ===> Step 4.4: removing short internal sequences and bi-loops <=== [M::asg_cut_internal] cut 0 internal sequences [M::asg_cut_biloop] cut 0 small bi-loops [M::asg_cut_tip] cut 3 tips [M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips [M::main] ===> Step 4.5: aggressively cutting short overlaps <=== [M::asg_arc_del_short] removed 0 short overlaps [M::main] ===> Step 5: generating unitigs <=== [M::main] Version: 0.2-r168-dirty [M::main] CMD: miniasm -f ../reads.fq hcov_only_ava.paf [M::main] Real time: 32.243 sec; CPU: 32.240 sec

lh3 commented 6 years ago

Recent changes do affect read overlapping, but I don't remember big changes that may dramatically affect the results. Could you try a few historical versions of minimap2 from the release page and tell me the version that gives a better result? Thanks.

RaverJay commented 6 years ago

Hm, additional testing with all kinds of minimap / minimap2 versions yielded only empty .gfa files for this .fq file (real nanopore reads).

Testing on another .fq real reads sample yielded only one short contig (4.1kb - coronavirus genome is about 27kb)

Testing on aritificial long noisy reads with higher (and more evenly distributed) coverage yielded much better results.

So it seems it is not a bug / problem with miniasm, but rather the sample is unsuitable, I suspect the coverage is lacking. Still I would expect at least some contigs out of 17M overlaps between 72k reads?

lh3 commented 6 years ago

According to the miniasm log:

[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::31.1741.00] 10995 query sequences remain after sub
[M::ma_hit_cut::31.3561.00] 8955787 hits remain after cut
[M::ma_hit_contained::31.723*1.00] 23 sequences and 184 hits remain after containment removal

The overwhelming majority of reads [(10995-23)/10995] are contained in 23 very long reads. When there are only 23 reads in the assembly graph, miniasm might behave in some unexpected ways – miniasm is not designed for this case. Do you have some very long reads?

RaverJay commented 6 years ago

Yes that is exactly the case. Two of the reads are actually 26kb+ so they span almost the complete genome.

lh3 commented 6 years ago

You may try:

miniasm -e2 -n1

It might help.

RaverJay commented 6 years ago

Yeah that helps, this yields three contigs that roughly represent the complete genome.

Thanks!