lh3 / miniasm

Ultrafast de novo assembly for long noisy reads (though having no consensus step)
MIT License
299 stars 68 forks source link

assembling short references #5

Open ekg opened 8 years ago

ekg commented 8 years ago

I'm curious if miniasm works for the assembly of multiple high-quality sequences. For instance, the GRCh38 ALTs that are being used in the graph challenge in the ga4gh.

So, we tried to assemble some short genes in the MHC. I store some in the vg/test directory. For instance,

wget https://raw.githubusercontent.com/ekg/vg/master/test/GRCh38_alts/FASTA/HLA/A-3105.fa
minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
miniasm -f reads.fq reads.paf.gz > reads.gfa

reads.gfa is empty.

It looks like the mapping works as expected,

➜  x  minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa | gzip -1 > reads.paf.gz
[M::mm_idx_gen::0.007*0.59] collected minimizers
[M::mm_idx_gen::0.014*1.47] sorted minimizers
[M::main::0.014*1.47] loaded/built the index for 11 target sequence(s)
[M::main] max occurrences of a minimizer to consider: 16
[M::main] Version: r122
[M::main] CMD: minimap -Sw5 -L100 -m0 -t8 A-3105.fa A-3105.fa
[M::main] Real time: 0.038 sec; CPU: 0.064 sec

but the graph shrinks dramatically during "containment removal":

➜  x  miniasm -f reads.fq reads.paf.gz > reads.gfa
[M::main] ===> Step 1: reading read mappings <===
[M::ma_hit_read::0.000*0.00] read 186 hits; stored 211 hits and 11 sequences (147718 bp)
[M::main] ===> Step 2: 1-pass (crude) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_flt::0.000*0.00] 111 hits remain after filtering; crude coverage after filtering: 7.41
[M::main] ===> Step 3: 2-pass (fine) read selection <===
[M::ma_hit_sub::0.000*0.00] 11 query sequences remain after sub
[M::ma_hit_cut::0.000*0.00] 111 hits remain after cut
[M::ma_hit_contained::0.000*0.00] 2 sequences and 2 hits remain after containment removal
[M::main] ===> Step 4: graph cleaning <===
[M::ma_sg_gen] read 2 arcs
[M::main] ===> Step 4.1: transitive reduction <===
[M::asg_arc_del_trans] transitively reduced 0 arcs
[M::main] ===> Step 4.2: initial tip cutting and bubble popping <===
[M::asg_cut_tip] cut 1 tips
[M::asg_arc_del_multi] removed 0 multi-arcs
[M::asg_arc_del_asymm] removed 0 asymmetric arcs
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.3: cutting short overlaps (3 rounds in total) <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 4.4: removing short internal sequences and bi-loops <===
[M::asg_cut_internal] cut 0 internal sequences
[M::asg_cut_biloop] cut 0 small bi-loops
[M::asg_cut_tip] cut 0 tips
[M::asg_pop_bubble] popped 0 bubbles and trimmed 0 tips
[M::main] ===> Step 4.5: aggressively cutting short overlaps <===
[M::asg_arc_del_short] removed 0 short overlaps
[M::main] ===> Step 5: generating unitigs <===
[M::main] Version: r104
[M::main] CMD: miniasm -f reads.fq reads.paf.gz
[M::main] Real time: 0.001 sec; CPU: 0.000 sec

Out of curiosity, I poked around in the code to try to get a sense of the state of the assembly graph at the point where containment reduction happens, but I don't have a good enough sense of how it is working to know what I'm looking at.

Have you tried this kind of assembly with miniasm? Is it possible? If so how should miniasm be parameterized to get it to happen?

I think it would be very useful to get this going. In the abstract, it seems it should work. Tolerating high error rates between between reads is analogous to the same problem between homologous but divergent haplotypes.

lh3 commented 8 years ago

Containment has almost no information on the connective of the graph. Dropping it is a standard procedure. What do you expect from the assembly?

ekg commented 8 years ago

These are a few haplotypes across a gene in HLA. I would have expected them to form a single connected graph. This is what happens when I pairwise align them. I'll share a result in GFA to clarify. On Nov 28, 2015 4:38 PM, "Heng Li" notifications@github.com wrote:

Containment has almost no information on the connective of the graph. Dropping it is a standard procedure. What do you expect from the assembly?

— Reply to this email directly or view it on GitHub https://github.com/lh3/miniasm/issues/5#issuecomment-160312172.

lh3 commented 8 years ago

Overlap-based assembly looks at head-to-tail overlaps between reads. Contained reads are dropped. Internal matches (i.e. non overlapping matches) are ignored. If you have n haplotypes in the same region, the assembly graph will have n singleton contigs with no edges, because there are no head-to-tail overlaps.

ekg commented 8 years ago

A resolution would be to sample long overlapping reads from the input sequences, so as to ensure the head to tail overlap criteria. If I understand correctly, something else might need to be done to ensure there is not "chew back" at the head and tail of the assembly.

ekg commented 8 years ago

The assembly includes approximate overlaps and containments. We'd like to find the small variants in these, rather than assume equality. So I think we need to work from the PAF files.

gringer commented 7 years ago

If you don't want a random read to be picked for the assembly path, then it's probably not a good idea to use miniasm. Miniasm is great for scaffolding, but not good for finding variants because it makes no attempt to correct base-calling errors.

ekg commented 6 years ago

Coming back to this. In principle we can smooth the assembly graph using vg call. I'll be testing this.