Open ekg opened 8 years ago
Containment has almost no information on the connectivity of the graph. Dropping it is a standard procedure. What do you expect from the assembly?
These are a few haplotypes across a gene in HLA. I would have expected them to form a single connected graph. This is what happens when I pairwise align them. I'll share a result in GFA to clarify.
Overlap-based assembly looks at head-to-tail overlaps between reads. Contained reads are dropped. Internal matches (i.e. non overlapping matches) are ignored. If you have n haplotypes in the same region, the assembly graph will have n singleton contigs with no edges, because there are no head-to-tail overlaps.
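To make the distinction concrete, here is a minimal sketch of how a single mapping might be sorted into those categories. It is loosely modeled on the overlap classification described for miniasm, not copied from its code. The `Hit` type, the `classify` function, the threshold values, and the same-strand simplification are all assumptions made for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Coordinates of one same-strand mapping (hypothetical helper type)."""
    qlen: int
    qstart: int
    qend: int
    tlen: int
    tstart: int
    tend: int


def classify(h: Hit, max_overhang: int = 1000, int_frac: float = 0.8) -> str:
    """Label a mapping as internal match, containment, or dovetail overlap.

    Thresholds are illustrative assumptions; reverse-strand hits would need
    their target coordinates flipped first, which this sketch skips.
    """
    # Unaligned sequence hanging off both ends of the match.
    overhang = min(h.qstart, h.tstart) + min(h.qlen - h.qend, h.tlen - h.tend)
    maplen = max(h.qend - h.qstart, h.tend - h.tstart)
    if overhang > min(max_overhang, maplen * int_frac):
        return "internal"           # ignored by the assembler
    if h.qstart <= h.tstart and h.qlen - h.qend <= h.tlen - h.tend:
        return "query_contained"    # query read is dropped
    if h.qstart >= h.tstart and h.qlen - h.qend >= h.tlen - h.tend:
        return "target_contained"   # target read is dropped
    return "dovetail"               # head-to-tail overlap, becomes an edge


# Two full-length haplotypes aligned end to end: one is "contained" in the other.
print(classify(Hit(5000, 0, 5000, 5000, 0, 5000)))
# A suffix of one read matching a prefix of another: a usable overlap.
print(classify(Hit(5000, 3000, 5000, 5000, 0, 2000)))
```

Under these assumptions, haplotypes that align along their whole length never fall through to the last branch, which is why no edges survive.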
A resolution would be to sample long overlapping reads from the input sequences so as to satisfy the head-to-tail overlap criterion. If I understand correctly, something else might need to be done to ensure there is no "chew back" at the head and tail of the assembly.
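A rough sketch of that sampling idea: cut each input sequence into fixed-length windows that overlap their neighbours, so that consecutive windows share a head-to-tail overlap. The function name `tile`, the window length, and the step size are hypothetical choices for illustration, not anything miniasm provides.

```python
from typing import List, Tuple


def tile(name: str, seq: str, read_len: int = 2000, step: int = 500) -> List[Tuple[str, str]]:
    """Cut seq into overlapping pseudo-reads named 'name:start-end'.

    Consecutive windows overlap by read_len - step bases. A final window is
    anchored at the end of the sequence so the tail is always covered.
    """
    last = max(len(seq) - read_len, 0)
    starts = list(range(0, last + 1, step))
    if starts[-1] != last:
        starts.append(last)  # make sure the last bases are sampled
    reads = []
    for s in starts:
        e = min(s + read_len, len(seq))
        reads.append((f"{name}:{s}-{e}", seq[s:e]))
    return reads


# Example with a toy 5.2 kb sequence.
pseudo_reads = tile("hapA", "ACGT" * 1300)
print(len(pseudo_reads), pseudo_reads[0][0], pseudo_reads[-1][0])
```

Anchoring the first and last windows at the sequence ends covers the ends in the input, but it does not by itself prevent trimming at the head and tail during assembly.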
The assembly includes approximate overlaps and containments. We'd like to find the small variants in these, rather than assume equality. So I think we need to work from the PAF files.
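As a starting point for working from the PAF directly, here is a small sketch that reads the 12 mandatory PAF columns and computes a crude identity estimate from the match count and block length. The `PafRecord` type and the helper names are made up for this example. For approximate mappings without base-level alignment, the ratio is only a rough proxy for divergence.

```python
from typing import NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (optional tags are ignored here)."""
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    nmatch: int
    alen: int
    mapq: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-separated PAF line into a PafRecord."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


def approx_identity(rec: PafRecord) -> float:
    """Matches divided by alignment block length; 0.0 if the block is empty."""
    return rec.nmatch / rec.alen if rec.alen else 0.0


# Made-up example line, not real output from any run.
example = "hapA\t5000\t0\t5000\t+\thapB\t5010\t0\t5010\t4900\t5010\t255"
rec = parse_paf_line(example)
print(rec.qname, rec.tname, round(approx_identity(rec), 4))
```

Finding the actual small variants would still need a base-level alignment of each overlapping pair. The PAF coordinates only say where to look.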
If you don't want a random read to be picked for the assembly path, then it's probably not a good idea to use miniasm. Miniasm is great for scaffolding, but not good for finding variants because it makes no attempt to correct base-calling errors.
Coming back to this. In principle we can smooth the assembly graph using vg call. I'll be testing this.
I'm curious whether miniasm works for the assembly of multiple high-quality sequences, for instance the GRCh38 ALTs that are being used in the GA4GH graph challenge.
So, we tried to assemble some short genes in the MHC. I store some in the vg/test directory. For instance,
reads.gfa is empty. It looks like the mapping works as expected,
but the graph shrinks dramatically during "containment removal":
Out of curiosity, I poked around in the code to try to get a sense of the state of the assembly graph at the point where containment reduction happens, but I don't have a good enough sense of how it is working to know what I'm looking at.
Have you tried this kind of assembly with miniasm? Is it possible? If so, how should miniasm be parameterized to get it to happen?
I think it would be very useful to get this going. In the abstract, it seems it should work. Tolerating high error rates between reads is analogous to the same problem between homologous but divergent haplotypes.