Better normalization for insertions

glennhickey commented 4 years ago

Normalization as done by vg mod -n / -U is known to be limited. When @jmonlong uses it on a graph created with several similar insertions at the same location, it makes some attempt to collapse matching sequence shared between the insertions into the same node. But many cases, in particular matches following a mismatch, are left unhandled. Each duplicated sequence left in the graph makes the snarl that much harder to genotype.

The general normalization problem as tackled by mod is very difficult. But limiting it to insertions during construction seems to be an easier problem: the insertion sequences can be aligned together, a graph constructed from the alignment, and that graph hooked in instead of a single insertion sequence. Is this something feasible to add to vg construct or mod?
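To illustrate the idea of aligning insertion sequences so that shared stretches can become shared nodes, here is a toy sketch. It is not how vg construct or vg mod work: it uses Python's `difflib.SequenceMatcher` as a stand-in for a real aligner, and the function name `shared_blocks`, the `min_len` parameter, and the two example sequences are made up for illustration.

```python
from difflib import SequenceMatcher


def shared_blocks(a, b, min_len=4):
    """Return (start_in_a, start_in_b, length) for matching runs of at least min_len bases.

    Each returned run is a stretch that a graph built from the alignment
    could represent as one shared node instead of two duplicated ones.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    return [
        (m.a, m.b, m.size)
        for m in matcher.get_matching_blocks()
        if m.size >= min_len
    ]


# Two hypothetical insertion alleles that differ by a single base at index 8.
ins1 = "GGCCGGGCGCGGTGGCTCA"
ins2 = "GGCCGGGCACGGTGGCTCA"

for start_a, start_b, length in shared_blocks(ins1, ins2):
    print(start_a, start_b, ins1[start_a:start_a + length])
```

The second block reported here is the match that follows the mismatch, which is the kind of shared sequence described above as being left duplicated.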

ekg commented 4 years ago

On the call yesterday, I suggested this workflow to generate a normalized SV graph from VCF files:

1. VCF -> alleles + configurable reference context (say 10kb on each side)
2. map these allele sequences to the reference and to each other
3. take the sequences of the reference and PAF from (2) and input them to seqwish to generate a graph
4. sort/order the graph with odgi, emit a GFA and work from there
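The first step of this workflow (turning a VCF record into an allele sequence with reference context) could look roughly like the sketch below. It assumes the reference contig is already loaded as a Python string and that the record uses the usual 1-based VCF `POS`; the function name `allele_with_context` and its parameters are hypothetical, and the mapping, seqwish and odgi steps are not shown.

```python
def allele_with_context(reference, pos, ref, alt, flank):
    """Build reference and alternate haplotypes around one VCF record.

    reference: contig sequence as a string (assumed already in memory)
    pos:       1-based VCF POS of the record
    ref, alt:  REF and ALT allele strings
    flank:     number of reference bases to keep on each side

    Returns (window_start, ref_haplotype, alt_haplotype), where
    window_start is the 0-based offset of the window in the contig.
    """
    start = pos - 1
    end = start + len(ref)
    if reference[start:end] != ref:
        raise ValueError("REF allele does not match the reference at POS")
    # Clip the window at the contig ends.
    left = max(0, start - flank)
    right = min(len(reference), end + flank)
    ref_hap = reference[left:right]
    alt_hap = reference[left:start] + alt + reference[end:right]
    return left, ref_hap, alt_hap


# Tiny made-up example: a 3 bp insertion after position 5, with 3 bp flanks.
window_start, ref_hap, alt_hap = allele_with_context(
    "ACGTACGTACGT", pos=5, ref="A", alt="ATTT", flank=3
)
print(window_start, ref_hap, alt_hap)
```

Keeping `window_start` alongside each haplotype is one way to make sure the later alignments are restricted to the correct positions.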

If care is taken to only align the sequences to the correct positions and neighboring alleles, then the graph will be overall linear, and look pretty much like the input VCF except that nearby variants with strong homology between them will be merged in the graph. This will happen even if there are small differences between the SV alleles. Those will presumably occur naturally anyway, so we need to be able to deal with them.

cgroza commented 4 years ago

I have a lot of Alu insertions with a small number of SNVs nested within their sequence, and I have been trying to collapse them for two weeks now. Maybe vg augment offers a way to avoid redundancy by iteratively aligning a new sequence to the graph and then augmenting the graph? Since I only have ~4000 sequences, I was thinking of progressively re-indexing the subgraph of the locus, aligning and augmenting the graph each time even if it is expensive.

ekg commented 4 years ago

If you can generate approximate local haplotypes including each of the alleles then it should be possible to merge them with pggb.

You'll need to make a sequence for each of your Alu insertions that includes a few kbp (e.g. 2 kbp) of flanking sequence on each side, taken from the reference. Then you can combine these and the local reference genome sequence in one FASTA and build the graph with pggb. The key thing is that the flanking sequence is long enough to uniquely map the sequence containing your alternate allele onto the reference genome.
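A quick sanity check on whether a chosen flank length is long enough might look like the sketch below. It is only a crude proxy: it counts exact occurrences of each flank on the forward strand of a reference string, whereas a real aligner also tolerates mismatches and considers the reverse complement. The function names and the toy reference are hypothetical.

```python
def count_occurrences(reference, query):
    """Count exact, possibly overlapping, occurrences of query in reference."""
    count = 0
    i = reference.find(query)
    while i != -1:
        count += 1
        i = reference.find(query, i + 1)
    return count


def flanks_are_unique(reference, left_flank, right_flank):
    """True if each flank occurs exactly once in the reference (forward strand only)."""
    return (
        count_occurrences(reference, left_flank) == 1
        and count_occurrences(reference, right_flank) == 1
    )


# Made-up reference in which "GACC" is repeated.
toy_reference = "ACGTTTGACCATGACC"
print(flanks_are_unique(toy_reference, "ACGTT", "GACC"))   # repeated right flank
print(flanks_are_unique(toy_reference, "ACGTT", "CCATG"))  # both flanks occur once
```

If a flank fails this check, lengthening it until it becomes unique is the obvious next thing to try.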


cgroza commented 4 years ago

I am going to attempt that. I would also like to include smaller variants. Would it work to patch the entire reference genome with small variants and Alu insertions to create a personalized reference for each individual and then have these personalized references aligned to the reference in the manner you describe using pggb to produce the genome graph?
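For concreteness, "patching" a reference with an individual's variants could be sketched as below. This is an illustration only, assuming the sequence is held as a string and the variants are non-overlapping `(pos, ref, alt)` tuples with 1-based positions; `apply_variants` is a hypothetical helper, and in practice an existing tool such as `bcftools consensus` would be the more usual way to do this from a VCF.

```python
def apply_variants(reference, variants):
    """Return a copy of reference with all variants applied.

    variants: iterable of (pos, ref, alt) tuples, pos being 1-based.
    Variants must not overlap; they are applied in coordinate order.
    """
    pieces = []
    cursor = 0  # 0-based position in the reference consumed so far
    for pos, ref, alt in sorted(variants):
        start = pos - 1
        if start < cursor:
            raise ValueError("overlapping variants are not supported")
        if reference[start:start + len(ref)] != ref:
            raise ValueError("REF allele does not match the reference")
        pieces.append(reference[cursor:start])  # unchanged sequence before the variant
        pieces.append(alt)                      # the alternate allele itself
        cursor = start + len(ref)
    pieces.append(reference[cursor:])           # unchanged tail
    return "".join(pieces)


# Made-up example: one SNV and one 3 bp insertion.
personalized = apply_variants(
    "ACGTACGTAC", [(3, "G", "T"), (6, "C", "CAAA")]
)
print(personalized)
```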

Thanks for the advice! Cristian Groza

vgteam / vg

Better normalization for insertions #2590