Ambiguous reference bases

@armintoepfer, are you familiar with pan-genomic alignment methods? They typically model the reference as a graph and let you align your reads to walks through that graph. The graph might be a DAG (if a VCF file and a reference FASTA file are used to model the population) or an assembly graph, such as those produced by some of @lh3's other projects. Some examples:

HISAT2 is "a fast and sensitive alignment program for mapping next-generation sequencing reads (whole-genome, transcriptome, and exome sequencing data) to a population of human genomes (as well as to a single reference genome)."
gramtools is based around the vBWT, which lets the user encode a one-level bubble graph from an appropriately formatted VCF file. It implements genome inference with the vBWT and then realigns to the inferred genome using a standard aligner.
vg implements a full resequencing analysis process where the reference system may be any kind of sequence graph. It uses GCSA2 to build a MEM-based aligner on the graph. In addition to alignment, it implements in-graph variant calling and genotyping. (Disclosure: I work on this project.)

These systems attempt to solve exactly the kinds of problems that lead people to use ambiguous reference bases. However, where ambiguous reference bases can only represent SNPs, these systems allow all manner of variation into the reference model. For instance, a graph can encode ambiguity about indels, which is very difficult to do with a linear reference unless we implicitly encode such a graph.
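To make the SNP-versus-indel point concrete, here is a minimal Python sketch. It is not how HISAT2, gramtools, or vg work internally; the function names (`expand_iupac`, `walks`) and the toy sequences are made up for illustration. It expands a reference containing an IUPAC ambiguity code, which can only yield sequences of one length, and then spells out the walks through a small hand-built DAG that contains both a SNP bubble and a deletion.

```python
from itertools import product

# Subset of the standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}

def expand_iupac(seq):
    """All concrete sequences encoded by a reference with ambiguity codes."""
    return sorted("".join(p) for p in product(*(IUPAC[b] for b in seq)))

def walks(nodes, edges, source, sink):
    """Spell every source-to-sink walk through a DAG of sequence nodes."""
    if source == sink:
        return [nodes[source]]
    out = []
    for nxt in edges.get(source, []):
        out.extend(nodes[source] + tail for tail in walks(nodes, edges, nxt, sink))
    return out

# Ambiguous base: R means A or G. Every encoded haplotype has the same length.
snp_only = expand_iupac("ACRTT")

# Toy graph (hypothetical): 1 -> (2 | 3) -> 4 is an A/G SNP bubble,
# and the edge 4 -> 6 skips node 5, which models a deletion of "TT".
nodes = {1: "AC", 2: "A", 3: "G", 4: "T", 5: "TT", 6: "GA"}
edges = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [6]}
haplotypes = sorted(walks(nodes, edges, 1, 6))

print(snp_only)    # two sequences, both of length 5
print(haplotypes)  # four sequences, of two different lengths
```

The extra edge that skips a node is all it takes to represent the deletion in the graph, whereas no single-character code in a linear reference can change the length of the sequence.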

These methods can also handle another problem that occurs in viral resequencing against a linear reference: a graph reference system can directly represent a circular genome.
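As a rough illustration of the circular case, the Python sketch below builds a toy graph with one node per base and an edge from the last base back to the first. The genome, the read, and the helper names (`circular_graph`, `match_starts`) are all hypothetical, and the brute-force exact matching stands in for the indexed search a real graph aligner would use. A read that spans the arbitrary linearisation point is missed by a plain substring search but is found as a walk in the graph.

```python
def circular_graph(genome):
    """One node per base; node i points to node (i + 1) mod n, closing the circle."""
    n = len(genome)
    return {i: [(i + 1) % n] for i in range(n)}

def match_starts(genome, edges, read):
    """Start nodes from which `read` is spelled exactly by a walk in the graph."""
    hits = []
    for start in edges:
        node, ok = start, True
        for base in read:
            if genome[node] != base:
                ok = False
                break
            node = edges[node][0]  # each node has a single successor in this toy graph
        if ok:
            hits.append(start)
    return hits

genome = "ACGTTGCA"  # made-up circular genome, linearised at an arbitrary origin
read = "CAAC"        # spans the origin: ...GCA | ACG...

linear_hit = genome.find(read)  # -1: the linear reference cannot place the read
graph_hits = match_starts(genome, circular_graph(genome), read)

print(linear_hit)
print(graph_hits)
```

With a linear reference the usual workaround is to pad or duplicate the sequence around the origin, which then needs coordinate fix-ups afterwards; the back edge avoids that.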

lh3 / bwa

Ambiguous reference bases #88