lh3 / bwa

Burrows-Wheeler Aligner for short-read alignment (see minimap2 for long-read alignment)
GNU General Public License v3.0

Ambiguous reference bases #88

Open armintoepfer opened 8 years ago

armintoepfer commented 8 years ago

Would it be possible to handle ambiguous reference bases properly? That would increase the alignment quality for viral use cases.

Thank you.
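
To make the request concrete, here is a minimal sketch of what matching against ambiguous reference bases could mean, assuming the reference uses the standard IUPAC nucleotide codes. The function names `base_matches` and `count_mismatches` are hypothetical and are not part of bwa; this only illustrates the scoring idea, not how an FM-index-based aligner would implement it.

```python
# IUPAC nucleotide codes mapped to the set of bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def base_matches(ref_base, read_base):
    """Return True if read_base is one of the bases allowed by ref_base."""
    return read_base.upper() in IUPAC.get(ref_base.upper(), "")


def count_mismatches(ref, read, offset=0):
    """Count mismatches of an ungapped read placed at ref[offset:].

    A read base that is compatible with an ambiguous reference code
    is counted as a match rather than a mismatch.
    """
    window = ref[offset:offset + len(read)]
    if len(window) < len(read):
        raise ValueError("read extends past the end of the reference")
    return sum(not base_matches(r, q) for r, q in zip(window, read))


if __name__ == "__main__":
    # R stands for A or G, N for any base.
    print(count_mismatches("ACRTN", "ACGTA"))
    print(count_mismatches("ACRTN", "ACCTA"))
```

With this rule a read carrying either allele at an `R` site scores the same, which is the behaviour the viral use case is asking for.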

ekg commented 8 years ago

@armintoepfer are you familiar with pan-genomic alignment methods? Typically they model the reference as a kind of graph and allow you to align your reads to walks through the graph. They might use a kind of DAG (if a VCF file and reference FASTA file are used to model the population), or an assembly graph such as those produced by some of @lh3's other projects.

These systems attempt to solve exactly the kinds of problems that lead to the use of ambiguous reference bases. However, whereas ambiguous reference bases can only represent SNPs, a graph can admit any kind of variation into the reference model. For instance, a graph can encode ambiguity about indels, which is very difficult to do with a linear reference unless we implicitly encode such a graph.
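
As a rough illustration of the graph idea, the toy below encodes one SNP and one 3 bp indel as a small DAG and lists the walks that contain a read exactly. The node layout, the sequences, and the helper names `walks` and `paths_containing` are all made up for this example; real graph aligners index the graph rather than enumerating every walk.

```python
# Toy variation graph: node id -> sequence, and node id -> successor ids.
# Nodes 2 and 3 are the two alleles of a SNP (T or C).
# Node 5 is an inserted TTT that can be skipped via the edge 4 -> 6.
NODES = {1: "ACG", 2: "T", 3: "C", 4: "GA", 5: "TTT", 6: "CC"}
EDGES = {1: [2, 3], 2: [4], 3: [4], 4: [5, 6], 5: [6], 6: []}


def walks(nodes, edges, start, end):
    """Yield (path, sequence) for every walk from start to end in a DAG."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            yield path, "".join(nodes[n] for n in path)
            continue
        for nxt in edges[node]:
            stack.append((nxt, path + [nxt]))


def paths_containing(read, nodes=NODES, edges=EDGES, start=1, end=6):
    """Return the sorted list of walks whose spelled sequence contains read."""
    return sorted(p for p, seq in walks(nodes, edges, start, end) if read in seq)


if __name__ == "__main__":
    for path, seq in sorted(walks(NODES, EDGES, 1, 6)):
        print(path, seq)
    # A read carrying the C allele and the insertion.
    print(paths_containing("CGATTT"))
    # A read spanning the deletion; it is compatible with both SNP alleles.
    print(paths_containing("GACC"))
```

The second read has no exact placement on any single linear sequence that includes the insertion, which is the indel case that ambiguity codes cannot express.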

These methods can also handle another problem that occurs in viral resequencing against a linear reference: a graph reference system can directly represent a circular genome.
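
A small sketch of the circular case, assuming the genome is a single closed loop: indexing modulo the genome length plays the role of the graph edge from the last base back to the first. The function name `find_on_circle` and the sequences are hypothetical, and the brute-force scan is only for illustration.

```python
def find_on_circle(genome, read):
    """Return every start position where read matches the circular genome.

    Positions wrap around, so a read may begin near the end of the
    linearised sequence and continue at its beginning.
    """
    n = len(genome)
    if n == 0 or len(read) > n:
        return []
    hits = []
    for start in range(n):
        if all(genome[(start + i) % n] == read[i] for i in range(len(read))):
            hits.append(start)
    return hits


if __name__ == "__main__":
    genome = "ACGTTGCA"
    read = "CAAC"  # spans the arbitrary linearisation point
    print(genome.find(read))           # plain linear search does not find it
    print(find_on_circle(genome, read))
```

A linear reference has to pick a breakpoint somewhere on the circle, so reads crossing it end up split or clipped unless the aligner knows about the wrap-around.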