Reference-based Alignment

averagehat commented 9 years ago

At scipy Evan and I talked about how a reference-based alignment might be represented. A related topic to think about is how one might represent de novo assemblies (e.g. contigs). SAM is what we most commonly work with reference-based alignments (their last update increased support for de novo representation). SAM spec is here: https://samtools.github.io/hts-specs/SAMv1.pdf. Lots of metadata stored. Also commonly used is samtools pileup, which outputs the stacked reads directly without the extra information. Some people are using the pysam library with .bam files. We like controlling our samtools binary version and so we parse out views/pileups

We most commonly use pileup/.sam to extract a consensus sequence (FASTA file) along with variant population information (VCF file) using simple statistics and quality information.

As per our discussion query sequence wants to know:

The start and end position it aligns to
where and how it differs from the reference (as a sparse matrix?) in addition to the meta- and quality data associated with a sequence.

ebolyen commented 9 years ago

To add to this discussion, what kinds of operations would we want on this object?

Like you said, majority consensus would be very useful. What about things like SNP frequencies for positions? How might we represent multiple insertions at one position?

What does indexing look like? Perhaps: refalign.positions[50] for the column at 50 of the reference and refalign.positions[50, ...] to include any insertions that may have occurred between [50, 51) but then what is returned in this case? And how do we deal with an insertion before position 0?

averagehat commented 9 years ago

I will keep up with this discussion more diligently.

Majority Consensus: yes, but in practice this is more complex than picking the most common base. I think it would be a great contribution if scikit-bio could provide a "view" of an alignment to make consensus/variant calling easier (esp. make it easy to integrate metadata). When I fetch a column of an alignment, I'd like to be able to fetch the metadata for the reads within that column alongside.

We do like statistics about a column and the overall alignment; i.e. #reads mapped, avg mapping quality, etc. We might like to find windows or positions that stand out as having very low coverage, quality, etc. We make plots of quality/depth over the position of the genome.

Things we consider in Consensus/variant calling: We look at stuff like overall number of bases, mapping/base quality of majority/minority base, paired/unpaired; we also call ambiguous if the majority is not significant enough.

Suggested "Views" of a reference-based alignment:

Bases (along with quality/mapping quality) along a column, for calling variants.
Bases (etc.) within a window (ie N columns), also for calling variants.
Sequences in the alignment given a filter (ie all reads above x mapping/avg base quality, reads where its pairs map, etc.)

I would expect refalign[50] to work as you suggest. We don't work a lot with insertions, so I'm not sure what the behaviour should be. I will say that most algorithms for variant calling that I have seen are written on top of something like bamtools (or other libraries for reading bams). As an aside, BAM files produced by an alignment will include reads that did not actually map at all.

One example use case that just occurred to me would be finding primer-induced errors. Primer-induced errors can cause artificial SNP counts; but the SNPs will always occur near the ends of reads (this is how you know it's caused by the primer). So you might want to say, "give me all SNPs which occur with frequency >= 5%", then ask, "For each SNP, where in the read does the SNP occur?" (for another application, you might ask for the base quality at that base, or metadata about the sequence).

scikit-bio / scikit-bio

Reference-based Alignment #1014