A data structure to contain mutations

I think a data frame is a great fit for this.

Columns:

germ_allele: the germline allele that the read is matched to, using partis or a similar tool
germ_match_boundary: the end of the germline match
position: where the mutation is
new_state: what the new state is

This can be in "melted" (long) format, with one row per mutation, so the first two columns are repeated for every mutation in the same sequence. Is this a problem?
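As a rough sketch of what the melted layout could look like in pandas, using the four columns listed above: the allele names, boundaries, positions, and the use of nucleotide letters for new_state are all made-up illustration values, not real data.

```python
import pandas as pd

# Hypothetical example rows: one row per mutation.
# The first two columns repeat for every mutation from the same germline match.
mutations = pd.DataFrame(
    {
        "germ_allele": ["IGHV1-2*02", "IGHV1-2*02", "IGHV3-23*01"],
        "germ_match_boundary": [294, 294, 291],
        "position": [57, 102, 88],
        "new_state": ["A", "T", "G"],  # assumed here to be nucleotide letters
    }
)

# Count mutations for each (allele, boundary) pair.
per_match = mutations.groupby(["germ_allele", "germ_match_boundary"]).size()
print(per_match)
```

The repetition costs some space, but it keeps every mutation addressable with ordinary filtering and group-by operations.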

A really big data set is 1 million sequences, and each of these sequences may have 20 mutations in V. That is about 20 million rows, which already seems reasonable for manipulation with pandas or R.
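A back-of-the-envelope size estimate for that case, as a sketch only: the per-column byte widths below are assumptions (small integer types, with alleles and states stored as integer codes rather than strings), not something measured.

```python
# Assumed per-column widths in bytes (hypothetical dtype choices).
BYTES_PER_COLUMN = {
    "germ_allele": 2,          # integer allele code, e.g. int16
    "germ_match_boundary": 2,  # e.g. int16
    "position": 2,             # e.g. int16
    "new_state": 1,            # one small integer code per state, e.g. int8
}

n_sequences = 1_000_000
mutations_per_sequence = 20

n_rows = n_sequences * mutations_per_sequence
bytes_per_row = sum(BYTES_PER_COLUMN.values())
total_megabytes = n_rows * bytes_per_row / 1e6

print(f"{n_rows:,} rows, about {total_megabytes:.0f} MB")
```

Under these assumptions the whole table is on the order of a hundred-odd megabytes, which fits in memory on an ordinary laptop. Storing alleles as strings would make it noticeably larger.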

I think we are mostly interested in V. If so, many reads will share the same V portion, so we won't have 1 million unique V portions of our reads.

We can index the germline alleles with integers rather than strings to save space.
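One possible way to do the integer indexing in pandas is `pd.factorize`, which returns an integer code per row plus a lookup table of the unique values. This is a minimal sketch with made-up allele names; a categorical column would be another reasonable option.

```python
import pandas as pd

# Hypothetical allele names, one per mutation row.
alleles = pd.Series(["IGHV1-2*02", "IGHV1-2*02", "IGHV3-23*01", "IGHV1-2*02"])

# codes: integer index for each row; lookup: unique allele names in order of first appearance.
codes, lookup = pd.factorize(alleles)

# Recover the original string for any row from its code.
third_row_allele = lookup[codes[2]]
print(list(codes), list(lookup), third_row_allele)
```

The lookup table only needs to be stored once, so the per-row cost drops from a string to a small integer.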
