germ_allele: the germline allele assigned to the read by partis or a similar annotation tool
germ_match_boundary: the position at which the germline match ends
position: the position of the mutation
new_state: the state observed at that position in place of the germline state
This can be stored in "melted" (long) format, with one row per mutation, so the first two columns are repeated for every mutation in a sequence. Is this repetition a problem?
A really big data set is 1 million sequences, and each of these sequences may have 20 mutations in V. That is about 20 million rows, which already seems reasonable for manipulation with pandas or R.
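As a rough check on that claim, here is a back-of-envelope estimate in Python. The sequence and mutation counts come from the note above; the 8 bytes per value is purely an assumption (every column, including new_state, stored as a 64-bit integer), so treat the result as an order of magnitude only.

```python
# Back-of-envelope size of the melted table.
n_sequences = 1_000_000   # "really big" data set, from the note
mutations_per_seq = 20    # mutations in V per sequence, from the note
n_columns = 4             # germ_allele, germ_match_boundary, position, new_state
bytes_per_value = 8       # ASSUMPTION: every value stored as a 64-bit integer

n_rows = n_sequences * mutations_per_seq
total_bytes = n_rows * n_columns * bytes_per_value

print(f"{n_rows:,} rows, about {total_bytes / 1e6:.0f} MB")
```

Under these assumptions the table is well under a gigabyte, and smaller integer types would shrink it further.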
I think we are mostly interested in V. If so, many reads will share the same V portion, so we won't have 1 million unique V portions to store.
We can index the germline alleles with integers rather than strings to save space.
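A minimal sketch of the melted layout with integer-indexed alleles, using only the Python standard library. The input structure `reads`, the allele names, and all numeric values are illustrative assumptions, not real data or the output format of partis.

```python
# HYPOTHETICAL per-read annotations: (germ_allele, germ_match_boundary, mutations),
# where mutations is a list of (position, new_state) pairs.
reads = [
    ("IGHV1-2*02", 290, [(12, "A"), (57, "G")]),
    ("IGHV3-23*01", 288, [(101, "T")]),
    ("IGHV1-2*02", 290, [(12, "A")]),
]

# Give each distinct allele name a small integer, in order of first appearance.
allele_index = {}
for allele, _, _ in reads:
    allele_index.setdefault(allele, len(allele_index))

# Lookup table to recover the name from the integer (dicts keep insertion order).
allele_names = list(allele_index)

# Melt: one row per mutation, repeating the allele index and match boundary.
rows = [
    (allele_index[allele], boundary, position, new_state)
    for allele, boundary, mutations in reads
    for position, new_state in mutations
]

for row in rows:
    print(row)
```

The allele strings are stored once in `allele_names`, while each row carries only a small integer. A list of tuples like `rows` can be handed to `pandas.DataFrame` together with the four column names.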
Overall, I think a data frame is a good fit for this, with the four columns listed above.