matsengrp / TemplatedMutagenesis-1

A data structure to contain mutations #4

Closed matsen closed 5 years ago

matsen commented 7 years ago

I think a data frame is a great fit for this.

Columns:

This can be in "melted" format, where the first two columns are repeated as often as needed. Is this a problem?
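To make the melted layout concrete, here is a minimal sketch in pandas. The column names and values are hypothetical, since the column list isn't spelled out here; the sketch just assumes the first two columns identify the sequence and its germline allele, with one row per mutation.

```python
import pandas as pd

# Hypothetical column names and toy values, for illustration only.
mutations = pd.DataFrame(
    {
        "sequence_id": ["seq1", "seq1", "seq1", "seq2"],
        "germline_allele": ["allele_A", "allele_A", "allele_A", "allele_B"],
        "position": [12, 57, 101, 33],
        "germline_base": ["A", "C", "G", "T"],
        "mutated_base": ["G", "T", "A", "C"],
    }
)

# One row per mutation: the first two columns repeat for every
# mutation that belongs to the same sequence.
mutations_per_sequence = mutations.groupby("sequence_id").size()
print(mutations_per_sequence)
```

The repetition costs some space, but it keeps per-sequence operations down to a single `groupby`.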

A really big data set is 1 million sequences. Each of these sequences may have 20 mutations in V, so that's about 20 million rows in melted format. This already seems reasonable for manipulation with pandas or R.
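A rough back-of-the-envelope check of that size, as a sketch. The number of columns and the 8-byte width per value are assumptions for the sake of the estimate, not something decided here.

```python
n_sequences = 1_000_000
mutations_per_sequence = 20

# One row per mutation in the melted layout.
n_rows = n_sequences * mutations_per_sequence

# Assumptions: five columns, each stored as an 8-byte integer.
n_columns = 5
bytes_per_value = 8

total_bytes = n_rows * n_columns * bytes_per_value
total_gb = total_bytes / 1e9
print(f"{n_rows:,} rows, about {total_gb:.1f} GB")
```

Under those assumptions the whole table fits comfortably in memory on a typical workstation; string columns would cost more.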

I think we are mostly interested in V. If so, then we won't have 1 million unique V portions of our reads.
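If many reads share the same V portion, one option is to collapse duplicates and keep a count. A minimal sketch with made-up toy sequences:

```python
from collections import Counter

# Toy V portions, invented for illustration only.
v_portions = ["ACGTAC", "ACGTAC", "ACGTTC", "ACGTAC", "ACGTTC", "GGGTAC"]

# Map each distinct V portion to the number of reads that carry it.
unique_counts = Counter(v_portions)

print(len(v_portions), "reads,", len(unique_counts), "unique V portions")
```

Mutations would then only need to be stored once per unique V portion, with the count carried alongside.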

We can index the germline alleles with integers rather than strings to save space.
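A sketch of the integer indexing using `pd.factorize`, which returns integer codes plus the lookup table of unique values. The allele names are placeholders.

```python
import pandas as pd

# Placeholder allele names, for illustration only.
alleles = pd.Series(["allele_A", "allele_B", "allele_A", "allele_C", "allele_B"])

# codes: one integer per row; uniques: the lookup table of allele names.
codes, uniques = pd.factorize(alleles)

# Recover the original strings by indexing the lookup table with the codes.
decoded = uniques[codes]

print(codes)
print(list(uniques))
```

Converting the column with `alleles.astype("category")` does much the same thing while keeping the strings visible in the data frame.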