Open huddlej opened 2 years ago
@huddlej were you thinking that the clustering would happen inside Java? If we were to do that we would probably want to bring in some library for that.
An alternative, as you say, would be to output all of the phenotypic information and then do clustering using scikit-learn in Python. Perhaps that's at least the right first step?
@matsen That's a good point to clarify! It looks like Trevor originally applied clustering in a Mathematica notebook, so I think a scikit-learn approach would be a perfect first step for this issue. The Mathematica notebook could provide some direction about which data frames from antigen Trevor used for that clustering analysis.
Description
Building on the work in issue #22, output the number of cases per day, deme, and variant to support models like @marlinfiggins's Rt frequency dynamics models.
Example output looks like:
See recent variant counts for the USA for a complete example.
Possible solution
For SARS-CoV-2, "variants" are already well defined as phylogenetic lineages of interest. The closest analog in antigen would be a specific phenotype or a cluster of phenotypes in antigenic space. In @trvrb's original paper, he clustered phenotypes in 2D space as shown below in the bottom right panel:
To support this output, we may need to implement similar clustering logic that will group phenotypes into consistent lineages through time. Alternatively, we could output cases per specific phenotype (potentially generating hundreds of different "variants").
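As a rough illustration of the clustering idea, here is a minimal, dependency-free k-means sketch in Python that groups 2D phenotype coordinates into "variants". It is not the method from the original paper or the Mathematica notebook; the function name `cluster_phenotypes`, the farthest-first initialization, and the example coordinates are all assumptions made for illustration. It also clusters a single snapshot only, so keeping labels consistent through time is left open. In practice, something like scikit-learn's `KMeans` would likely replace this.

```python
import math


def cluster_phenotypes(points, k, max_iterations=100):
    """Group 2D phenotype coordinates into k clusters with Lloyd's k-means.

    `points` is a list of (x, y) tuples (hypothetical antigenic coordinates).
    Returns (labels, centers), where labels[i] is the cluster of points[i].
    """
    # Deterministic farthest-first initialization: start from the first
    # point, then repeatedly add the point farthest from all chosen centers.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )

    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, new_labels) if label == c]
            if members:
                centers[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
        # Stop once assignments no longer change.
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Made-up phenotypes forming two well-separated groups in antigenic space.
phenotypes = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
variant_labels, variant_centers = cluster_phenotypes(phenotypes, k=2)
print(variant_labels)  # [0, 0, 0, 1, 1, 1]
```

The number of clusters `k` has to be chosen up front here, which may not suit a simulation where new antigenic clusters keep emerging; a density-based method could be a better fit, but that is a design question for this issue rather than something the sketch settles.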
We might implement this output as part of the same "case counts" output mentioned in #22 or as a separate file. We might also consider whether we want to parameterize how these variants are sampled to recreate the sampling bias present in real data, where not all cases can be sequenced.
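To make the counting and sampling idea concrete, here is a small Python sketch that tallies cases per day, deme, and variant, and optionally keeps each case with a fixed probability to mimic incomplete sequencing. Everything named here is hypothetical: the `(day, deme, variant)` record layout, the `sampling_fraction` parameter, and the `day`/`deme`/`variant`/`cases` column names are assumptions, not antigen's actual output format.

```python
import random
from collections import Counter


def count_variant_cases(cases, sampling_fraction=1.0, seed=0):
    """Count cases per (day, deme, variant).

    `cases` is an iterable of (day, deme, variant) tuples, one per case.
    Each case is kept independently with probability `sampling_fraction`,
    a crude stand-in for the fact that not all cases get sequenced.
    """
    if not 0.0 <= sampling_fraction <= 1.0:
        raise ValueError("sampling_fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so repeated runs give the same sample
    counts = Counter()
    for day, deme, variant in cases:
        # rng.random() is in [0, 1), so 1.0 keeps every case and 0.0 keeps none.
        if rng.random() < sampling_fraction:
            counts[(day, deme, variant)] += 1
    return counts


def counts_to_tsv(counts):
    """Render counts as tab-separated text, sorted by day, deme, and variant."""
    lines = ["day\tdeme\tvariant\tcases"]
    for (day, deme, variant), n in sorted(counts.items()):
        lines.append(f"{day}\t{deme}\t{variant}\t{n}")
    return "\n".join(lines)


# Made-up cases: (day, deme, variant label from some clustering step).
example_cases = [
    (1, "north", "A"), (1, "north", "A"), (1, "tropics", "B"),
    (2, "north", "A"), (2, "north", "B"), (2, "tropics", "B"),
]
print(counts_to_tsv(count_variant_cases(example_cases)))
```

A single uniform `sampling_fraction` is the simplest possible choice; real sequencing effort varies by location and time, so a per-deme or per-day fraction might be closer to the bias described above.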