nextstrain / nextfrequencies

Geographic filtering #1

Open trvrb opened 3 years ago

trvrb commented 3 years ago

I really like the general direction, and having the prototype is super helpful. My immediate main use case / interest is in situations like B.1.526 (the "New York variant"), which is of interest due to its constellation of spike mutations alongside its recent rise in New York:

[Figure: logistic fit of B.1.526 frequency in New York]

At the moment, there are 748 B.1.526 viruses in GISAID, but if we look at this in current Nextstrain we have just 2 in the North American build and just 37 in the SPHERES New York build. This makes accurate estimation of frequencies quite difficult.

So, in the "nextfrequencies" case, I'd want a JSON with enough granularity to filter to country USA or division New York and look at the frequency of B.1.526 (or the frequency of 253G+484K).

I would first think this could be accomplished by adding "region", "country" and "division" columns to the list of "traits" in the data/frequencies.json file and then exposing the ability in the app to both (1) "group by" and (2) "filter by" elements in "traits".
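As a rough illustration of the "filter by" and "group by" idea, here is a minimal Python sketch. The record layout (one row per haplotype and geography with a `count`) is an assumption made for this example, not the actual schema of data/frequencies.json, and the counts are toy values.

```python
from collections import defaultdict

# Hypothetical records: the real layout of data/frequencies.json is not
# specified in this thread, and these counts are made up for illustration.
records = [
    {"lineage": "B.1.526", "region": "North America", "country": "USA", "division": "New York", "count": 30},
    {"lineage": "B.1.1.7", "region": "North America", "country": "USA", "division": "New York", "count": 10},
    {"lineage": "B.1.2", "region": "North America", "country": "USA", "division": "New York", "count": 60},
    {"lineage": "B.1.429", "region": "North America", "country": "USA", "division": "California", "count": 50},
    {"lineage": "B.1.2", "region": "North America", "country": "USA", "division": "California", "count": 50},
]


def filter_by(records, **traits):
    """Keep only records whose traits match every given key/value pair."""
    return [r for r in records if all(r.get(k) == v for k, v in traits.items())]


def group_by(records, trait):
    """Sum counts per value of `trait` and normalize them to frequencies."""
    totals = defaultdict(int)
    for r in records:
        totals[r[trait]] += r["count"]
    n = sum(totals.values())
    return {value: total / n for value, total in totals.items()} if n else {}


# Filter to division New York, then group by lineage.
new_york = filter_by(records, division="New York")
frequencies = group_by(new_york, "lineage")
```

With real data the filter would run over many more rows per geography, which is where the size concern comes in.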

In this case you'd hope that the resulting JSON wouldn't be too bloated by splitting "haplotypes" based on geography. However, this seems doubtful as we have 1411 divisions currently categorized and you could easily imagine a >100X increase in JSON size from incorporating division.

Thus, it seems necessary to pre-build a series of JSONs filtered to various geographies. This would be quick, and it wouldn't be difficult to serve a number of different JSON files along the lines of:

- ncov_north-america_frequencies.json
- ncov_USA_frequencies.json
- ncov_New-York_frequencies.json

and then we just need an interface to select the JSON file of interest.
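A pre-build step along these lines could be sketched as follows. The record layout, the helper names (`geo_filename`, `split_by_geography`, `write_files`) and the toy counts are all assumptions for illustration; the casing of the geography part of the filename is also not settled above, so this sketch only swaps spaces for hyphens.

```python
import json
from collections import defaultdict
from pathlib import Path

# Hypothetical input records with made-up counts; not the real schema.
records = [
    {"lineage": "B.1.526", "region": "North America", "country": "USA", "division": "New York", "count": 30},
    {"lineage": "B.1.429", "region": "North America", "country": "USA", "division": "California", "count": 50},
    {"lineage": "B.1.1.7", "region": "North America", "country": "Canada", "division": "Ontario", "count": 20},
]


def geo_filename(name):
    """Build a filename following the ncov_<geography>_frequencies.json pattern."""
    return f"ncov_{name.replace(' ', '-')}_frequencies.json"


def split_by_geography(records, levels=("region", "country", "division")):
    """Map each output filename to the subset of records for that geography."""
    files = defaultdict(list)
    for record in records:
        # A set avoids adding a record twice when two levels share a name.
        for name in {record[level] for level in levels}:
            files[geo_filename(name)].append(record)
    return dict(files)


def write_files(files, out_dir):
    """Write one JSON file per geography into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for filename, subset in files.items():
        (out / filename).write_text(json.dumps(subset, indent=2))


files = split_by_geography(records)
```

Note that geographies at different levels with the same name would land in the same file here, so a real build would probably want the level in the filename or a lookup table.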

And as discussed you could imagine an interface to compare multiple JSON files, which could "color by" the same type across multiple frequency panels. This could expose things like B.1.1.7 frequencies across multiple countries or could compare clade frequency predictions across different models. This approach is nice in that you can treat geographic "filtering" in the same fashion as different prediction models.
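To make the comparison idea concrete, here is a small Python sketch that pulls the same type out of several already-loaded datasets so it could be drawn in one color across panels. The `pivots` / `frequencies` layout, the `compare_type` helper and the numbers are assumptions for illustration only; the dataset keys could just as well be different prediction models rather than geographies.

```python
# Hypothetical, already-loaded JSON contents keyed by dataset name.
# Values are toy numbers, not real frequency estimates.
datasets = {
    "USA": {
        "pivots": ["2021-01-01", "2021-02-01"],
        "frequencies": {"B.1.1.7": [0.01, 0.05], "B.1.526": [0.0, 0.02]},
    },
    "United-Kingdom": {
        "pivots": ["2021-01-01", "2021-02-01"],
        "frequencies": {"B.1.1.7": [0.6, 0.9]},
    },
}


def compare_type(datasets, type_name):
    """Return the frequency series of one type per dataset (None if absent)."""
    return {
        name: data["frequencies"].get(type_name)
        for name, data in datasets.items()
    }


# One series per panel, all for the same type.
b117_by_dataset = compare_type(datasets, "B.1.1.7")
```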

Does this seem like a reasonable approach?

trvrb commented 3 years ago

Also, I think it's pretty instructive to look at how the well-thought-out covidcg.org does things. It provides frequencies of clades, lineages and AAs across the entire genome (one at a time). It's fully expressive in terms of filtering by geography, but this means a very long list of check boxes for different regions, countries and admin divisions. I like how you can get to exactly the view you're interested in. However, it's too much clicking much of the time.

Here is S:253G across a few different states:

[Figure: S:253G across states]

Here is B.1.526 in New York:

[Figure: B.1.526 in New York]