emmahodcroft commented 2 years ago

A few people have commented that it could be interesting to incorporate case-count data. I presume we could pull this from Johns Hopkins or OWID? I also assume we'd want a standardized count as used by OWID (X per million, etc.).

Two ways this could be incorporated (either, or both):

1. Second Y axis added to Per Country plots and the number of cases overplots the variants (can be turned on/off)
2. Multiply together case count & variant frequency to generate a plot showing % variants under the case-count curve (see figure in this paper - though this is per # sequences). Big caveat here is that we likely only want to make this possible for countries doing a good amount of sequencing. We'd probably want to generate some list of countries that have >=X sequences per case (or X sequences per population) and only allow these transformations on these countries. But for countries with good sequencing, this could be very interesting.
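A rough sketch of how the "list of countries with >=X sequences per case" could be generated. The function name, the input dictionaries, and the threshold value are all hypothetical, not existing CoVariants code, and the real value of X is still to be decided:

```python
from typing import Dict, List


def eligible_countries(
    sequences: Dict[str, int],
    cases: Dict[str, int],
    min_sequences_per_case: float,
) -> List[str]:
    """Return countries whose sequences-per-case ratio meets the threshold.

    `sequences` and `cases` map country name -> total count (assumed inputs).
    Countries with no reported cases are skipped, since the ratio is undefined.
    """
    eligible = []
    for country, n_sequences in sequences.items():
        n_cases = cases.get(country, 0)
        if n_cases > 0 and n_sequences / n_cases >= min_sequences_per_case:
            eligible.append(country)
    return sorted(eligible)
```

The same shape would work for "X sequences per population" by passing population counts in place of `cases`.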

ivan-aksamentov commented 2 years ago

> incorporate case-count data

I need to better understand the general idea of it. That is, what are we trying to demonstrate? The implementation, especially the visualization, will be based on this idea.

Currently I think we can split the problem into multiple tasks:

[x] 1. Fetch raw data from JHU/OWID
[ ] 2. Perform relevant scientific transformations on raw data and produce output data that is ready to be rendered. Let's call it the "web data" as we usually do.
[ ] 3. Render the web data

For 1:

That's the somewhat easy and straightforward part.

We've been dealing with OWID data + some country-specific data in the scenarios project. @nnoll set up the data fetching once and it's been running on its own for almost 2 years now: https://github.com/neherlab/covid19_scenarios/tree/master/data

We have a bot that tirelessly updates the data daily: https://github.com/neherlab/covid19_scenarios/pull/896 so once done, it's a very low maintenance thing.

The script that fetches data from OWID is in: https://github.com/neherlab/covid19_scenarios/blob/e9c1bdc771f0ee7f6c6b370cebed0fdc9061ff14/data/parsers/owid.py

The URL being used is https://covid.ourworldindata.org/data/owid-covid-data.json

The actual resulting tsv data is in data/case-counts/owid, plus some more elaborate data for a few countries that have it in data/case-counts/<country_name>.

There's also data on country population, age distribution etc.

We could steal most of it and adapt for the needs of CoVariants.
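A minimal sketch of what the adapted fetch step might look like, using only the standard library. This is not the existing `owid.py` parser. It assumes the JSON at the URL above is keyed by country code, with each entry holding a `location` name and a `data` list of daily records that carry a `date` and fields such as `new_cases_per_million`; the function names are made up for illustration:

```python
import json
from typing import Dict, List, Tuple
from urllib.request import urlopen

OWID_URL = "https://covid.ourworldindata.org/data/owid-covid-data.json"


def fetch_owid(url: str = OWID_URL) -> dict:
    """Download and decode the OWID JSON file."""
    with urlopen(url) as response:
        return json.load(response)


def extract_series(
    owid: dict, field: str = "new_cases_per_million"
) -> Dict[str, List[Tuple[str, float]]]:
    """Map country name -> list of (date, value) pairs for one field.

    Records that lack the field or the date are dropped, and countries
    left with no data points are omitted. The JSON layout is an assumption.
    """
    series = {}
    for entry in owid.values():
        name = entry.get("location")
        points = [
            (row["date"], row[field])
            for row in entry.get("data", [])
            if "date" in row and row.get(field) is not None
        ]
        if name and points:
            series[name] = points
    return series
```

Calling `extract_series(fetch_owid())` would then give per-country series that could be written out as TSV, similar to what the scenarios project stores.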

For 2:

I am not sure what needs to be done there. If there's anything, then @emmahodcroft is probably the best person to do it.

The things like

> Multiply together case count & variant frequency to generate a plot showing

will probably go there. Except instead of generating a plot, it would generate JSON data for step 3 to render.
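To make that concrete, here is a rough sketch of such a transformation. The output layout and all names are hypothetical (the actual web data format is not decided here), and it assumes case counts and variant frequencies are already aligned on the same dates:

```python
import json
from typing import Dict, List


def build_web_data(
    country: str,
    dates: List[str],
    cases: List[float],
    frequencies: Dict[str, List[float]],
) -> dict:
    """Split each date's case count among variants by their frequency.

    `frequencies` maps variant name -> one frequency per date (assumed to
    sum to 1 across variants at each date). The returned dict is plain
    JSON-serializable data for the frontend to render.
    """
    distribution = []
    for i, date in enumerate(dates):
        distribution.append({
            "date": date,
            "total_cases": cases[i],
            "cases_by_variant": {
                variant: cases[i] * freqs[i]
                for variant, freqs in frequencies.items()
            },
        })
    return {"country": country, "distribution": distribution}
```

The result can be written out with `json.dumps(build_web_data(...), indent=2)`, one file or one entry per country.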

For 3:

There are many ways to render this data. It needs some thought. Also, the current plots are already very busy, so things like a second axis will be challenging.

Questions:

> Second Y axis added to Per Country plots and the number of cases overplots the variants (can be turned on/off)

This will look like the second Y axis for "Per variant" plots, right?

> Multiply together case count & variant frequency to generate a plot showing % variants under the case-count curve

That would look like a second X axis for the "Per variant" plots, right?

emmahodcroft commented 2 years ago

Thanks for looking at this Ivan!

Yes, I was hoping step 1 would be fairly straightforward to implement, and am glad to hear it might be. If you want, you could get this part working and have it write its output to a new folder - I can then pull down the OWID case data and start playing around with how to "do step 2" as part of my scripts, and end up with the same types of file as we use for the other charts. Is it possible to go ahead and get the data which is already standardized to X per million population (or whatever OWID uses)? I don't think we have any reason to work with anything but this.

I think possibly the most popular thing (and perhaps easiest) would be to show a plot of the cases with the area under the curve coloured by variant (a bit like the chart below, though I'd be showing per case instead of per sequence, I suppose). We could do this two ways, without having to worry about extra axes (as the point that the graphs are already quite crowded is well taken).

One way to tackle step 3 would be to add another page to the website and thus also to the top-bar of the website. "By cases" or similar, for example. Then, this would show the same type of charts but with the Y axis being cases, and then colouring the area under the case count by the % variant. This might be slightly more straightforward but would mean another entry along the top of the page. We should only display countries that have passed some kind of test - I'll likely set up my own scripts to only generate data for countries with at least X sequences per case or similar.

Another way would be to add a switch to the Per Country page and allow users to switch between the current view and the view described above - where the percents switch to being under the case count curve. This would mean we don't need an extra page, and would allow more direct comparison of the two views, but I imagine it would be more complex. In this case, we would want to show blank or greyed-out graphs for any countries for which I haven't generated data (because I consider the estimate to be not based on enough sequences).

ivan-aksamentov commented 2 years ago

I started the work in

hodcroftlab / covariants

Incorporate Case Count Data #243

249