MoiraZuber opened 2 years ago
The latest updates on your projects.
Name | Status | Preview | Updated |
---|---|---|---|
covariants | ✅ Ready (Inspect) | Visit Preview | Mar 23, 2023 at 4:52PM (UTC) |
The current script produces output by multiplying the bi-weekly case counts data with the mutation proportions queried from covSPECTRUM and summing the results across clusters. Two output files are created in web/data: `perCountryMutationAnalysis.json`, which sorts the mutation proportions by country, and `perMutationAnalysis.json`, which sorts by mutation.
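The combination step described above can be sketched roughly as follows (a minimal sketch only: the function and variable names are illustrative, not the actual ones in the script):

```python
def mutation_exposure(case_counts, mutation_props):
    """Combine bi-weekly case counts per cluster with per-cluster mutation
    proportions: for each (country, mutation, interval), estimated cases
    carrying the mutation = cases * proportion, summed across clusters."""
    result = {}  # {country: {mutation: {interval: cases_with_mutation}}}
    for country, clusters in case_counts.items():
        per_country = result.setdefault(country, {})
        for cluster, intervals in clusters.items():
            for mutation, proportion in mutation_props.get(cluster, {}).items():
                per_mut = per_country.setdefault(mutation, {})
                for interval, cases in intervals.items():
                    per_mut[interval] = per_mut.get(interval, 0.0) + cases * proportion
    return result
```

Sorting the same numbers by country first or mutation first then gives the two JSON outputs.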
Points of discussion:

- Which clusters to query for. Currently, the list of clusters is produced from all entries in `clusters.py` that have `"type" == "variant"` and `"graphing" == True`.
- How to query from covSPECTRUM: currently, mutations are obtained for a certain cluster by querying via snps.
- Mutations are currently queried with an arbitrary threshold (10%).
- Defining mutations from `clusters.py` are currently not used. -> Which mutations do we use and how do we sort mutations by importance?

Advanced points:

- Which countries to use? Could we average over neighboring countries?
- Visualize results -> Plot by country or mutation? Currently, both files are produced.
Thank you for this Moira! This is a great start and I really appreciate your work!
> Which clusters to query for. Currently, the list of clusters is produced from all entries in `clusters.py` that have `"type" == "variant"` and `"graphing" == True`.
I think this is the right set to query for ✔️
> How to query from covSPECTRUM: Currently, mutations are obtained for a certain cluster by querying via snps. Is this reliable enough or would switching to Pango lineages be better? Are there other methods with which to query by cluster?
We should actually be able to query Nextstrain clades, which I think might be a nice starting point as we can avoid the nitty-gritty for the time being and see if we need to wade into it later. Should be in the format like "21L" (without the '(Omicron)') - here's a web link as an example: https://cov-spectrum.org/explore/Switzerland/AllSamples/Past6M/variants?nextstrainClade=21L&
We can pull this directly from `nextstrain_name` in `clusters.py`, which also omits the extra bit...... ok apparently I've switched format here over time... 🙁
I think we should convert `nextstrain_name` to just be the number-letter name and then we can use this. For example for Alpha, it's currently `20I (Alpha, V1)` and we should change this to `20I`. This also lines us up better for upcoming changes in Nextstrain.
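The conversion is probably a one-liner (sketch; assumes the current names look like `20I (Alpha, V1)` with the clade code first):

```python
def to_clade_code(nextstrain_name):
    """Reduce a name like '20I (Alpha, V1)' or '21L (Omicron)' to the bare
    number-letter clade code ('20I', '21L'); names that already lack a
    parenthetical pass through unchanged."""
    return nextstrain_name.split(" ")[0].strip()
```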
> Mutations are currently queried with an arbitrary threshold (10%).

This seems fine for now - I think we're most interested in the more common mutations people have encountered.
> Defining mutations from `clusters.py` are currently not used.
I actually think this should be ok too.... since hopefully all defining mutations are in there and at high percents!
> -> Which mutations do we use and how do we sort mutations by importance?

Let's get a list of the positions in S in the receptor binding domain (RBD) and N-terminal domain (NTD) and pull just these out - they should be the most critical.
> Which countries to use? Could we average over neighboring countries?

Ohhh, interesting point. Let's keep that one on the back burner. For now, how about starting with the list I put in the Slack?
> Visualize results -> Plot by country or mutation? Currently, both files are produced.
Hm, good to produce both as I'm not sure what we may want long term, but intuitively I think it would be interesting to plot by country first - what's the estimated breakdown of which mutations have been seen?
At the moment I'm not sure whether this will go on the site or not, so would you mind adding some simple plots of the files you're generating - so it outputs them as image files? No need for anything too fancy - but this will give us an idea as we go of what we're looking at!
I see you've got a bi-weekly mutation percent, I wonder if we should also try querying CoV-Spectrum by intervals (you can specify dates in the query) or we are doing good enough already - what do you think? Could try both and see if it makes a difference, perhaps?
Do you think that gives some good next steps?
When querying with nextstrainClade (extracted with `clusters[clus]["display_name"].split(" ")[0]` to remove the part in parentheses) instead of snps, the following clusters yield valid responses:
Alpha (nextstrainClade 20I), Beta (nextstrainClade 20H), Gamma (nextstrainClade 20J), Delta (nextstrainClade 21A), 21I.Delta (nextstrainClade 21I), 21J.Delta (nextstrainClade 21J), 21K.Omicron (nextstrainClade 21K), 21L.Omicron (nextstrainClade 21L), 22A (nextstrainClade 22A), 22B (nextstrainClade 22B), 22C (nextstrainClade 22C), 22D (nextstrainClade 22D), Kappa (nextstrainClade 21B), Eta (nextstrainClade 21D), Iota (nextstrainClade 21F), 21GLambda (nextstrainClade 21G), 21H (nextstrainClade 21H), EU1 (nextstrainClade 20E), Epsilon (nextstrainClade 21C)
These are not found on covSPECTRUM:
20BS732 (nextstrainClade 20B/S:732A), 20AS126 (nextstrainClade 20A/S:126A), S439 (nextstrainClade 20A/S:439K), S677HRobin1 (nextstrainClade S:677H.Robin1), S677PPelican (nextstrainClade S:677P.Pelican), EU2 (nextstrainClade 20A.EU2), S98 (nextstrainClade 20A/S:98F), S80 (nextstrainClade 20C/S:80Y), S626 (nextstrainClade 20B/S:626S), S1122 (nextstrainClade 20B/S:1122L)
@emmahodcroft Do these missing ones have different nextstrainClades than what is in `clusters[clus]["display_name"]`, or do they just not have a counterpart on covSPECTRUM?
> Let's get a list of the positions in S in the receptor binding domain (RBD) and N-terminal domain (NTD) and pull just these out - they should be the most critical.

Where could I get this list from?
> I see you've got a bi-weekly mutation percent, I wonder if we should also try querying CoV-Spectrum by intervals (you can specify dates in the query) or we are doing good enough already - what do you think? Could try both and see if it makes a difference, perhaps?

I got this running (querying by 2-week intervals); however, this takes a lot of time. I was wondering whether we could get the data downloaded directly sorted by date instead of querying for each interval separately, but I couldn't figure out whether this is possible. Do you by chance know?
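Generating the 2-week windows for those queries could look like this (a sketch only: the helper name is illustrative, and how the real script passes the dates into the CoV-Spectrum query is not shown here):

```python
from datetime import date, timedelta

def biweekly_intervals(start, end):
    """Yield (date_from, date_to) pairs covering [start, end] in 2-week
    steps; the final window is clipped to the end date. Each pair would
    become the date bounds of one covSPECTRUM query."""
    cur = start
    step = timedelta(days=14)
    while cur < end:
        nxt = min(cur + step, end)
        yield cur, nxt
        cur = nxt
```

One query per window is what makes this slow; a single download sorted by date would replace the whole loop.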
Here's a few example plots:
Sorry Moira, somehow I missed this completely. Trying to come back to it a little now!
I think one step forward that's probably worth making is simplifying the Nextstrain names. However, I can't seem to get around to doing that so I've done the next best thing - I've added `nextstrain_clade`, which is guaranteed to just be the two-digits-and-letter (ex: 21A). This should give the same results but it's just a bit tidier and more reliable.
For the clusters that don't have Nextstrain names and so don't return hits, we can try to use Pango lineages... I've added this and am seeing if that works. Probably not so critical to include these, as they probably had few mutations that we're really interested in now, but if there's a simple way to include them, why not... I think this works.
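The fallback could be sketched like this (an illustration only: the field names mirror the `clusters.py` fields discussed in this thread, the query parameter names `nextstrainClade` / `pangoLineage` are taken from the CoV-Spectrum web links above, and the exact shape of the real `build_url` will differ):

```python
from urllib.parse import urlencode

def build_query_params(cluster, country):
    """Prefer the Nextstrain clade; fall back to a Pango lineage when the
    cluster has no clade; return None if neither is available (so the
    caller can skip the cluster)."""
    params = {"country": country}
    if cluster.get("nextstrain_clade"):
        params["nextstrainClade"] = cluster["nextstrain_clade"]
    elif cluster.get("pango_lineages"):
        # assumed shape: a list of {"name": ...} entries, as in clusters.py
        params["pangoLineage"] = cluster["pango_lineages"][0]["name"]
    else:
        return None
    return urlencode(params)
```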
> Where could I get this list from?
I think to keep it simple for now, I'm happy to go with a legit-looking paper's definition. For RBD, this article includes "Thr333–Gly526 of the SARS-CoV-2 RBD", while this article says "we identified the region of SARS-CoV-2 RBD at residues 331 to 524 of S protein". For NTD, this article says "residues 15–305 of the SARS-CoV-2 spike protein ... were aligned against the NTD sequences", whereas this one says "NTD (aa 13–318)". So I think anything in these ranges works! I have randomly picked 333-526 and 15-305 for the moment... So currently the script only stores mutations in these two regions.
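With those picked boundaries, the position filter is small (sketch; assumes aa mutations arrive as strings like `S:E484K` or `S:144-`, which matches the notation used elsewhere in this thread):

```python
import re

# Boundaries picked from the papers quoted above; nearby alternatives would do too
RBD = range(333, 527)   # 333-526 inclusive
NTD = range(15, 306)    # 15-305 inclusive

def in_rbd_or_ntd(mutation):
    """True if a spike aa mutation ('S:E484K', deletion 'S:144-') falls in
    the chosen RBD or NTD ranges; non-spike mutations never match."""
    m = re.match(r"S:[A-Z]?(\d+)", mutation)
    if not m:
        return False
    pos = int(m.group(1))
    return pos in RBD or pos in NTD
```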
> I got this running (querying by 2-week intervals), however, this takes a lot of time
Hm, I've now forgotten what the original `get_mutation_proportions_biweekly` did... But I can see how doing 2-weekly queries takes a lot of time. I will see if I can figure out how this worked originally, if there's already a way to break things down by some kind of time element to see proportions changing over time... I suppose we know the proportion of the variant itself over time already, from CoV, right - so perhaps we can just use this? Multiplying the variant % by the mut % for that variant...
The only downside is if a variant got a particular mutation later on, and case numbers changed, then this will change...
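That shortcut is cheap to compute (a sketch under the assumption just stated: one overall mutation proportion per variant, so - as noted above - a mutation a variant gained mid-wave gets smeared over the variant's whole run):

```python
def mutation_freq_over_time(variant_freqs, mut_prop_by_variant):
    """Approximate a mutation's frequency per interval as the sum over
    variants of variant_freq(interval) * mutation_prop(variant)."""
    out = {}
    for variant, freqs in variant_freqs.items():
        p = mut_prop_by_variant.get(variant, 0.0)
        for interval, f in freqs.items():
            out[interval] = out.get(interval, 0.0) + f * p
    return out
```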
I can also try to ask Chaoran if there's a better way, but I feel like I need to try and wrap my head around the scripts a little more before I formulate a question!
Going through the script I'm a little confused about the role `country` is playing in `get_mutation_proportions` -- it doesn't seem to go into the URL building, so are we getting different data for each country? Your plots show differences between countries so I figure something is happening, but I'm not 100% piecing it together.... but when I'm running pieces of the script, I'm getting the same proportion for each country 😅
I wonder if perhaps there's one more edit that didn't get pushed? (this might be messy now, since I'm about to push!)
I added something to include `country` in the `build_url` query which seems to work, I think...!
I got this running with the changes, through to making the plots (🎉)
I think unfortunately my take at the moment is that the case counting is probably messing us up quite bad. The proportion of cases detected vs real cases will vary from country to country, but perhaps it's more than I was anticipating.
For example, Denmark & Sweden had less 21K/BA.1 than most other countries, and bigger 21L/BA.2 waves. 21K has 144-, whereas 21L (and basically everything except 21K in Omicron) has 213G.
But if we look at 144- (proxy for 21K), you can see that Sweden/Denmark have as much exposure as on average in other countries... But I guess this may still be possible even if they had a smaller wave...? It's sensible that Portugal has more (from looking at waves/cases) and S Africa has less....
If we look at 213G (proxy for 21L + then everything after that), we'd expect Denmark and Sweden to be on top. Denmark is - and Sweden maybe was pretty high in early 2022 (see the bump in early 2022) - but it's quite far from Denmark. I suppose the later rises are the waves of subsequent Omicron in other countries.
Interestingly one can compare this to mutation 213G (all Omicron after 21K/BA.1) with 142D (same, except it was also in 21A.Delta...) - you can see how much longer the tails are to the right, due to Delta waves.
So this tells me we are on the right track - I can see the patterns - but I think the scaling up of cases is possibly more important than I thought... as otherwise we presume that Denmark is basically the best-prepared country out there for every mutation 🙃 Will try and think if there's a way to tackle this without it becoming completely unfeasible...
I really did forget to query for country. 🙈 How silly of me. Thanks for catching that! Interestingly it doesn't seem to make too much of a difference. I compared some of the plots, and most of them look rather similar. Here's one with a stronger difference I found: (pic above is new, pic below is old queries)
This means that a given variant won't vary strongly in between countries, right? I'm just wondering if we could somehow make use of this to make things more efficient... 🤔 But I guess it would still be too potentially interesting to catch those differences to discard per country queries entirely...
@MoiraZuber in the above push I just added the one thing I (apparently) hadn't pushed up before - an additional catch to check that `pango_lineages` is present before trying to use it in the URL - if you could just check this seems sensible that would be great! I'm not even sure if we need it, but seemed better safe than sorry in case some don't have the Pango name.
From chatting with Richard he suggests we modify how we're storing the cases -- instead of storing the cumulative number of cases that have seen mutation X, he suggests storing, for each 2 week period the absolute number of cases that have seen mutation X. (So just the cases in that 2 week period, calculated from the mutation frequency * case numbers).
The advantage of this is that from this we can get everything else (such as cumulative, just by summing) and have more flexibility in how we can incorporate reinfection (and even potentially in future immune fading) in the future - as the absolute number shouldn't change - just then how many we end up putting to "cumulative" or similar. (So all more complex calculations can be post-hoc to the calculation of how many people saw mutation X in a week as a very raw value - hope that makes sense!)
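In code the switch is small (sketch; the function names are illustrative, not the script's):

```python
from itertools import accumulate

def absolute_per_interval(case_counts, mut_prop):
    """Store raw per-interval exposures: cases in each 2-week window times
    the mutation frequency, with no accumulation."""
    return [cases * mut_prop for cases in case_counts]

def cumulative(per_interval):
    """Cumulative exposure is then just a running sum, computed post-hoc
    whenever it's wanted - as is any more complex reinfection adjustment."""
    return list(accumulate(per_interval))
```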
Building on this, we can expect, certainly if we're working through 2022/Omicron, that we'll end up with more than 100% of the population infected - or that each person is infected more than once. (A rough value according to some studies in South Africa would be ~1.7 infections per person - but will vary per country a bit.)
Infection is Poisson distributed (a rare event that happens randomly), so the fraction of people who have never been infected is e^-1.7 (or e^-λ for whatever the overall infection rate is at a given point in time) - this lets us figure out how many of the "seeing mutation X" in a given week are "new" vs "reinfection" (aka don't add to the cumulative). _This assumes that having previously seen a mutation doesn't protect you from infection - which is what we're assuming for now!!_
I hope this makes sense, but we can also chat through it! I'm still parsing it a bit in my brain too, but I think it sounds like a good idea - as we can then be much more flexible in how we 'sum up' the weekly 'absolute exposures'!
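A sketch of that correction (names illustrative; assumes, as above, that prior exposure gives no protection, so first infections land on the never-infected fraction in proportion to its size):

```python
import math

def fraction_never_infected(mean_infections):
    """Poisson P(k=0) = e^-lambda: the share of the population with zero
    infections so far, given the mean number of infections per person."""
    return math.exp(-mean_infections)

def split_new_vs_reinfection(weekly_exposures, mean_infections):
    """Crudely split one week's 'saw mutation X' count into first
    infections vs reinfections; only the first group would be added to
    the cumulative total."""
    p0 = fraction_never_infected(mean_infections)
    new = weekly_exposures * p0
    return new, weekly_exposures - new
```

With λ ≈ 1.7 (the South Africa figure above), e^-1.7 ≈ 0.18, so most of a week's exposures would count as reinfections.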
Finally, thought it might be worth putting our checklist from Friday 20th Jan 23 in here so things could actually be checked off!
Top to-dos:
[x] Check query for 22D as most of the defining mutations seem empty [Moira] (Seems to be fine)
[x] Modify how we're storing # of cases to be absolute rather than cumulative (see comment) ?
[ ] Integrate the estimated case numbers from OWID
[x] Add India, Brazil, Thailand, Australia to the countries
[ ] Do the testing on the 3 test cases to see if we can see the difference we should see - I think (?) we can just use the plots we generate above? (Possible exception 21L)
[x] Ask Chaoran about whether better way to get time data from CoV-Spec [Emma] (No way to do this, faster querying possible in a couple months)
Cases are now not accumulated before storing to json.
Also, I downloaded the IHME model csv and ran the pipeline with the estimated case counts (instead of confirmed cases). This needs a new script (`include_case_counts_estimates.py`), which does essentially the same as `include_case_counts.py`, but for the estimated cases. Here's some example output:
The plots look very different now! Many curves go over 100%, so dealing with reinfection will be relevant.
TODOS:
This PR aims to introduce a mutation analysis step to CoVariants. The goal is, for a given mutation and country, to obtain an estimate of what percentage of the population has encountered this mutation before. This is done by combining case counts data (which estimates the number of cases per bi-weekly interval for each cluster in a given country) with mutation proportions taken from covSPECTRUM (which mutation has been seen at what percentage in a given cluster).