Programmatic analysis of variants

Hi CoVariants team!

As part of the Cambridge Festival, @bethsampher & @JamesABaker (both at the Wellcome Genome Campus) ran a hackathon to spark ideas for making use of the large amount of sequencing data coming out of current Covid-19 projects.

One product of that was my (potentially) dominant variant finder. My thinking was: with enough localised data, could particularly dominant variants be spotted programmatically?

Currently it just uses an old snapshot of the CoVariants country data file. However, if it would be of some use to yourselves or others outside the project, I'd be happy to expand it to do this for the entire dataset routinely.

I'm a scientist, but can claim no expertise in this area (I'm a chemist...), so I'm unsure of its utility. If the project is of some use, we'd be very grateful for any feedback you could provide! More discussion can be found in the issue here

Thanks for your time,

Maddyboo

An algorithm that could address this problem, along with the motivating arguments, is below.

Alpha started in the UK and successfully outcompeted most other variants to become the dominant strain worldwide.
Alpha itself was outcompeted by Delta, which has now become (or is rapidly on the way to becoming) the dominant strain worldwide.
Anything that can outcompete Delta is worth looking at carefully and soon.
Something is pushing back Delta in India – maybe AY.1, maybe AY.2, maybe something else. But whatever it is, we should understand what is outcompeting Delta.
This concept is related to: https://github.com/wgc-hackathon/covid/issues/14 and https://github.com/hodcroftlab/covariants/issues/197 – namely that the site should be detecting and presenting significant risks, which are indicated by variants that outcompete (push back) strong variants.
A simple algorithm that directly addresses the above and the question in issues 14 and 197 might be this:

Identify “dominant variants”: those that represent at least X% (X=10?) of Covid sequences generated worldwide in the last month.
In each country (or state), see whether new variants are pushing back dominant variants, using a chi-squared test that compares a past time period with a more recent time period.
For instance, in South Korea:
i. Past time period: May 3-17
ii. More recent time period: May 17-31
iii. Alpha dropped from 71 to 3
iv. Others grew from 35 to 97
v. 2x2 chi-squared test gives p < 0.00001 (https://www.socscistatistics.com/tests/chisquare/default2.aspx)
Same test in India, comparing Delta vs. others, gives p = 0.04.
Of course, we’d need to do this with specific variants, not the entire bucket of “others”.
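To make the steps above concrete, here is a rough Python sketch using only the standard library. It is an illustration under assumptions, not the site's or the hackathon tool's actual code: the function names `dominant_variants` and `chi2_2x2`, the `counts` dictionary shape, and the default 10% threshold are all hypothetical. The test is a plain Pearson chi-squared on a 2x2 table without continuity correction, which may differ slightly from what the online calculator does. The only data used is the South Korea example quoted above.

```python
import math


def dominant_variants(counts, threshold_pct=10.0):
    """Return variants whose share of all sequences is >= threshold_pct.

    `counts` is a hypothetical mapping {variant_name: sequence_count}
    for the chosen window (e.g. worldwide, last month).
    """
    total = sum(counts.values())
    if total == 0:
        return []
    return sorted(
        name for name, n in counts.items()
        if 100.0 * n / total >= threshold_pct
    )


def chi2_2x2(a, b, c, d):
    """Pearson chi-squared test (no continuity correction) for the table

        [[a, b],
         [c, d]]

    Returns (statistic, p_value). With 1 degree of freedom the p-value
    has the closed form erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    denom = row1 * row2 * col1 * col2
    if denom == 0:
        raise ValueError("every row and column needs a non-zero total")
    stat = n * (a * d - b * c) ** 2 / denom
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# South Korea example from this thread:
# rows = (Alpha, others), columns = (May 3-17, May 17-31)
stat, p = chi2_2x2(71, 3, 35, 97)
print(f"chi2 = {stat:.2f}, p = {p:.3g}")
```

For the South Korea counts this gives a p-value far below 0.00001, in line with the figure quoted above. To apply it per variant rather than to the whole "others" bucket, the same function could be called once per candidate variant, with that variant's counts in one row and the dominant variant's counts in the other (ideally with some correction for running many tests).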

hodcroftlab / covariants

Programmatic analysis of variants #143