hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0

ENH: Add India to case plots - India is really interesting case #346

Open corneliusroemer opened 1 year ago

corneliusroemer commented 1 year ago

Quite a few Asian countries are missing from the super useful cases section.

In particular, I'm dearly missing India, and also Indonesia, Malaysia, Vietnam.

These are populous countries with significance for variant evolution.

emmahodcroft commented 1 year ago

Yes, 100% feel you. Currently they aren't included as they don't pass the threshold for "at least 2% of cases in at least 40% of the 2-week periods tracked by CoVariants since the end of 2020." This is to try and ensure that the sequencing is a representative sample of the country - if sequencing numbers are too low, my concern is that they shouldn't really be translated over to cases.
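To make the quoted threshold concrete, here is a minimal sketch of how such a country-level check could be written. It is not the code CoVariants uses: the function name, the input layout (one `(sequences, reported_cases)` pair per 2-week period), and the choice to treat periods with zero reported cases as failing are all assumptions made for illustration.

```python
def passes_country_threshold(periods, min_share=0.02, min_fraction_of_periods=0.40):
    """Return True if enough 2-week periods have a high enough sequenced share.

    periods: list of (sequences, reported_cases) tuples, one per 2-week period
             (hypothetical layout, not the repository's data format).
    """
    if not periods:
        return False
    # Assumption: a period with zero reported cases counts as not passing.
    good = sum(
        1 for seqs, cases in periods
        if cases > 0 and seqs / cases >= min_share
    )
    return good / len(periods) >= min_fraction_of_periods


# Made-up numbers: 3 of 5 periods reach a 2% share, so the country is included.
well_covered = [(30, 1000), (10, 1000), (5, 1000), (25, 1000), (40, 1000)]
# Made-up numbers: many sequences, but only 0.05% of cases, so it is excluded.
many_sequences_low_share = [(500, 1_000_000)] * 5

print(passes_country_threshold(well_covered))
print(passes_country_threshold(many_sequences_low_share))
```

The second example shows the situation the issue is about: a country can contribute hundreds of sequences per period and still fail a share-based rule.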

Do you know how far off India might be from that?

corneliusroemer commented 1 year ago

I think your criteria are suboptimal. To ensure statistical robustness, it's more important to have large counts - rather than a large proportion of cases.

It doesn't really matter if it's 1 in 1000 or 1 in 50. For India, your current criteria mean a country with good robustness drops out, but a country that reports a low number of cases because of a lack of PCR testing gets included even if it sequences only 5 a month.
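The counts-versus-proportion argument can be illustrated with the standard error of an estimated variant share. This is a rough sketch under strong assumptions: sequences are treated as a simple random sample of cases (which real sequencing rarely is), the binomial approximation is used, and the function name and all numbers are made up for illustration.

```python
import math


def share_standard_error(p, n, population=None):
    """Standard error of an estimated proportion p from n sampled sequences.

    If population (total reported cases) is given, apply the finite population
    correction. Assumes simple random sampling, which is an idealisation.
    """
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction; close to 1 when n is a small share.
        se *= math.sqrt((population - n) / (population - 1))
    return se


# Hypothetical: 500 sequences out of 500,000 cases (1 in 1000).
low_share_many_seqs = share_standard_error(0.5, 500, population=500_000)
# Hypothetical: 5 sequences out of 250 cases (1 in 50).
high_share_few_seqs = share_standard_error(0.5, 5, population=250)

print(low_share_many_seqs, high_share_few_seqs)
```

Under these assumptions the uncertainty is driven almost entirely by the number of sequences: the 1-in-1000 sample with 500 sequences gives a much tighter estimate than the 1-in-50 sample with 5 sequences. The sketch says nothing about geographic or other sampling bias, which is a separate concern.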

I'd propose these criteria: show all countries, but only those periods that have either at least 50 sequences or sequences covering at least 1% of reported cases.
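A minimal sketch of the per-period rule proposed above, assuming each period is a `(label, sequences, reported_cases)` tuple. The function names, the data layout, and the handling of zero reported cases are hypothetical and not taken from the CoVariants scripts.

```python
def period_is_shown(sequences, reported_cases, min_sequences=50, min_share=0.01):
    """Keep a period if it has enough sequences OR a high enough sequenced share."""
    if sequences >= min_sequences:
        return True
    # Assumption: with zero reported cases the share test cannot pass.
    return reported_cases > 0 and sequences / reported_cases >= min_share


def shown_periods(periods, **thresholds):
    """Filter (label, sequences, reported_cases) tuples down to the shown periods."""
    return [
        label for label, seqs, cases in periods
        if period_is_shown(seqs, cases, **thresholds)
    ]


# Made-up periods for one hypothetical country.
periods = [
    ("period_1", 500, 1_000_000),  # passes on absolute count
    ("period_2", 5, 200),          # passes on share (2.5%)
    ("period_3", 5, 10_000),       # fails both tests, so it is hidden
]
print(shown_periods(periods))
```

With this rule every country appears, but a period is only plotted when at least one of the two tests passes, so gaps in sequencing show up as gaps in the plot.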

I'd worry more about lack of absolute count of sequences than a lack of share. Sure it may be geographically skewed, but it's better than nothing.

I also don't quite understand why you're showing all the data even if coverage may be bad in 60% of time intervals.

I'd only show time intervals in which I trust the resolution. That may mean not showing what happened in between, but that's ok, since without sequencing we don't know what happened there. See my criteria above: they should make major sequencing countries show data for all periods. Countries with less sequencing may show data only for the periods in which they sequenced (that's good), and small countries may also not show in some periods if they didn't sequence much because they didn't have many cases (e.g. China, Iceland, Australia etc.).

Right now the criteria seem to produce less-than-ideal results:

emmahodcroft commented 1 year ago

Yes, the criteria aren't perfect, and there are trade-offs in taking the good with the bad. Generally I try to err on the side of not showing things that may be misleading, but it's far from a perfect balance.

Unfortunately, to change this I'd need to get a good sense of how it's working and what's included and what's not. The scripts that currently control this are the script that generates these plots and the script that lets one explore different thresholds (first, you have to run the first script with very low thresholds to include every, or almost every, country). I'm happy to take PRs, but on my own development front, apart from exploring threshold tweaks to the existing code, I'm afraid this is pretty low-priority at the moment - CoV always unfortunately comes last! (And even within CoV, this isn't at the top!) But happy to evaluate PRs & ideas.
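For anyone considering a PR, the kind of threshold exploration described above could look roughly like the sketch below. It does not reproduce the repository's scripts: the function names, the input dictionary, and the country labels are hypothetical, and the data is invented purely to show the mechanics.

```python
def countries_passing(data, min_share, min_fraction_of_periods):
    """List countries where enough periods reach the minimum sequenced share."""
    passing = []
    for country, periods in data.items():
        good = sum(
            1 for seqs, cases in periods
            if cases > 0 and seqs / cases >= min_share
        )
        if periods and good / len(periods) >= min_fraction_of_periods:
            passing.append(country)
    return sorted(passing)


def sweep_thresholds(data, shares, fractions):
    """Map each (min_share, min_fraction_of_periods) pair to the passing countries."""
    return {
        (share, fraction): countries_passing(data, share, fraction)
        for share in shares
        for fraction in fractions
    }


# Invented data: (sequences, reported_cases) per 2-week period.
data = {
    "country_a": [(30, 1000), (10, 1000), (40, 1000), (5, 1000)],
    "country_b": [(500, 1_000_000)] * 4,
}

# Start with a very low share threshold so every country is included,
# then compare against the stricter 2% / 40% setting.
results = sweep_thresholds(data, shares=[0.0001, 0.02], fractions=[0.4])
for thresholds, countries in results.items():
    print(thresholds, countries)
```

Comparing the country lists across threshold pairs gives a quick view of which countries enter or drop out as the criteria change.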

Also, on this point:

"Sure it may be geographically skewed, but it's better than nothing."

I'm not sure I agree. I think it can be very misleading if we think all the sequences are coming from one place, yet plot them as if they represent the whole country. Though this is more complicated to avoid, I would ideally want to avoid it.