hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0

Remove Zeros from Datasets #291

Closed emmahodcroft closed 2 years ago

emmahodcroft commented 2 years ago

As the repo gets larger, builds fail more often, in particular on CircleCI.

To try to reduce file size, this PR removes zero values from the files that are built for the web app (see also discussion here).

Here, I've tried to do this for Per Country, as a test of whether the web app handles this gracefully.

Currently there are still entries for weeks that have no sequences, and they will have week and total_sequences values - this seemed least likely to break. But there will be no entries for the clusters.
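A minimal sketch of this behaviour, not the actual script code. It assumes each week's entry is a dict with `week`, `total_sequences`, and a hypothetical `cluster_counts` mapping; the field names in the real generated files may differ.

```python
def drop_zero_clusters(week_entry):
    """Return a copy of one week's entry without zero-count clusters.

    `week` and `total_sequences` are always kept, so weeks with no
    sequences still appear in the output.
    """
    return {
        "week": week_entry["week"],
        "total_sequences": week_entry["total_sequences"],
        # Hypothetical key: the real output may name this differently.
        "cluster_counts": {
            name: count
            for name, count in week_entry["cluster_counts"].items()
            if count != 0
        },
    }


# Illustrative data only (made-up variant names and counts).
empty_week = {
    "week": "2021-01-04",
    "total_sequences": 0,
    "cluster_counts": {"VariantA": 0, "VariantB": 0},
}
busy_week = {
    "week": "2021-01-11",
    "total_sequences": 12,
    "cluster_counts": {"VariantA": 12, "VariantB": 0},
}
cleaned_empty = drop_zero_clusters(empty_week)
cleaned_busy = drop_zero_clusters(busy_week)
```

In this sketch the empty week keeps its `week` and `total_sequences` fields but ends up with no cluster entries, while the busy week keeps only `VariantA`.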

Let's see if this works.

If it does, and helps - we should also remove zeros from Per Variant.

This will be more complicated because we use 'real' and 'smoothed' values - we only want to remove when both are zero I think. Perhaps this should be part of the bigger project to move to using real-values for the Per Variant dataset.
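As a rough illustration of the "both are zero" rule, assuming the Per Variant data can be viewed as parallel lists of weeks, real values, and smoothed values (a hypothetical layout, not necessarily how the scripts store it):

```python
def keep_nonzero_points(weeks, real, smoothed):
    """Keep a point unless BOTH its real and smoothed values are zero."""
    return [
        {"week": w, "real": r, "smoothed": s}
        for w, r, s in zip(weeks, real, smoothed)
        if r != 0 or s != 0
    ]


# Illustrative data only: the middle week has a zero real value but a
# non-zero smoothed value, so it must be kept.
points = keep_nonzero_points(
    ["week1", "week2", "week3"],
    [0, 0, 5],
    [0.0, 1.5, 4.0],
)
kept_weeks = [p["week"] for p in points]
```

Here only `week1` is dropped, because `week2` still has a non-zero smoothed value.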

vercel[bot] commented 2 years ago

This pull request is being automatically deployed with Vercel (learn more).
To see the status of your deployment, click below or on the icon next to each commit.

🔍 Inspect: https://vercel.com/hodcroftlab/covariants/H5wVKt2mkgSaXvwa9s643NeC6PKZ
✅ Preview: https://covariants-git-percountrynozeros-hodcroftlab.vercel.app

emmahodcroft commented 2 years ago

Also @ivan-aksamentov - this was my best go at making this happen in the code, but it may not be graceful Python... if you see a better way, please feel free to improve the code!

emmahodcroft commented 2 years ago

This seems to not only work - but also solves the Tooltip issue 🙃

[image: removezeros]

ivan-aksamentov commented 2 years ago

@emmahodcroft I see that this is patching up the objects on output. This is basically just to prettify the files and make them smaller.

I was hoping that instead of just making the output "nice" we could dive into what's actually happening: why some values are missing and others are 0? If these two situations are equivalent, then why do the Python scripts producing this data choose one over the other, and under what conditions? That is, I was hoping for a more fundamental fix.

If we stick to just patching though, for full effect, it would be nice to also filter out the weeks which end up being empty.
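A sketch of that extra filtering step, under the assumption that each week's entry is a dict whose cluster entries live in a hypothetical `cluster_counts` mapping that has already had its zeros removed:

```python
def drop_empty_weeks(week_entries):
    """Drop weeks whose cluster mapping is empty after zero removal."""
    return [entry for entry in week_entries if entry["cluster_counts"]]


# Illustrative data only.
weeks = [
    {"week": "2021-01-04", "total_sequences": 0, "cluster_counts": {}},
    {"week": "2021-01-11", "total_sequences": 12, "cluster_counts": {"VariantA": 12}},
]
non_empty = drop_empty_weeks(weeks)
```

Note that this changes the output more than the PR currently does, since the PR deliberately keeps `week` and `total_sequences` for weeks without sequences.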

emmahodcroft commented 2 years ago

I spoke too soon - this is not working. No idea why. [image]

> why some values are missing and others are 0?

I think this may be something specific to the case-data PR. I think for the Per Country and Per Variant data, there aren't missing values if the "date" exists (ex: every week that exists will have a count for every variant) - but I haven't checked this exhaustively.

emmahodcroft commented 2 years ago

My best guess is that without having zero values the line doesn't know what to draw in between. Compare when all variants go to zero at the two data points on the edges of the grey bit:

Live: [image]

This PR: [image]

elysiumplain commented 2 years ago

Just a guess - but looks like a "leading and trailing edge" problem. Removing the zero-points leaves the trailing edge in draw() with no edge smoothing for the next date. Likewise, removing a leading zero leaves you with an abrupt spire.

A test case could be a single data point within the timespan in the new PR graph.

Possible algorithm may be in-line check (double pointer solution) prior to trimming.
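One possible reading of that check, as a sketch only: keep every non-zero point, plus any zero that sits directly next to a non-zero point, so the line still has a point to fall to on each edge. The function name and the flat list of counts are assumptions for illustration; this looks at each point's previous and next neighbour rather than being a literal two-pointer implementation.

```python
def indices_to_keep(values):
    """Return indices of non-zero points and of zeros adjacent to them."""
    keep = []
    last = len(values) - 1
    for i, value in enumerate(values):
        prev_nonzero = i > 0 and values[i - 1] != 0
        next_nonzero = i < last and values[i + 1] != 0
        if value != 0 or prev_nonzero or next_nonzero:
            keep.append(i)
    return keep


# Illustrative weekly counts for one variant.
counts = [0, 0, 0, 3, 7, 0, 0, 0, 2, 0, 0]
kept = indices_to_keep(counts)
trimmed = [counts[i] for i in kept]
```

With these counts, the runs of zeros are reduced to a single zero on each side of a non-zero stretch. The lone value `2` keeps a zero before and after it, which covers the single-data-point test case.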

*I do agree with @ivan-aksamentov that it may be better down the road to manage this logic at a dataclass layer.

emmahodcroft commented 2 years ago

Hi @elysiumplain - thank you for this! I will perhaps see if I can adjust so that zeros on 'either end' remain, and see if that solves the problem.

I looked into the underlying issue for Per Country and Cases (outside of Slack) and it seems like if a variant is never present in a country, then it will never have an entry in the base file (not even one full of zeros). If it's present at least once, it'll be present but may have some zero entries.

ivan-aksamentov commented 2 years ago

In the latest commit I disabled plot interpolation, so that it's easier to see what's happening, in particular the white holes.

https://covariants-sm8fwfd7w-hodcroftlab.vercel.app/

emmahodcroft commented 2 years ago

I think this is best resolved with #319, since in this PR the mouseover doesn't show true zero values. That doesn't address the file-size issue (if it is still potentially an issue), but we should most likely open a separate issue for that if it's a concern.