hodcroftlab / covariants

Real-time updates and information about key SARS-CoV-2 variants, plus the scripts that generate this information.
https://covariants.org/
GNU Affero General Public License v3.0

Move data out of git repository #301

Open ivan-aksamentov opened 2 years ago

ivan-aksamentov commented 2 years ago

The git repo has grown beyond any reasonable size because of the large amounts of data committed to it over the months:

$ git gc
$ du -bsch .git
1.4G    .git
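To see which files account for most of that size, one option is to list the largest blobs in the history. Below is a minimal sketch, not the method used to produce the numbers in this issue. It relies on the standard `git rev-list --objects --all` and `git cat-file --batch-check` commands; the function names `parse_batch_check` and `largest_blobs` are hypothetical.

```python
import subprocess

# Format understood by `git cat-file --batch-check`; %(rest) echoes the path
# that `git rev-list --objects` prints after each object id.
BATCH_FORMAT = "%(objecttype) %(objectname) %(objectsize) %(rest)"


def parse_batch_check(output: str) -> list[tuple[int, str]]:
    """Return (size_in_bytes, path) for every blob line, largest first."""
    blobs = []
    for line in output.splitlines():
        # Split into at most 4 fields so that paths containing spaces survive.
        parts = line.split(" ", 3)
        if len(parts) < 4 or parts[0] != "blob":
            continue  # skip commits, trees and malformed lines
        blobs.append((int(parts[2]), parts[3]))
    return sorted(blobs, reverse=True)


def largest_blobs(repo: str = ".", top: int = 10) -> list[tuple[int, str]]:
    """List the `top` largest blobs reachable from any ref in `repo`."""
    objects = subprocess.run(
        ["git", "-C", repo, "rev-list", "--objects", "--all"],
        check=True, capture_output=True, text=True,
    ).stdout
    sizes = subprocess.run(
        ["git", "-C", repo, "cat-file", f"--batch-check={BATCH_FORMAT}"],
        input=objects, check=True, capture_output=True, text=True,
    ).stdout
    return parse_batch_check(sizes)[:top]


# Example (run inside a clone):
#   for size, path in largest_blobs("."):
#       print(f"{size / 1e6:8.1f} MB  {path}")
```

The sizes reported this way are uncompressed object sizes, so they will not add up exactly to the `du` figure for `.git`, but the ranking is usually enough to spot which data files dominate.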

The worst offenders are

As a result:

This is not the intended use of git; it was not designed for this. At this rate, continuing this way is unsustainable.

I propose to move data away from the git repo, and to only store the code there.

Both the final web data and the intermediate data can be uploaded to another service, e.g. AWS S3 and/or GitHub Releases, and then fetched from there.
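As a rough illustration of the "fetched from there" step, here is a minimal sketch of a download helper that verifies a SHA-256 checksum before accepting a file. It assumes that the repo would keep only a small pinned checksum per data file while the data itself lives elsewhere; that is an assumption, not something decided in this issue. The names `fetch_data` and `sha256_of`, and any URL passed in, are hypothetical.

```python
import hashlib
import shutil
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch_data(url: str, dest: Path, expected_sha256: str) -> Path:
    """Download `url` to `dest` unless a verified copy is already there."""
    if dest.exists() and sha256_of(dest) == expected_sha256:
        return dest  # cached copy matches the pinned checksum

    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)

    actual = sha256_of(tmp)
    if actual != expected_sha256:
        tmp.unlink()  # never leave an unverified file behind
        raise ValueError(f"checksum mismatch for {url}: got {actual}")

    tmp.replace(dest)  # move into place only after verification
    return dest
```

Pinning a checksum in the code repo is one way to keep results reproducible: a given commit always resolves to the same data, even though the data is no longer stored in git.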

Some of the disadvantages and difficulties of this approach:

We need to figure out an optimal workflow, so that scientific activities are not disrupted and correctness is fully preserved. Let's discuss this internally on Slack.

These measures will slow down the growth, but they will not make the git repo smaller. So we may additionally consider forcefully pruning the old data from the git history or, as a radical measure, starting over with a new git repo. This would improve the dev experience.

ivan-aksamentov commented 2 years ago

An important point is that this, along with Split web data into chunks #303, will break any external usage of the web data (i.e. people on the internet using our JSON files).

Although we never supported this use case and do not know most of the downstream users, CoVariants has become an important source of information related to public health, so we need to make this transition graceful by: