mzelst / covid-19

https://doi.org/10.5281/zenodo.5163263
Creative Commons Zero v1.0 Universal

Compress original datasets to reduce repository size #16

Closed DavyLandman closed 3 years ago

DavyLandman commented 4 years ago

Currently the repository grows quite a bit per day; a fresh clone now takes 600MB. Luckily git already de-duplicates chunks of the data files internally, otherwise it would be a bit more than 2GB.

Most of the storage goes to data-rivm. Compressing the files is an easy way to reduce the size of the commits and of the git repository.

R (and other programs) can read compressed CSVs automatically. If you, for example, compress the files with gzip (the stream compressor: gzip -9 -k COVID-19_casus_landelijk_2020-10-20.csv, not the zip application), the casus file shrinks from 26MB to 884KB. (xz, aka lzma2/7zip, would take it down to 761KB at the cost of slower compression and decompression):

$ ll -h  COVID-19_casus_landelijk_2020-10-20.csv*
-rw-r--r-- 1 Davy   26M Oct 21 14:08 COVID-19_casus_landelijk_2020-10-20.csv
-rw-r--r-- 1 Davy  884K Oct 21 14:08 COVID-19_casus_landelijk_2020-10-20.csv.gz

In R, if you open a .csv.gz file, it is automatically decompressed in memory before the CSV is parsed.
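
For illustration, a minimal sketch of what that looks like in a script (the path is just the file name from the example above; in the repo it would sit under data-rivm):

# read.csv (and readr::read_csv) transparently decompress .gz input,
# so existing scripts only need the file extension changed.
casus <- read.csv("COVID-19_casus_landelijk_2020-10-20.csv.gz")
head(casus)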

edwinveldhuizen commented 4 years ago

Another downside is that data is no longer readable on GitHub: https://github.com/mzelst/covid-19/blob/b1c27680a6a648439bd00b58b043bda0f0948112/data/municipality-today-detailed.csv.gz

@mzelst maybe something to consider only for the archive? @DavyLandman feel free to create a pull request ;)

DavyLandman commented 4 years ago

True. I had a longer discussion with @mzelst on Discord about long-term archiving; maybe something like Zenodo is a good option, since you'll slowly run into the limits of GitHub/git: slow clones and friends.

I'm not doing a PR unless I know it's useful and a high enough priority to be worth the time spent.

ghost commented 4 years ago

@edwinveldhuizen The larger files like COVID-19_casus_landelijk.csv aren't readable on the GitHub site anyway: (Sorry about that, but we can’t show files that are this big right now.)

Also I'm not sure whether decompressing the datasets would actually slow down the scripts: it's a lot faster to read 1MB instead of 30MB.

DavyLandman commented 4 years ago

> Also I'm not sure whether decompressing the datasets would actually slow down the scripts: it's a lot faster to read 1MB instead of 30MB.

True, although that only starts to matter once you get to 100MB+ data sets (at which point you might also want to switch to fread from data.table).
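
For reference, a rough sketch of that route (assuming the gzipped file from the example above; fread needs the R.utils package installed to read .gz input directly):

library(data.table)

# fread reads the gzipped CSV directly (via R.utils) and is much faster
# than read.csv once files reach the 100MB+ range.
casus <- fread("COVID-19_casus_landelijk_2020-10-20.csv.gz")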

mzelst commented 3 years ago

This suggestion has now been implemented.