Closed DavyLandman closed 3 years ago
Another downside is that data is no longer readable on GitHub: https://github.com/mzelst/covid-19/blob/b1c27680a6a648439bd00b58b043bda0f0948112/data/municipality-today-detailed.csv.gz
@mzelst maybe something to consider only for the archive? @DavyLandman feel free to create a pull request ;)
True, I had a longer discussion on Discord with @mzelst about long-term archiving; maybe something like Zenodo is a good way. Slowly you'll run into the limits of GitHub/git: slow clones and friends.
I'm not doing a PR unless I know it's useful and enough of a priority to be worth the time spent.
@edwinveldhuizen The larger files like COVID-19_casus_landelijk.csv aren't readable on the GitHub site anyway: (Sorry about that, but we can’t show files that are this big right now.)
Also I'm not sure whether decompressing the datasets would actually slow down the scripts: it's a lot faster to read 1MB instead of 30MB.
True, although that starts to play a role when you get 100MB+ data sets (at which point you might also want to switch to fread from data.table).
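To make the read-cost trade-off above a bit more concrete, here is a minimal shell sketch that streams a gzip-compressed CSV straight into a consumer, so only the compressed bytes are read from disk and the uncompressed file never has to exist. The file name sample.csv.gz and its contents are made up for illustration; they are not one of the repository's datasets.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real checkout is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical stand-in data, written directly into a gzip stream.
awk 'BEGIN {
  print "date,municipality,cases"
  for (i = 1; i <= 5000; i++)
    printf "2020-10-20,Municipality %d,%d\n", i % 355, i % 50
}' | gzip -9 > sample.csv.gz

# Decompress to stdout and count lines; nothing is written back to disk.
rows=$(gzip -dc sample.csv.gz | wc -l)
rows=$((rows))  # strip padding that some wc implementations add
echo "rows including header: $rows"

# Aggregate the third column from the same compressed file.
total=$(gzip -dc sample.csv.gz | awk -F, 'NR > 1 { sum += $3 } END { print sum }')
echo "total cases: $total"
```

Whether this ends up faster or slower than reading the plain file depends on disk speed versus CPU, so it is worth timing on the real data before deciding.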
This suggestion has now been implemented.
Currently the repository grows quite a bit per day; a fresh clone took 600MB. Luckily git already does internal de-duplication of chunks of the data files, or else it would have been a bit more than 2GB.
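For anyone who wants to see how much space git's packing saves, a rough sketch: commit two nearly identical snapshots to a throwaway repository, repack, and look at the packed size. The repository, file names, commit identity, and data are all hypothetical; git gc, git count-objects, and du are the only real tools involved.

```shell
#!/bin/sh
set -eu

# Throwaway repository; nothing here touches an existing clone.
workdir=$(mktemp -d)
cd "$workdir"
git init -q demo
cd demo

# Hypothetical identity, set locally so the commit works anywhere.
git config user.name "Example User"
git config user.email "example@example.invalid"

# Two daily snapshots that differ by a single appended row.
awk 'BEGIN {
  print "date,municipality,cases"
  for (i = 1; i <= 20000; i++)
    printf "2020-10-20,Municipality %d,%d\n", i % 355, i % 50
}' > snapshot-day1.csv
cp snapshot-day1.csv snapshot-day2.csv
echo "2020-10-21,Municipality 1,7" >> snapshot-day2.csv

git add snapshot-day1.csv snapshot-day2.csv
git commit -q -m "Add two daily snapshots"

# Repack so that similar blobs can be stored as deltas of each other.
git gc -q

# Report object counts and pack size, then the raw working-tree size.
packed=$(git count-objects -vH)
echo "$packed"
du -sh snapshot-day1.csv snapshot-day2.csv .git
```

Comparing the size-pack line with the size of the two CSV files gives a feel for the saving on this toy data; the real repository will behave differently.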
Most of the data usage goes into data-rivm. Compression is an easy trick to reduce the file size of commits and of the git repo. R (and other programs) have automatic support for reading compressed CSVs. If, for example, you compress the files with gzip (the stream compressor: gzip -9 -k COVID-19_casus_landelijk_2020-10-20.csv, not the zip application), you reduce the casus file size from 26MB to 884KB. (xz, aka lzma2/7zip, would take it down to 761KB at the cost of slower code.)
In R, if you open a csv.gz file, it will automatically decompress the file in memory before parsing the csv.
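A minimal sketch of the compression step described above, run against a generated stand-in file rather than the real casus file (sample.csv and its contents are assumptions made for illustration). It uses the same gzip -9 -k invocation, prints both sizes, and checks that decompressing reproduces the original byte for byte.

```shell
#!/bin/sh
set -eu

# Work in a throwaway directory so no real data file is modified.
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical, highly repetitive CSV standing in for a daily case file.
awk 'BEGIN {
  print "date,municipality,cases"
  for (i = 1; i <= 20000; i++)
    printf "2020-10-20,Municipality %d,%d\n", i % 355, i % 50
}' > sample.csv

# -9 selects the strongest compression level; -k keeps the original file.
gzip -9 -k sample.csv

# Compare sizes in bytes.
orig_size=$(($(wc -c < sample.csv)))
gz_size=$(($(wc -c < sample.csv.gz)))
echo "original: $orig_size bytes, gzip -9: $gz_size bytes"

# Decompress to stdout and compare against the original.
if gzip -dc sample.csv.gz | cmp -s - sample.csv; then
  echo "round trip OK"
else
  echo "round trip mismatch" >&2
  exit 1
fi
```

The ratio on this synthetic file says nothing about the 26MB to 884KB figure quoted above; it only shows the mechanics.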