Closed: loleg closed this issue 4 years ago
Thanks, @loleg, for the input. We do indeed need to groom the repo. However, now that the resources in the repo are infrastructural, we must not do that hastily or without advance notice.
Please come up with proposals, thanks!
I think that for a dataset to be useful, it has to be used by people. This isn't as much of a tautology as it appears at first.
To use a dataset you need to do a lot of work to understand it: what the data is, what it is not, how it is structured, the file format(s), how to combine or split it, its limitations, and a lot more. Some of these steps are technical (like loading it) and some are not. To help with the technical hurdles, I think it is good to have documentation and examples. Documentation means prose, schemas, etc.; examples tend to be code snippets. They have to be fairly short and simple, not full "ready for consumers" projects. The goal should be to cover the steps that everyone has to do if they want to use the data, leaving out the parts that only some users might want. Finding this balance is tricky.
This is what I tried to do with the notebook I added. It loads the data and handles missing values, then computes a number and makes a plot. The idea was to provide something others could take and build "their thing" on top of (or translate to their project). The insight gained from reading the example is how you could use the data. It is written in Python and uses pandas because those are very popular and widely used tools; the goal was to maximise the chances of people being able to read and understand it.
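A minimal sketch of the kind of starter snippet described above, using pandas. The column names (`date`, `abbreviation_canton_and_fl`, `ncumul_conf`) are assumptions based on the per-canton CSV layout, and the inline sample data is invented for illustration; a real notebook would read the repo's CSV files instead:

```python
import io

import pandas as pd

# Invented sample in the assumed shape of the per-canton CSVs.
# The empty ZH value stands in for a day a canton did not report.
sample_csv = """date,abbreviation_canton_and_fl,ncumul_conf
2020-03-30,ZH,1900
2020-03-31,ZH,
2020-04-01,ZH,2148
2020-03-31,GE,2500
2020-04-01,GE,2702
"""

df = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])

# Counts are cumulative, so forward-fill gaps within each canton.
df["ncumul_conf"] = df.groupby("abbreviation_canton_and_fl")["ncumul_conf"].ffill()

# One summary number: the latest confirmed cases per canton, and the total.
latest = df.sort_values("date").groupby("abbreviation_canton_and_fl").last()
total = latest["ncumul_conf"].sum()
print(latest["ncumul_conf"])
print("total:", total)
```

From here, a plot is one more line (e.g. `latest["ncumul_conf"].plot.bar()`), which is roughly the shape of the notebook: load, clean, compute, visualise.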
It would be super cool to have such "starter kits" in lots of languages/tools. It would increase the number of people who use the data and hence make the dataset more useful. However, it isn't free to create and maintain examples: an example in R is probably worth it, one in C++ maybe not so much. Which languages/tools to pick and which not is probably a judgement call.
I think it would be great to expand the list of "consumer grade" uses of the data already featured in the README. The projects listed there do far more than just load the data and help you get started; they are what most people will want to look at.
I think it is good to keep this repo focused on data collection, but I agree that a few code snippets showing how to work with the data might be useful. If you want to use the R code I wrote in #264 to load the data, feel free to do so.
Data "only". I think people and reviewers are already pretty busy with the data itself; we can't push them to be proficient with every analysis tool and language too.
I very much appreciate the Jupyter notebook and R code (they are really nice!), but even those should probably go to another repo. Even my bash scripts that do some rudimentary analyses should go somewhere else. Only validation scripts should remain, IMHO.
If we focus a bit more on the data schema, we could probably make that happen.
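A validation script of the kind that could remain in the repo might amount to just a few checks on a data frame. This is a sketch under assumed column names and assumed rules (required columns present, no negative counts, cumulative counts never decreasing within a canton), not the repo's actual schema:

```python
import io

import pandas as pd

# Assumed required columns; the real schema may differ.
REQUIRED_COLUMNS = {"date", "abbreviation_canton_and_fl", "ncumul_conf"}


def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passes."""
    problems = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append("missing columns: %s" % sorted(missing))
        return problems
    if df["ncumul_conf"].dropna().lt(0).any():
        problems.append("negative case counts")
    # Cumulative counts must not decrease within a canton.
    decreasing = (
        df.sort_values("date")
        .groupby("abbreviation_canton_and_fl")["ncumul_conf"]
        .apply(lambda s: s.dropna().diff().lt(0).any())
    )
    if decreasing.any():
        problems.append(
            "non-monotonic cantons: %s" % sorted(decreasing[decreasing].index)
        )
    return problems


good = pd.read_csv(io.StringIO(
    "date,abbreviation_canton_and_fl,ncumul_conf\n"
    "2020-04-01,ZH,2148\n"
    "2020-04-02,ZH,2200\n"
), parse_dates=["date"])
bad = good.copy()
bad.loc[1, "ncumul_conf"] = 2000  # cumulative count drops: should be flagged

print(validate(good))  # no problems
print(validate(bad))   # flags ZH as non-monotonic
```

Such a script could run in CI on every change to the CSVs, keeping the repo focused on data quality rather than analysis.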
As of 2.4.2020, the BAG reports more cases (18,267) than the sum of the cantons (17,976 at 13:20). The list of differences looks quite strange:
canton | case diff | day delay |
---|---|---|
VD | -361 | 2 |
BS | -88 | 0 |
BE | -77 | 1 |
TI | -61 | 1 |
AG | -39 | 1 |
BL | -19 | 1 |
FR | -6 | 0 |
NW | -5 | 1 |
FL | -1 | 1 |
JU | -1 | 1 |
GL | 0 | 1 |
SH | 0 | 0 |
VS | 2 | 1 |
GR | 5 | 0 |
SG | 5 | 0 |
UR | 5 | 0 |
AI | 6 | 0 |
AR | 6 | 0 |
OW | 7 | 1 |
NE | 13 | 2 |
SZ | 15 | 0 |
LU | 18 | 0 |
SO | 19 | 0 |
TG | 22 | 0 |
ZG | 38 | 0 |
ZH | 124 | 1 |
GE | 248 | 1 |
This means ZH and GE report more cases than are included in the BAG report, even though their data is one day older. @jb3-2 On the other hand, BS seems to report fewer cases but has the same reporting date.
@zukunft Thanks for this analysis! Could you add a link to the BAG table you used to calculate these differences? I'll have a look to find out why the BS data are so different; maybe we'll find a pattern here.
I simply used https://www.bag.admin.ch/dam/bag/de/dokumente/mt/k-und-i/aktuelle-ausbrueche-pandemien/2019-nCoV/covid-19-datengrundlage-lagebericht.xlsx.download.xlsx/200325_Datengrundlage_Grafiken_COVID-19-Bericht.xlsx and took a snapshot of the data from https://rsalzer.github.io/COVID_19_CH/ at around 13:30:
canton | date | cases |
---|---|---|
AG | 2020-04-01 | 549 |
AI | 2020-04-02 | 19 |
AR | 2020-04-02 | 63 |
BE | 2020-04-01 | 909 |
BL | 2020-04-01 | 588 |
BS | 2020-04-02 | 718 |
FL | 2020-04-01 | 72 |
FR | 2020-04-02 | 550 |
GE | 2020-04-01 | 2702 |
GL | 2020-04-01 | 56 |
GR | 2020-04-02 | 569 |
JU | 2020-04-01 | 144 |
LU | 2020-04-02 | 422 |
NE | 2020-03-31 | 366 |
NW | 2020-04-01 | 70 |
OW | 2020-04-01 | 48 |
SG | 2020-04-02 | 455 |
SH | 2020-04-02 | 44 |
SO | 2020-04-02 | 227 |
SZ | 2020-04-02 | 155 |
TG | 2020-04-02 | 179 |
TI | 2020-04-01 | 2195 |
UR | 2020-04-02 | 59 |
VD | 2020-03-31 | 3465 |
VS | 2020-04-01 | 1145 |
ZG | 2020-04-02 | 131 |
ZH | 2020-04-01 | 2148 |
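One way to reproduce the difference table above is to merge a snapshot of the canton data with the per-canton figures from the BAG spreadsheet. A sketch in pandas; the `bag_cases` values below are back-computed from the diff table purely for illustration (in practice they would come from `pd.read_excel` on the BAG download linked above):

```python
import pandas as pd

# A few rows from the canton snapshot above.
cantons = pd.DataFrame({
    "canton": ["ZH", "GE", "VD"],
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-03-31"]),
    "cases": [2148, 2702, 3465],
})

# Illustrative per-canton BAG figures (back-computed from the diff table,
# not read from the real spreadsheet).
bag = pd.DataFrame({
    "canton": ["ZH", "GE", "VD"],
    "bag_cases": [2024, 2454, 3826],
})

report_date = pd.Timestamp("2020-04-02")
diff = cantons.merge(bag, on="canton")
diff["case_diff"] = diff["cases"] - diff["bag_cases"]           # canton minus BAG
diff["day_delay"] = (report_date - diff["date"]).dt.days        # age of canton data
print(diff.sort_values("case_diff")[["canton", "case_diff", "day_delay"]])
```

Positive `case_diff` means the canton reports more cases than the BAG figure, matching the reading of the table (ZH +124, GE +248, VD -361).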
Closing this, as we all seem to agree that we should focus on the data.
A couple of code snippets have been added to provide "simple" exploration of the data. Depending on one's programming skills, this may not be simple at all. And why stop at Python and R? Why not write examples in 773 languages? My point is: let's please keep this repository very focused on data collection, scraping and aggregation, schema and structure.
There is already a section in the README, and we could create a `covid_19_showcase` repo linking to the work by @betatim and @ivanek and all the others working with this data.