openZH / covid_19

COVID-19 case numbers of the cantons of Switzerland and the Principality of Liechtenstein (FL). The data is updated at best once a day (times of collection and update may vary). Start with the README.
https://www.zh.ch/de/gesundheit/coronavirus/zahlen-fakten-covid-19.zhweb-noredirect.zhweb-cache.html?keywords=covid19&keyword=covid19#/
Creative Commons Attribution 4.0 International

Keep the focus on the data #268

Closed · loleg closed this issue 4 years ago

loleg commented 4 years ago

There have been a couple of code snippets added to provide "simple" exploration of the data. Depending on one's programming skill, this may not be simple at all. And why stop at Python and R? Why not write examples in 773 languages? My point is: let's please keep this repository very focused on data collection, scraping and aggregation, schema and structure.

There is already a section in the README, and we could create a covid_19_showcase repo linking to the work by @betatim, @ivanek, and all the others working with this data.

andreasamsler commented 4 years ago

thanks, @loleg, for the input. we do need to groom the repo. however, now that the resources in the repo are infrastructural, we must not do this hastily or without advance notice.

Please come up with proposals, thanks!

betatim commented 4 years ago

I think for a dataset to be useful it has to be used by people. This isn't as much of a tautology as it appears at first.

To use a dataset you need to do a lot of work to understand it: what the data is, what it is not, how it is structured, the file format(s), how to combine or split it, its limitations, and a lot more. Some of these steps are technical (like loading it) and some are not. To help with the technical hurdles, I think it is good to have documentation and examples. Documentation means prose, schemas, etc. Examples tend to be code snippets. They have to be fairly short and simple, not full "ready for consumers" projects. The goal should be to cover the steps that everyone has to do to use the data, leaving out the parts that only some users might want. Finding this balance is tricky.

This is what I tried to do with the notebook I added. It loads the data and handles missing values. It then goes on to compute a number and make a plot. The idea was to provide something others could take and build "their thing" on top of (or translate to their project). The insight gained from reading the example is how you could use the data. It is written in Python and uses pandas because those are very popular and widely used tools, which maximises the chances of people being able to read and understand it.
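
A minimal sketch of that pattern, assuming the aggregate CSV and its column names (`COVID19_Fallzahlen_CH_total_v2.csv`, `abbreviation_canton_and_fl`, `ncumul_conf`) are as in this repo; treat those names as assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Aggregate per-canton time series published in this repo; the file and
# column names below are assumptions about the current layout.
URL = ("https://raw.githubusercontent.com/openZH/covid_19/master/"
       "COVID19_Fallzahlen_CH_total_v2.csv")

df = pd.read_csv(URL, parse_dates=["date"])

# Not every canton reports every value every day, so the cumulative
# columns have gaps; forward-fill within each canton.
df = df.sort_values(["abbreviation_canton_and_fl", "date"])
df["ncumul_conf"] = df.groupby("abbreviation_canton_and_fl")["ncumul_conf"].ffill()

# Compute one number (latest confirmed cases per canton) ...
latest = df.groupby("abbreviation_canton_and_fl")["ncumul_conf"].last()

# ... and make one plot others can build on.
latest.sort_values().plot.barh(figsize=(6, 8), title="Confirmed cases per canton/FL")
plt.tight_layout()
plt.show()
```

pandas and matplotlib are the only dependencies, which keeps the barrier to entry low.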

It would be super cool to have such "starter kits" in lots of languages/tools. It would increase the number of people who use the data and hence make the dataset more useful. However, examples aren't free to create and maintain. So an example in R is probably worth it; one in C++ maybe not so much. Which languages/tools to pick and which not is probably a judgement call.

I think it would be great to expand the list of "consumer grade" uses of the data which is featured in the README already. The projects listed there do far more than just load the data and help you get started. They are what most people will want to look at.

ivanek commented 4 years ago

I think it is good to keep this repo focused on data collection, but I agree that a few code snippets showing how to work with this data might be useful. In case you want to use the R code I wrote in #264 to load the data, feel free to do so.

baryluk commented 4 years ago

Data "only". I think people, and reviewers are already pretty busy with the data itself. Can't push them to be proficient with all analysis tools and languages too.

I very much appreciate the Jupyter notebook and the R code (they are really nice!), but even those should probably go to another repo. Even my bash scripts that do some rudimentary analyses should go somewhere else. Only validation scripts should remain, IMHO.

If we focus on the data schema a bit more, we could probably make it happen; see the sketch below.
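
As an illustration of the kind of validation that could stay, a minimal sketch (the file path and column names below are assumptions, not the repo's authoritative schema):

```python
import pandas as pd

# Columns every canton file is expected to provide; an assumption for
# illustration, not the repo's actual schema.
REQUIRED = {"date", "abbreviation_canton_and_fl", "ncumul_conf"}

def validate(path):
    df = pd.read_csv(path, parse_dates=["date"])

    missing = REQUIRED - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # Cumulative counts must never decrease within a canton.
    for canton, grp in df.sort_values("date").groupby("abbreviation_canton_and_fl"):
        conf = grp["ncumul_conf"].dropna()
        assert conf.is_monotonic_increasing, f"{canton}: ncumul_conf decreases"

# e.g. validate("COVID19_Fallzahlen_Kanton_ZH_total.csv")  # hypothetical path
```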

zukunft commented 4 years ago

As of 2020-04-02, the BAG reports more cases (18,267) than the sum of the cantons (17,976 at 13:20). The list of differences looks quite strange:

| Canton | Case diff | Day delay |
| --- | ---: | ---: |
| VD | -361 | 2 |
| BS | -88 | 0 |
| BE | -77 | 1 |
| TI | -61 | 1 |
| AG | -39 | 1 |
| BL | -19 | 1 |
| FR | -6 | 0 |
| NW | -5 | 1 |
| FL | -1 | 1 |
| JU | -1 | 1 |
| GL | 0 | 1 |
| SH | 0 | 0 |
| VS | 2 | 1 |
| GR | 5 | 0 |
| SG | 5 | 0 |
| UR | 5 | 0 |
| AI | 6 | 0 |
| AR | 6 | 0 |
| OW | 7 | 1 |
| NE | 13 | 2 |
| SZ | 15 | 0 |
| LU | 18 | 0 |
| SO | 19 | 0 |
| TG | 22 | 0 |
| ZG | 38 | 0 |
| ZH | 124 | 1 |
| GE | 248 | 1 |

This means ZH and GE report more cases than are included in the BAG report, even though their data is one day older. @jb3-2 On the other hand, BS seems to report fewer cases despite having the same reporting date.
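
A sketch of how such a comparison can be computed; the three inputs below are back-calculated from the tables in this thread (canton cases minus case diff gives the implied BAG figure), and the column names are purely illustrative:

```python
import pandas as pd

# Hypothetical frames for three cantons; BAG figures implied by the
# tables above, column names for illustration only.
cantons = pd.DataFrame({
    "canton": ["ZH", "GE", "BS"],
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-04-02"]),
    "cases": [2148, 2702, 718],
})
bag = pd.DataFrame({
    "canton": ["ZH", "GE", "BS"],
    "bag_cases": [2024, 2454, 806],
})

merged = cantons.merge(bag, on="canton")
merged["case_diff"] = merged["cases"] - merged["bag_cases"]  # ZH: 124, GE: 248, BS: -88
merged["day_delay"] = (pd.Timestamp("2020-04-02") - merged["date"]).dt.days
print(merged.sort_values("case_diff"))
```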

jb3-2 commented 4 years ago

@zukunft Thanks for this analysis! Can you maybe add the link to the BAG table you used to calculate these differences? I'll have a look to find out why BS data are so different - maybe we'll find a pattern here.

zukunft commented 4 years ago

I simply used https://www.bag.admin.ch/dam/bag/de/dokumente/mt/k-und-i/aktuelle-ausbrueche-pandemien/2019-nCoV/covid-19-datengrundlage-lagebericht.xlsx.download.xlsx/200325_Datengrundlage_Grafiken_COVID-19-Bericht.xlsx and took a snapshot of the data from https://rsalzer.github.io/COVID_19_CH/ at around 13:30:

| Canton | Date | Cases |
| --- | --- | ---: |
| AG | 2020-04-01 | 549 |
| AI | 2020-04-02 | 19 |
| AR | 2020-04-02 | 63 |
| BE | 2020-04-01 | 909 |
| BL | 2020-04-01 | 588 |
| BS | 2020-04-02 | 718 |
| FL | 2020-04-01 | 72 |
| FR | 2020-04-02 | 550 |
| GE | 2020-04-01 | 2702 |
| GL | 2020-04-01 | 56 |
| GR | 2020-04-02 | 569 |
| JU | 2020-04-01 | 144 |
| LU | 2020-04-02 | 422 |
| NE | 2020-03-31 | 366 |
| NW | 2020-04-01 | 70 |
| OW | 2020-04-01 | 48 |
| SG | 2020-04-02 | 455 |
| SH | 2020-04-02 | 44 |
| SO | 2020-04-02 | 227 |
| SZ | 2020-04-02 | 155 |
| TG | 2020-04-02 | 179 |
| TI | 2020-04-01 | 2195 |
| UR | 2020-04-02 | 59 |
| VD | 2020-03-31 | 3465 |
| VS | 2020-04-01 | 1145 |
| ZG | 2020-04-02 | 131 |
| ZH | 2020-04-01 | 2148 |
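
If someone wants to reproduce this, the BAG spreadsheet can be read directly with pandas; the sheet layout is unknown here, so this sketch just inspects what the file contains:

```python
import pandas as pd

BAG_URL = ("https://www.bag.admin.ch/dam/bag/de/dokumente/mt/k-und-i/"
           "aktuelle-ausbrueche-pandemien/2019-nCoV/covid-19-datengrundlage-"
           "lagebericht.xlsx.download.xlsx/"
           "200325_Datengrundlage_Grafiken_COVID-19-Bericht.xlsx")

# sheet_name=None loads every sheet; print names and shapes to find
# the one holding the per-canton figures (layout not documented here).
sheets = pd.read_excel(BAG_URL, sheet_name=None)
for name, frame in sheets.items():
    print(name, frame.shape)
```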
metaodi commented 4 years ago

Closing this, as we all seem to agree that we should focus on the data.