Closed: loleg closed this issue 4 years ago
Thanks, @loleg, for the input. We do indeed need to groom the repo. However, now that the resources in the repo are infrastructural, we must not do that hastily or without advance notice.
Please come up with proposals, thanks!
I think that for a dataset to be useful, it has to be used by people. This isn't as much of a tautology as it appears at first.
To use a dataset you need to do a lot of work to understand it: what the data is, what it is not, how it is structured, the file format(s), how to combine or split it, its limitations, and a lot more. Some of these steps are technical (like loading it) and some are not. To help with the technical hurdles, I think it is good to have documentation and examples. Documentation means prose, schemas, etc.; examples tend to be code snippets. They have to be fairly short and simple, not full "ready for consumers" projects. The goal should be to cover the steps that everyone has to do if they want to use the data, leaving out the parts that only some users might want. Finding this balance is tricky.
This is what I tried to do with the notebook I added. It loads the data and handles missing values, then computes a number and makes a plot. The idea was to provide something others could take and build "their thing" on top of (or translate to their project). The insight gained from reading the example is how you could use the data. It is written in Python and uses pandas because those are very popular and widely used tools; the goal was to maximise the chances of people being able to read and understand it.
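A minimal sketch of the kind of starter snippet described above, using pandas. The column names (`date`, `abbreviation_canton_and_fl`, `ncumul_conf`) are assumptions based on the per-canton CSV layout, and the inline sample data is invented for illustration; a real notebook would read the repo's CSV files instead:

```python
import io

import pandas as pd

# Invented sample in the assumed shape of the per-canton CSVs.
# The empty ZH value stands in for a day a canton did not report.
sample_csv = """date,abbreviation_canton_and_fl,ncumul_conf
2020-03-30,ZH,1900
2020-03-31,ZH,
2020-04-01,ZH,2148
2020-03-31,GE,2500
2020-04-01,GE,2702
"""

df = pd.read_csv(io.StringIO(sample_csv), parse_dates=["date"])

# Counts are cumulative, so forward-fill gaps within each canton.
df["ncumul_conf"] = df.groupby("abbreviation_canton_and_fl")["ncumul_conf"].ffill()

# One summary number: the latest confirmed cases per canton, and the total.
latest = df.sort_values("date").groupby("abbreviation_canton_and_fl").last()
total = latest["ncumul_conf"].sum()
print(latest["ncumul_conf"])
print("total:", total)
```

From here, a plot is one more line (e.g. `latest["ncumul_conf"].plot.bar()`), which is roughly the shape of the notebook: load, clean, compute, visualise.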
It would be super cool to have such "starter kits" in lots of languages/tools. It would increase the number of people who use the data and hence make the dataset more useful. However, it isn't free to create and maintain examples: an example in R is probably worth it, one in C++ maybe not so much. Which languages/tools to pick and which not is probably a judgement call.
I think it would be great to expand the list of "consumer grade" uses of the data already featured in the README. The projects listed there do far more than just load the data and help you get started; they are what most people will want to look at.
I think it is good to keep this repo focused on data collection, but I agree that a few code snippets showing how to work with the data might be useful. If you want to use the R code I wrote in #264 to load the data, feel free to do so.
Data "only". I think people and reviewers are already pretty busy with the data itself; we can't push them to be proficient with every analysis tool and language too.
I very much appreciate the Jupyter notebook and R code (they are really nice!), but even those should probably go to another repo. Even my bash scripts that do some rudimentary analyses should go somewhere else. Only validation scripts should remain, IMHO.
If we focus a bit more on the data schema, we could probably make that happen.
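A validation script of the kind that could remain in the repo might amount to just a few checks on a data frame. This is a sketch under assumed column names and assumed rules (required columns present, no negative counts, cumulative counts never decreasing within a canton), not the repo's actual schema:

```python
import io

import pandas as pd

# Assumed required columns; the real schema may differ.
REQUIRED_COLUMNS = {"date", "abbreviation_canton_and_fl", "ncumul_conf"}


def validate(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; an empty list means the frame passes."""
    problems = []
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        problems.append("missing columns: %s" % sorted(missing))
        return problems
    if df["ncumul_conf"].dropna().lt(0).any():
        problems.append("negative case counts")
    # Cumulative counts must not decrease within a canton.
    decreasing = (
        df.sort_values("date")
        .groupby("abbreviation_canton_and_fl")["ncumul_conf"]
        .apply(lambda s: s.dropna().diff().lt(0).any())
    )
    if decreasing.any():
        problems.append(
            "non-monotonic cantons: %s" % sorted(decreasing[decreasing].index)
        )
    return problems


good = pd.read_csv(io.StringIO(
    "date,abbreviation_canton_and_fl,ncumul_conf\n"
    "2020-04-01,ZH,2148\n"
    "2020-04-02,ZH,2200\n"
), parse_dates=["date"])
bad = good.copy()
bad.loc[1, "ncumul_conf"] = 2000  # cumulative count drops: should be flagged

print(validate(good))  # no problems
print(validate(bad))   # flags ZH as non-monotonic
```

Such a script could run in CI on every change to the CSVs, keeping the repo focused on data quality rather than analysis.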
As of 2.4.2020, the BAG reports more cases (18,267) than the sum of the cantons (17,976 at 13:20). The list of differences looks quite strange:
canton | case diff | day delay |
---|---|---|
VD | -361 | 2 |
BS | -88 | 0 |
BE | -77 | 1 |
TI | -61 | 1 |
AG | -39 | 1 |
BL | -19 | 1 |
FR | -6 | 0 |
NW | -5 | 1 |
FL | -1 | 1 |
JU | -1 | 1 |
GL | 0 | 1 |
SH | 0 | 0 |
VS | 2 | 1 |
GR | 5 | 0 |
SG | 5 | 0 |
UR | 5 | 0 |
AI | 6 | 0 |
AR | 6 | 0 |
OW | 7 | 1 |
NE | 13 | 2 |
SZ | 15 | 0 |
LU | 18 | 0 |
SO | 19 | 0 |
TG | 22 | 0 |
ZG | 38 | 0 |
ZH | 124 | 1 |
GE | 248 | 1 |
This means ZH and GE report more cases than are included in the BAG report, even though their data is one day older. @jb3-2 On the other hand, BS seems to report fewer cases but has the same reporting date.
@zukunft Thanks for this analysis! Could you add a link to the BAG table you used to calculate these differences? I'll have a look to find out why the BS data are so different; maybe we'll find a pattern here.
I simply used https://www.bag.admin.ch/dam/bag/de/dokumente/mt/k-und-i/aktuelle-ausbrueche-pandemien/2019-nCoV/covid-19-datengrundlage-lagebericht.xlsx.download.xlsx/200325_Datengrundlage_Grafiken_COVID-19-Bericht.xlsx and took a snapshot of the data from https://rsalzer.github.io/COVID_19_CH/ at around 13:30:
canton | date | cases |
---|---|---|
AG | 2020-04-01 | 549 |
AI | 2020-04-02 | 19 |
AR | 2020-04-02 | 63 |
BE | 2020-04-01 | 909 |
BL | 2020-04-01 | 588 |
BS | 2020-04-02 | 718 |
FL | 2020-04-01 | 72 |
FR | 2020-04-02 | 550 |
GE | 2020-04-01 | 2702 |
GL | 2020-04-01 | 56 |
GR | 2020-04-02 | 569 |
JU | 2020-04-01 | 144 |
LU | 2020-04-02 | 422 |
NE | 2020-03-31 | 366 |
NW | 2020-04-01 | 70 |
OW | 2020-04-01 | 48 |
SG | 2020-04-02 | 455 |
SH | 2020-04-02 | 44 |
SO | 2020-04-02 | 227 |
SZ | 2020-04-02 | 155 |
TG | 2020-04-02 | 179 |
TI | 2020-04-01 | 2195 |
UR | 2020-04-02 | 59 |
VD | 2020-03-31 | 3465 |
VS | 2020-04-01 | 1145 |
ZG | 2020-04-02 | 131 |
ZH | 2020-04-01 | 2148 |
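One way to reproduce the difference table above is to merge a snapshot of the canton data with the per-canton figures from the BAG spreadsheet. A sketch in pandas; the `bag_cases` values below are back-computed from the diff table purely for illustration (in practice they would come from `pd.read_excel` on the BAG download linked above):

```python
import pandas as pd

# A few rows from the canton snapshot above.
cantons = pd.DataFrame({
    "canton": ["ZH", "GE", "VD"],
    "date": pd.to_datetime(["2020-04-01", "2020-04-01", "2020-03-31"]),
    "cases": [2148, 2702, 3465],
})

# Illustrative per-canton BAG figures (back-computed from the diff table,
# not read from the real spreadsheet).
bag = pd.DataFrame({
    "canton": ["ZH", "GE", "VD"],
    "bag_cases": [2024, 2454, 3826],
})

report_date = pd.Timestamp("2020-04-02")
diff = cantons.merge(bag, on="canton")
diff["case_diff"] = diff["cases"] - diff["bag_cases"]           # canton minus BAG
diff["day_delay"] = (report_date - diff["date"]).dt.days        # age of canton data
print(diff.sort_values("case_diff")[["canton", "case_diff", "day_delay"]])
```

Positive `case_diff` means the canton reports more cases than the BAG figure, matching the reading of the table (ZH +124, GE +248, VD -361).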
Closing this, as we all seem to agree that we should focus on the data.
A couple of code snippets have been added to provide "simple" exploration of the data. Depending on one's programming skills, this may not be simple at all. And why stop at Python and R? Why not write examples in 773 languages? My point is: let's please keep this repository very focused on data collection, scraping and aggregation, schema and structure.
There is already a section in the README, and we could create a `covid_19_showcase` repo linking to the work by @betatim and @ivanek and all the others working with this data.