Closed: baryluk closed this issue 4 years ago.
The links don't work any more. I think they are tied dynamically to a particular viewing instance (on my computer), and so are now gone.
There is probably a way to get this data without a session, or to create a session programmatically.
There is some information about the HTTP REST API in Python here: https://github.com/tableau/server-client-python
And a tutorial here: https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_get_started_tutorial_part_1.htm
The HTTP REST API itself is also documented; for example, these are the methods we are probably most interested in: https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref_datasources.htm#download_data_source and https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref_workbooksviews.htm#get_view
It looks like this: `GET /api/api-version/sites/site-id/datasources/datasource-id/content`
or `GET /api/api-version/sites/site-id/views/view-id`
and/or `GET /api/api-version/sites/site-id/views/view-id/data`
(These are also implemented in the official Python library mentioned above.)
This needs more digging to figure out.
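As a rough sketch of the endpoints quoted above (the base URL, API version, site id, and resource ids below are placeholders, not real values; in practice a sign-in request has to supply them):

```python
# Minimal helpers that assemble the Tableau REST endpoints quoted above.
# All parameters are placeholders; real values come from sign-in responses.

def datasource_content_url(base, api_version, site_id, datasource_id):
    """Endpoint that returns the datasource content (Download Data Source)."""
    return f"{base}/api/{api_version}/sites/{site_id}/datasources/{datasource_id}/content"

def view_url(base, api_version, site_id, view_id):
    """Endpoint that returns metadata about a single view (Get View)."""
    return f"{base}/api/{api_version}/sites/{site_id}/views/{view_id}"

def view_data_url(base, api_version, site_id, view_id):
    """Endpoint that returns the data behind a view."""
    return view_url(base, api_version, site_id, view_id) + "/data"
```

With valid credentials, these URLs would then be fetched with an authenticated GET (the official Python library wraps exactly this).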
There is also a JavaScript client, the same one used for the visualisations themselves. It might be possible to execute some of it in Node.js, but it may have too many browser dependencies and not actually work in Node.js.
So far, by inspecting the HTML and JavaScript, the "path" parameter is shared/3WXJ8ZXN4
Hi @baryluk, thank you for working on this. I found the datasource as well some time ago and have started to feed it to Elasticsearch. So far I am downloading it manually via https://covid-19-schweiz.bagapps.ch/de-1.html. Have you found a way to automate the download yet? If it is not possible via the API, I could help by building a scraper based on synthetic monitoring.

In terms of the dataset, I am using the version with all columns and all lines. As you mentioned, the number of lines corresponds to the number of confirmed cases, which is as detailed as it can get. There is even a column called f1, which seems to contain the case number; that could simplify updating the data. The problem is that these numbers seem to somehow change over time. On each of the updates I made, more than 1000 numbers no longer existed in the new data, yet the number of lines in the new dataset was correct.

In addition, a lot of the data is redundant, i.e. exists in German, French, or as abbreviations. Right now I plan to implement some normalization on import, although creating a sanitized English version would be the better solution. It would also be good to store the data publicly, ideally in this repo, since it is used by a lot of people and solutions. If collaboration is welcome, please let me know.
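To quantify the churn described above, one could diff the f1 case numbers between two consecutive exports. A minimal sketch, assuming the column name f1 from the comment above; the file-handling details are an assumption:

```python
import csv

def case_ids(path, column="f1"):
    """Collect the set of case numbers from one CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column] for row in csv.DictReader(f)}

def disappeared(old_ids, new_ids):
    """Case numbers present in the old export but missing from the new one."""
    return old_ids - new_ids
```

Running `disappeared(case_ids("old.csv"), case_ids("new.csv"))` against two daily downloads would show which case numbers vanished between updates.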
Hi Bernhard,
Any input sanitization and cross-referencing to validate the data is welcome.
I didn't work on automating it yet, as I had a so-so week, but I do hope to work on it later.
I suggest starting to collect the data in your own repo on GitHub, if you have the means to do it.
Cheers, Witold
Hi Witold, I have just created my own repo. It contains one original file, a first draft of a converted file, and a Python script to do the conversion. Scraping will be done manually for the moment. Please feel free to have a look. Comments are highly welcome.
Regards Bernhard
Hi @baryluk, if you are interested, a Python/Selenium scraper is now available in my repo.
Cheers Bernhard
I just compared the numbers of deceased reported by the cantons and by the BAG:
Canton | Date | Canton number | BAG number | Diff | In percent |
---|---|---|---|---|---|
JU | 2020-04-27 | 7 | 1 | 6 | 600% |
SH | 2020-04-28 | 6 | 2 | 4 | 200% |
NE | 2020-04-22 | 65 | 25 | 40 | 160% |
VS | 2020-04-27 | 132 | 86 | 46 | 53% |
TI | 2020-04-27 | 311 | 219 | 92 | 42% |
ZG | 2020-04-25 | 8 | 6 | 2 | 33% |
VD | 2020-04-26 | 355 | 267 | 88 | 33% |
SO | 2020-04-28 | 15 | 12 | 3 | 25% |
BE | 2020-04-27 | 83 | 72 | 11 | 15% |
GE | 2020-04-26 | 239 | 222 | 17 | 8% |
GR | 2020-04-26 | 43 | 40 | 3 | 8% |
BL | 2020-04-27 | 30 | 28 | 2 | 7% |
AG | 2020-04-27 | 33 | 31 | 2 | 6% |
TG | 2020-04-27 | 17 | 16 | 1 | 6% |
ZH | 2020-04-27 | 115 | 112 | 3 | 3% |
FR | 2020-04-27 | 78 | 78 | 0 | 0% |
SG | 2020-04-27 | 31 | 31 | 0 | 0% |
SZ | 2020-04-27 | 18 | 18 | 0 | 0% |
GL | 2020-04-27 | 7 | 7 | 0 | 0% |
UR | 2020-04-27 | 5 | 5 | 0 | 0% |
AR | 2020-04-27 | 3 | 3 | 0 | 0% |
NW | 2020-04-27 | 3 | 3 | 0 | 0% |
LU | 2020-04-27 | 16 | 17 | -1 | -6% |
BS | 2020-04-27 | 46 | 50 | -4 | -8% |
CH | TOTAL | 1666 | 1351 | 315 | 23% |
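The diff and percent columns can be recomputed from the two counts; a quick sketch:

```python
def compare(canton_count, bag_count):
    """Return (difference, rounded percentage difference relative to the BAG count)."""
    diff = canton_count - bag_count
    return diff, round(100 * diff / bag_count)
```

For example, the JU row is `compare(7, 1)` and the LU row is `compare(16, 17)`.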
It looks to me like the difference in many cantons is small and most likely depends on the reporting time. But in NE, VS, TI, and VD I guess the difference has some other reason. Maybe NE gives an indication: based on the canton's data, the number of "Décès hospitalier" so far is 22 and the BAG reports 25, whereas the canton's total number is 65.
It is possible that the difference between the numbers of the canton and the BAG is due to testing criteria. Some cantons also declare a COVID-19-positive case based on a CT scan.
@BFLB A Selenium scraper sounds like an interesting idea. Have you had a continuous stream of archived data for the last 2 weeks with it?
@baryluk The scraper works fine most of the time and has needed some changes once in a while. Last week was a game changer: the BAG changed the data model of the CSV file from individual cases to aggregations. Now there is a line per group of gender-ageClass-canton-date (either confirmation date or death date). This adds around 1000 lines per day, most of them containing 0 values, since the curve has flattened.
What I will do next is provide a lean version of the converted CSV file with only non-zero data sets. This should drastically reduce the size, and for most use cases it should work.
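Filtering out the all-zero rows could look like this; a minimal sketch, assuming the caller names the count columns of the aggregated file (the column names in the example are made up):

```python
import csv

def write_lean(src_path, dst_path, value_columns):
    """Copy a CSV, keeping only rows where at least one value column is non-zero."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Drop rows where every count column is empty or "0".
            if any(row[c] not in ("", "0") for c in value_columns):
                writer.writerow(row)
```

Since most of the roughly 1000 new lines per day are zeros, this should shrink the file considerably while preserving all non-trivial data points.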
I adapted to the new model last week and refactored the code today. If everything works smoothly I will start running the scraper as a scheduled task until the end of the week to fully automate the process.
Finally, I would like to scrape the number of tests done as soon as I have time, since these numbers have also been published for a week or two now.
It seems that no one is currently working on this issue (i.e. cross-referencing the data from here and from the BAG). I'm closing this for now, but feel free to re-open it if needed.
I tried to get confirmation from SH and the BAG for the number of deceased, but no one wanted to confirm this issue. It seems that the BAG is now also working on a fully digital version that should be ready for the second wave. So maybe soon we will see more details.
I just found that the BAG is now providing a very detailed data dump, with a breakdown by canton, age group, and sex, fully historicized:
Also, as far as I can see, there is no data for the Principality of Liechtenstein there.
I got these links from https://covid-19-schweiz.bagapps.ch/de-2.html and https://covid-19-schweiz.bagapps.ch/de-1.html, but the interface requires me to select the end date, so they will most likely break and/or not have all the data by tomorrow. So for completeness I am attaching an archive with the files:
BAG_tableau_csv_2020-04-09.tar.gz
I guess it might be useful to develop a tool to cross-reference the data and compare it with what we store in the repo? Or maybe even publish it in a separate directory in this repo too?