Closed: baryluk closed this issue 4 years ago.
The links don't work any more. I think they are tied dynamically to a particular viewing instance (on my computer), and so are now gone.
There is probably a way to get this data without a session, or to create a session programmatically.
There is some information about the HTTP REST API in Python here: https://github.com/tableau/server-client-python
And a tutorial here: https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_get_started_tutorial_part_1.htm
The HTTP REST API itself is also documented; for example, these are the methods we are probably most interested in: https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref_datasources.htm#download_data_source and https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api_ref_workbooksviews.htm#get_view
It looks like this: `GET /api/api-version/sites/site-id/datasources/datasource-id/content`
or `GET /api/api-version/sites/site-id/views/view-id`
and/or `GET /api/api-version/sites/site-id/views/view-id/data`
(These are also implemented in the official Python library mentioned above.)
This needs more digging to figure out.
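As a rough sketch of the endpoints quoted above (the base URL, API version, site id, and resource ids below are placeholders, not real values; in practice a sign-in request has to supply them):

```python
# Minimal helpers that assemble the Tableau REST endpoints quoted above.
# All parameters are placeholders; real values come from sign-in responses.

def datasource_content_url(base, api_version, site_id, datasource_id):
    """Endpoint that returns the datasource content (Download Data Source)."""
    return f"{base}/api/{api_version}/sites/{site_id}/datasources/{datasource_id}/content"

def view_url(base, api_version, site_id, view_id):
    """Endpoint that returns metadata about a single view (Get View)."""
    return f"{base}/api/{api_version}/sites/{site_id}/views/{view_id}"

def view_data_url(base, api_version, site_id, view_id):
    """Endpoint that returns the data behind a view."""
    return view_url(base, api_version, site_id, view_id) + "/data"
```

With valid credentials, these URLs would then be fetched with an authenticated GET (the official Python library wraps exactly this).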
There is also a JavaScript client, the same one used for the visualisations themselves. It might be possible to execute some of it in Node.js, but it may have too many browser dependencies and not actually work in Node.js.
So far, by inspecting the HTML and JavaScript, the "path" parameter is shared/3WXJ8ZXN4
Hi @baryluk, thank you for working on this. I found the datasource as well some time ago and have started to feed it to Elasticsearch. So far I am downloading it manually via https://covid-19-schweiz.bagapps.ch/de-1.html. Have you found a way to automate the download yet? If it is not possible via the API, I could help by building a scraper based on synthetic monitoring.

In terms of the dataset, I am using the version with all columns and all lines. As you mentioned, the number of lines corresponds to the number of confirmed cases, which is as detailed as it can get. There is even a column called f1, which seems to contain the case number; that could simplify updating the data. The problem is that these numbers seem to somehow change over time. On each of the updates I made, more than 1000 numbers no longer existed in the new data, yet the number of lines in the new dataset was correct.

In addition, a lot of the data is redundant, i.e. exists in German, French, or as abbreviations. Right now I plan to implement some normalization on import, although creating a sanitized English version would be the better solution. It would also be good to store the data publicly, ideally in this repo, since it is used by a lot of people and solutions. If collaboration is welcome, please let me know.
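To quantify the churn described above, one could diff the f1 case numbers between two consecutive exports. A minimal sketch, assuming the column name f1 from the comment above; the file-handling details are an assumption:

```python
import csv

def case_ids(path, column="f1"):
    """Collect the set of case numbers from one CSV export."""
    with open(path, newline="", encoding="utf-8") as f:
        return {row[column] for row in csv.DictReader(f)}

def disappeared(old_ids, new_ids):
    """Case numbers present in the old export but missing from the new one."""
    return old_ids - new_ids
```

Running `disappeared(case_ids("old.csv"), case_ids("new.csv"))` against two daily downloads would show which case numbers vanished between updates.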
Hi Bernhard,
Any input sanitization and cross-referencing to validate the data is welcome.
I didn't work on automating it yet, as I had a so-so week, but I do hope to work on it later.
I suggest starting to collect the data in your own repo on GitHub, if you have the means to do it.
Cheers, Witold
Hi Witold, I have just created my own repo. It contains one original file, a first draft of a converted file, and a Python script to do the conversion. Scraping will be done manually for the moment. Please feel free to have a look. Comments are highly welcome.
Regards Bernhard
Hi @baryluk, if you are interested, a Python/Selenium scraper is now available in my repo.
Cheers Bernhard
I just compared the numbers of deceased reported by the cantons and by the BAG:
Canton | Date | Canton number | BAG number | Diff | In percent |
---|---|---|---|---|---|
JU | 2020-04-27 | 7 | 1 | 6 | 600% |
SH | 2020-04-28 | 6 | 2 | 4 | 200% |
NE | 2020-04-22 | 65 | 25 | 40 | 160% |
VS | 2020-04-27 | 132 | 86 | 46 | 53% |
TI | 2020-04-27 | 311 | 219 | 92 | 42% |
ZG | 2020-04-25 | 8 | 6 | 2 | 33% |
VD | 2020-04-26 | 355 | 267 | 88 | 33% |
SO | 2020-04-28 | 15 | 12 | 3 | 25% |
BE | 2020-04-27 | 83 | 72 | 11 | 15% |
GE | 2020-04-26 | 239 | 222 | 17 | 8% |
GR | 2020-04-26 | 43 | 40 | 3 | 8% |
BL | 2020-04-27 | 30 | 28 | 2 | 7% |
AG | 2020-04-27 | 33 | 31 | 2 | 6% |
TG | 2020-04-27 | 17 | 16 | 1 | 6% |
ZH | 2020-04-27 | 115 | 112 | 3 | 3% |
FR | 2020-04-27 | 78 | 78 | 0 | 0% |
SG | 2020-04-27 | 31 | 31 | 0 | 0% |
SZ | 2020-04-27 | 18 | 18 | 0 | 0% |
GL | 2020-04-27 | 7 | 7 | 0 | 0% |
UR | 2020-04-27 | 5 | 5 | 0 | 0% |
AR | 2020-04-27 | 3 | 3 | 0 | 0% |
NW | 2020-04-27 | 3 | 3 | 0 | 0% |
LU | 2020-04-27 | 16 | 17 | -1 | -6% |
BS | 2020-04-27 | 46 | 50 | -4 | -8% |
CH | TOTAL | 1666 | 1351 | 315 | 23% |
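The diff and percent columns can be recomputed from the two counts; a quick sketch:

```python
def compare(canton_count, bag_count):
    """Return (difference, rounded percentage difference relative to the BAG count)."""
    diff = canton_count - bag_count
    return diff, round(100 * diff / bag_count)
```

For example, the JU row is `compare(7, 1)` and the LU row is `compare(16, 17)`.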
It looks to me like the difference in many cantons is small and most likely depends on the reporting time. But in NE, VS, TI, and VD I guess the difference has some other reason. Maybe NE gives an indication: based on the canton's data, the number of "Décès hospitalier" so far is 22 and the BAG reports 25, whereas the canton's total number is 65.
It is possible that the difference between the numbers of the canton and the BAG is due to testing criteria. Some cantons also declare a COVID-19-positive case based on a CT scan.
@BFLB A Selenium scraper sounds like an interesting idea. Have you had a continuous stream of archived data for the last 2 weeks with it?
@baryluk The scraper works fine most of the time and has needed some changes once in a while. Last week was a game changer: the BAG changed the data model of the CSV file from individual cases to aggregations. Now there is a line per group of gender-ageClass-canton-date (either confirmation date or death date). This adds around 1000 lines per day, most of them containing 0 values, since the curve has flattened.
What I will do next is provide a lean version of the converted CSV file with only non-zero data sets. This should drastically reduce the size, and for most use cases it should work.
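Filtering out the all-zero rows could look like this; a minimal sketch, assuming the caller names the count columns of the aggregated file (the column names in the example are made up):

```python
import csv

def write_lean(src_path, dst_path, value_columns):
    """Copy a CSV, keeping only rows where at least one value column is non-zero."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Drop rows where every count column is empty or "0".
            if any(row[c] not in ("", "0") for c in value_columns):
                writer.writerow(row)
```

Since most of the roughly 1000 new lines per day are zeros, this should shrink the file considerably while preserving all non-trivial data points.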
I adapted to the new model last week and refactored the code today. If everything works smoothly I will start running the scraper as a scheduled task until the end of the week to fully automate the process.
Finally, I would like to scrape the number of tests done as soon as I have time, since these numbers have also been published for a week or two now.
It seems that no one is currently working on this issue (i.e. cross-referencing the data from here and from the BAG). I'm closing this for now, but feel free to re-open it if needed.
I tried to get confirmation from SH and the BAG for the number of deceased, but no one wanted to confirm this issue. It seems that the BAG is now also working on a fully digital version that should be ready for the second wave. So maybe soon we will see more details.
I just found that the BAG is now providing a very detailed data dump, with a breakdown by canton, age group, and sex, fully historicized:
Also, as far as I can see, there is no data for the Principality of Liechtenstein there.
I got these links from https://covid-19-schweiz.bagapps.ch/de-2.html and https://covid-19-schweiz.bagapps.ch/de-1.html, but the interface requires me to select the end date, so they will most likely break and/or not have all the data by tomorrow. So for completeness I am attaching an archive with the files:
BAG_tableau_csv_2020-04-09.tar.gz
I guess it might be useful to develop a tool to cross-reference the data and compare it with what we store in the repo? Or maybe even publish it in a separate directory in this repo too?