openZH / covid_19

COVID-19 case numbers for the cantons of Switzerland and the Principality of Liechtenstein (FL). The data is updated, at best, once a day (times of collection and update may vary). Start with the README.
https://www.zh.ch/de/gesundheit/coronavirus/zahlen-fakten-covid-19.zhweb-noredirect.zhweb-cache.html?keywords=covid19&keyword=covid19#/
Creative Commons Attribution 4.0 International

better monitoring of the updating process #68

Closed tlorusso closed 4 years ago

tlorusso commented 4 years ago

As more and more people are involved (yay!) in updating the data and building scrapers, we might have to start thinking about how to monitor the updating process, and how we can ensure that the data keeps being updated even if one of the scrapers fails or someone forgets to check for new data.

Do you have suggestions on how we could manage this, @metaodi @baryluk @ebeusch @herrstucki @andreasamsler @zdavatz ?

A table in the README (or somewhere else?) which is refreshed automatically after each push to a single file might help, similar to what @baryluk has created here:

https://github.com/openZH/covid_19/issues/61

The table we have now is built by hand.

baryluk commented 4 years ago

That is a good point. Would be nice to know which stuff needs attention or is outdated.

ebeusch commented 4 years ago

I think at the moment we lack a good overview of which cantons are outdated (incl. historic data) and where it might be necessary to get information by going through news outlets etc.

At the moment there also seems to be some double work. This concerns both scrapers and manual updates. So, once there are scrapers (@baryluk seems to get one out by the minute!) it might also be good to assign who is running the scrapers (+ a second/third responsible person in case people fall ill or are otherwise prevented from updating). Similarly, teams can be assigned to update cantons where it's not possible to have automatic data collection.

rokroskar commented 4 years ago

The covidtracking project (covidtracking.com) uses this: https://github.com/COVID19Tracking/covid-tracking - it generates diffs of the sites they are monitoring for updates and shoots off a message to Slack when one needs attention. Seems like something similar might be useful here? I think it would be great if the scrapers could be consolidated into one codebase so that error reporting etc. can be a bit more standardized.
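
For illustration, a minimal Python sketch of that kind of Slack notification, assuming a Slack incoming-webhook URL; the webhook URL and message text below are placeholders, nothing set up for this repo:

import json
import urllib.request

# Post a short message to a Slack incoming webhook when a monitored
# source needs attention. Webhook URL and text are placeholders.
def notify_slack(webhook_url, text):
    payload = json.dumps({"text": text}).encode("utf-8")
    req = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

notify_slack(
    "https://hooks.slack.com/services/XXX/YYY/ZZZ",  # placeholder
    "Canton page changed and needs attention",
)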

baryluk commented 4 years ago

Yeah, I did not expect to write so many, but they all work. I am going to write 3-4 more, then put them into my fork of the repo and add a script that runs all of them, normalizes the various date formats, and creates a consolidated output of new data. Then we could run it from cron, or manually. Once I have a few running, I will open an MR.

zdavatz commented 4 years ago

very good work @baryluk !!!!

metaodi commented 4 years ago

@baryluk I will work on consolidating the scrapers tonight, maybe we can exchange later on mattermost to join forces. I want to schedule all scrapers with Github actions.

zdavatz commented 4 years ago

I would go the way @baryluk suggests.

I suggest you have a cronjob that runs his scrapers after the media conferences of the cantons, or let's say every two hours. If a scraper finds new data, then GitHub Actions should push the new numbers to GitHub, so that the data can be visualized.

GitHub Actions should only be called to update the data on GitHub. The scrapers should be run by a cronjob.

baryluk commented 4 years ago

Progress report as of now:

$ ./meta_scrape.sh
AG 2020-03-20T15:00     168       - OK 2020-03-21T20:34:20+01:00
BE 2020-03-21T          377       3 OK 2020-03-21T20:34:20+01:00
BS 2020-03-21T10:00     299       - OK 2020-03-21T20:34:24+01:00
GE 2020-03-20T08:00     873       7 OK 2020-03-21T20:34:24+01:00
GR 2020-03-20T          213       3 OK 2020-03-21T20:34:27+01:00
JU 2020-03-21T18:00      49       - OK 2020-03-21T20:34:30+01:00
LU 2020-03-21T11:00     109       - OK 2020-03-21T20:34:30+01:00
NE 2020-03-21T15:30     177       2 OK 2020-03-21T20:34:31+01:00
SH 2020-03-20T           14       - OK 2020-03-21T20:34:31+01:00
TG 2020-03-21T           56       - OK 2020-03-21T20:34:32+01:00
UR 2020-03-21T08:00      12       - OK 2020-03-21T20:34:32+01:00
VS 2020-03-21T          359       9 OK 2020-03-21T20:34:32+01:00
XX - - - FAILED
ZH 2020-03-20T16:30     773       - OK 2020-03-21T20:34:32+01:00
$

XX is a test scraper that fails on purpose; it is here to show that the rest of the scrapers still run and are parsed.

zdavatz commented 4 years ago

looks great!

zdavatz commented 4 years ago

@baryluk Let me know when I can test your script.

baryluk commented 4 years ago

@zdavatz

https://github.com/baryluk/covid_19/tree/master/scrapers

Only tested on Linux.

curl and pdftotext (from poppler-utils) are needed, plus standard Unix utilities. It will probably work on other OSes, like BSD, macOS, and WSL. It also requires Python 3.8. If you can't update to 3.8, I can port it to work on previous versions of Python 3, but I would prefer not to.

baryluk commented 4 years ago

I set up a cronjob to run every hour on my server: https://www.functor.xyz/covid_19/scrapers/outputs/

It only creates new files if there is a difference from the previous scrape. It also creates diff files, so incremental differences can be seen easily.

That is obviously just a temporary solution; more integration with this repo and its CSV files is needed.

But it is good enough for a start.

I will write a few more scrapers soon. But I will not be able to write all of them, as some cantons (for example Appenzell Innerrhoden) really don't make it easy.

zdavatz commented 4 years ago

great work!

zdavatz commented 4 years ago

> @zdavatz
>
> https://github.com/baryluk/covid_19/tree/master/scrapers
>
> Only tested on Linux.
>
> curl and pdftotext (from poppler-utils) are needed, plus standard Unix utilities. It will probably work on other OSes, like BSD, macOS, and WSL. It also requires Python 3.8. If you can't update to 3.8, I can port it to work on previous versions of Python 3, but I would prefer not to.

Will pull now!

zdavatz commented 4 years ago

~/.software/covid_19/scrapers> ./meta_scrape.sh 
AG - - - FAILED
BE - - - FAILED
BS - - - FAILED
GE - - - FAILED
GR - - - FAILED
JU - - - FAILED
LU - - - FAILED
NE - - - FAILED
SH - - - FAILED
TG - - - FAILED
TI - - - FAILED
UR - - - FAILED
VS - - - FAILED
XX - - - FAILED
ZH - - - FAILED

Python 3.7.3, Gentoo Linux. The individual tools work on their own.

baryluk commented 4 years ago

It must be Python 3.8.0+. It uses assignment expressions (the := walrus operator) in regexp matching to make the code shorter. I will rewrite it to work with previous Python versions too, for your convenience.
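
For context, a small hypothetical example of the kind of change involved (not the actual scraper code; the snippet itself needs Python 3.8 to run because of the first form):

import re

line = "Confirmed cases: 773"

# Python 3.8+: assignment expression (walrus operator), the shorter form
if m := re.search(r"Confirmed cases:\s*(\d+)", line):
    cases = int(m.group(1))

# Python 3.7-compatible rewrite of the same check
m = re.search(r"Confirmed cases:\s*(\d+)", line)
if m:
    cases = int(m.group(1))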

zdavatz commented 4 years ago

ok, thank you. I could also try to compile python-3.8 from source on Gentoo.

This works with 3.7.3

~/.software/covid_19/scrapers> ./scrape_zh.sh 
ZH
Scraped at: 2020-03-21T22:42:47+01:00
Date and time: 20.3.2020, 16.30
Confirmed cases: 773

baryluk commented 4 years ago

@zdavatz Pull again and re-run meta_scrape.sh. It should work with Python 3.7 now.

zdavatz commented 4 years ago

works great, thank you!

~/.software/covid_19/scrapers> ./meta_scrape.sh
AG 2020-03-20T15:00     168       - OK 2020-03-21T22:48:18+01:00
BE 2020-03-21T          377       3 OK 2020-03-21T22:48:18+01:00
BS 2020-03-21T10:00     299       - OK 2020-03-21T22:48:19+01:00
GE 2020-03-20T08:00     873       7 OK 2020-03-21T22:48:19+01:00
GR 2020-03-20T          213       3 OK 2020-03-21T22:48:20+01:00
JU 2020-03-21T18:00      49       - OK 2020-03-21T22:48:21+01:00
LU 2020-03-21T11:00     109       - OK 2020-03-21T22:48:21+01:00
NE 2020-03-21T15:30     177       2 OK 2020-03-21T22:48:21+01:00
SH 2020-03-20T           14       - OK 2020-03-21T22:48:21+01:00
TG 2020-03-21T           56       - OK 2020-03-21T22:48:22+01:00
TI 2020-03-21T08:00     918      28 OK 2020-03-21T22:48:22+01:00
UR 2020-03-21T08:00      12       - OK 2020-03-21T22:48:22+01:00
VD - - - FAILED
VS 2020-03-21T          359       9 OK 2020-03-21T22:48:22+01:00
XX - - - FAILED
ZH 2020-03-20T16:30     773       - OK 2020-03-21T22:48:22+01:00

How do I use parse_scrape_output.py?

baryluk commented 4 years ago

parse_scrape_output.py is just a small script that transforms each scraper output into a single line.

From:

ZH
Scraped at: 2020-03-21T22:42:47+01:00
Date and time: 20.3.2020, 16.30
Confirmed cases: 773

into:

ZH 2020-03-20T16:30 773 - OK 2020-03-21T22:48:22+01:00

It is automatically called in meta_scrape.sh on each individual scraper, so you don't need to call it manually.

You can do ./scrape_zh.sh | ./parse_scrape_output.py if you want, though.
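
For illustration, a rough sketch of that transformation (hypothetical code, not the actual parse_scrape_output.py; the real script also normalizes the various date formats and handles a deaths column, which are skipped here):

import re
import sys

# Read one scraper's output on stdin and print a one-line summary,
# mirroring the "ABBR date cases deaths OK scraped-at" layout shown above.
def summarize(text):
    lines = text.strip().splitlines()
    abbr = lines[0].strip() if lines else "??"
    scraped = re.search(r"Scraped at:\s*(\S+)", text)
    date = re.search(r"Date and time:\s*(.+)", text)
    cases = re.search(r"Confirmed cases:\s*(\d+)", text)
    if not cases:
        return f"{abbr} - - - FAILED"
    return "{} {} {} - OK {}".format(
        abbr,
        date.group(1).strip() if date else "-",
        cases.group(1),
        scraped.group(1) if scraped else "-",
    )

if __name__ == "__main__":
    print(summarize(sys.stdin.read()))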

zdavatz commented 4 years ago

Ok, great! And where is the output written to? You can compare the numbers by using this: https://github.com/zdavatz/covid2019_ch_map

baryluk commented 4 years ago

@zdavatz It is written to standard output. You can save it the usual way, like

./meta_scrape.sh | tee current.txt

or

./meta_scrape.sh > current.txt

for example.

I have a script that compares the previous scrape with the current scrape and does the diffing; it preserves the full history as of now.

https://www.functor.xyz/covid_19/scrapers/outputs/

I will make a driver for meta_scrape to convert the data into CSV and write (append) it to the individual files as they are now in the repo.
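
A hypothetical sketch of such a driver, appending one row per scrape to a per-canton CSV; the column names and file layout are placeholders, not necessarily the repo's actual CSV schema:

import csv
import os

FIELDS = ["date", "time", "abbreviation_canton", "confirmed_cases", "deaths"]

# Append one scraped record to the per-canton CSV, writing a header
# only when the file is created for the first time.
def append_row(outdir, record):
    path = os.path.join(outdir, f"{record['abbreviation_canton']}.csv")
    write_header = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(record)

if __name__ == "__main__":
    append_row(".", {
        "date": "2020-03-21",
        "time": "20:34",
        "abbreviation_canton": "ZH",
        "confirmed_cases": 773,
        "deaths": "",
    })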

zdavatz commented 4 years ago

> @zdavatz It is written to standard output. You can save it the usual way, like
>
> ./meta_scrape.sh | tee current.txt
>
> or
>
> ./meta_scrape.sh > current.txt
>
> for example.

works great!

> I have a script that compares the previous scrape with the current scrape and does the diffing; it preserves the full history as of now.

Great!

> https://www.functor.xyz/covid_19/scrapers/outputs/

Please keep your server running, for historical reasons. You could actually create a folder for each day, with a file per canton. That way the data is always available to check.

> I will make a driver for meta_scrape to convert the data into CSV and write (append) it to the individual files as they are now in the repo.

Ok, great.

baryluk commented 4 years ago

> Please keep your server running, for historical reasons. You could actually create a folder for each day, with a file per canton. That way the data is always available to check.

That is the plan. I should have the pull request ready tomorrow evening, with a few more scrapers and integration into the CSV generation per canton.

baryluk commented 4 years ago

I think I have implemented all the scrapers that can be implemented. There might be one or two more that could be done, but they would require a lot of fuzzy matching, and possibly even machine learning. Too hard.

A few sources are completely missing.

A few are tricky, but doable, and are working for the moment.

I summarized the status in:

https://github.com/baryluk/covid_19/blob/master/scrapers/STATUS.md

and in

https://github.com/baryluk/covid_19/blob/master/scrapers/TODO.md

baryluk commented 4 years ago

The direct link to the latest results can be found here:

https://www.functor.xyz/covid_19/scrapers/outputs/latest.txt

Example:

AG 2020-03-20T15:00     168       - OK 2020-03-23T16:05:03+01:00
BE 2020-03-23T          470       5 OK 2020-03-23T16:05:04+01:00
BL 2020-03-23T14:00     302       - OK 2020-03-23T16:05:06+01:00
BS 2020-03-23T10:00     376       - OK 2020-03-23T16:05:07+01:00
GE 2020-03-20T08:00     873       7 OK 2020-03-23T16:05:08+01:00
GL 2020-03-22T13:30      31       - OK 2020-03-23T16:05:08+01:00
GR 2020-03-22T          266       6 OK 2020-03-23T16:05:09+01:00
JU 2020-03-22T17:00      51       - OK 2020-03-23T16:05:09+01:00
LU 2020-03-23T11:00     156       - OK 2020-03-23T16:05:10+01:00
NE 2020-03-22T15:00     188       2 OK 2020-03-23T16:05:10+01:00
NW 2020-03-22T16:25      36       - OK 2020-03-23T16:05:11+01:00
SG 2020-03-23T          185       - OK 2020-03-23T16:05:11+01:00
SH 2020-03-23T           30       - OK 2020-03-23T16:05:12+01:00
SZ - - - FAILED
TG 2020-03-23T           81       - OK 2020-03-23T16:05:13+01:00
TI 2020-03-23T08:00    1165      48 OK 2020-03-23T16:05:13+01:00
UR 2020-03-21T08:00      12       - OK 2020-03-23T16:05:14+01:00
VD 2020-03-22T         1782      16 OK 2020-03-23T16:05:14+01:00
VS 2020-03-23T          492       2 OK 2020-03-23T16:05:15+01:00
XX - - - FAILED
ZH 2020-03-23T09:30    1068       - OK 2020-03-23T16:05:15+01:00

Yes, SZ is broken. I can't do much about it, as SZ publishes its numbers in separate PDFs, but not actually every day.

The directory on the server https://www.functor.xyz/covid_19/scrapers/outputs/ contains all historical unique scrapes, and the differences between each file, so the progression of updates can be seen easily.

I see somebody has already started working on GitHub Actions to integrate the scripts into GitHub. Good work.

ebeusch commented 4 years ago

If we can't get SZ to publish on their website, I have no problem with taking responsibility for collecting their data manually.

baryluk commented 4 years ago

@ebeusch Do you know where to look for it?

https://www.sz.ch/behoerden/information-medien/medienmitteilungen/coronavirus.html/72-416-412-1379-6948 provides documents in PDF form, but not all of them provide numbers.

The latest one I could find with numbers is this document from 15 March: https://www.sz.ch/public/upload/assets/45590/MM_Coronavirus_15_3_2020.pdf - saying there are 13 cases (it is parsed correctly by the scraper).

The one from 17 March doesn't have numbers (and thus the scraper fails).

It has now been almost 8 days without an official update on numbers from Kanton Schwyz, which is a bit worrying, to say the least. Maybe we should ask the BAG / FOPH for the data? They clearly have it somewhere in their documents.

https://www.bag.admin.ch/dam/bag/de/dokumente/mt/k-und-i/aktuelle-ausbrueche-pandemien/2019-nCoV/covid-19-lagebericht.pdf.download.pdf/COVID-19_Epidemiologische_Lage_Schweiz.pdf shows 45.9 cases per 100k population. So with a population of about 160k, that would be about 73 cases as of today (23.03.2020, 8:12 Uhr).
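
A quick back-of-the-envelope check of that estimate (the population figure is approximate):

rate_per_100k = 45.9      # from the BAG situation report
population = 160_000      # approximate population of Kanton Schwyz
estimated_cases = rate_per_100k * population / 100_000
print(round(estimated_cases))  # -> 73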

Maybe a good idea would be to actually try to parse this PDF from the BAG and do a bit of OCR on it for all cantons. I will start grabbing these PDFs every few hours (in https://www.functor.xyz/covid_19b/) for historical reference, just in case.

ebeusch commented 4 years ago

@baryluk very good point. Embarrassingly, I didn't check what their website looks like now (I was the one who found those old PDFs, though). I will write to their contact e-mail and ask if they can publish the numbers or mail them (as far as I understand, that's what AI is doing?).

Considering that the BAG/FOPH has different numbers at times, trying to parse their PDFs would at the very least be interesting for comparing with the numbers here, I think.

zdavatz commented 4 years ago

outstanding work @baryluk !!!

zdavatz commented 4 years ago

@baryluk can you make one script that combines meta_scrape.sh so it also runs latest_per_canton.sh or latest_total.sh?

zdavatz commented 4 years ago

@baryluk also, can you publish the latest sqlite DB with all the data that you scraped on your server? That way others can just download the sqlite DB from your server.

baryluk commented 4 years ago

@ebeusch :

> @baryluk very good point. Embarrassingly, I didn't check what their website looks like now (I was the one who found those old PDFs, though). I will write to their contact e-mail and ask if they can publish the numbers or mail them (as far as I understand, that's what AI is doing?).

Ok. Please e-mail them and ask.

baryluk commented 4 years ago

@zdavatz Yes, I can make an sqlite and CSV export of all the data (including the already scraped historical data). Give me 40 minutes.

zdavatz commented 4 years ago

@baryluk Ok, great! Let me know when I can pull and test. I love that you are building an sqlite DB with all the data, including what has already been scraped. This way we can generate a new DB after every scrape, or once a day.

baryluk commented 4 years ago

@zdavatz

https://www.functor.xyz/covid_19/scrapers/outputs/scrapes.sqlite
https://www.functor.xyz/covid_19/scrapers/outputs/scrapes.csv (the same data, just in CSV format, with a header, and sorted by date, abbr, time).

It contains a merge of all the data, including everything already scraped.

I can still adjust it if you want, e.g. split it into separate files per canton, or use a different format.
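
For reference, a minimal sketch of how such an sqlite DB of scrape records could be built; the table and column names are placeholders and not necessarily the schema of the actual scrapes.sqlite:

import sqlite3

conn = sqlite3.connect("scrapes.sqlite")
# Placeholder schema: one row per scraped data point.
conn.execute("""
    CREATE TABLE IF NOT EXISTS data (
        date TEXT,
        time TEXT,
        abbr TEXT,
        confirmed INTEGER,
        deceased INTEGER
    )
""")
conn.execute(
    "INSERT INTO data (date, time, abbr, confirmed, deceased) VALUES (?, ?, ?, ?, ?)",
    ("2020-03-23", "09:30", "ZH", 1068, None),
)
conn.commit()
conn.close()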

baryluk commented 4 years ago

@zdavatz

Oh, something is not quite right with this data. There are some repeating rows. Hmm, I am not sure why. I think this is probably because time is NULL for some data / cantons.

zdavatz commented 4 years ago

> Oh, something is not quite right with this data. There are some repeating rows. Hmm, I am not sure why. I think this is probably because time is NULL for some data / cantons.

Ok, please double check.

baryluk commented 4 years ago

> Oh, something is not quite right with this data. There are some repeating rows. Hmm, I am not sure why. I think this is probably because time is NULL for some data / cantons.
>
> Ok, please double check.

@zdavatz Fixed by using SELECT DISTINCT. The CSV should now have unique rows.
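
A small sketch of that kind of deduplicated export, using SELECT DISTINCT (same placeholder schema as in the earlier sketch, not the actual export script):

import csv
import sqlite3

conn = sqlite3.connect("scrapes.sqlite")
# SELECT DISTINCT drops the repeated rows before writing the CSV.
rows = conn.execute(
    "SELECT DISTINCT date, time, abbr, confirmed, deceased "
    "FROM data ORDER BY date, abbr, time"
)
with open("scrapes.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "time", "abbr", "confirmed", "deceased"])
    writer.writerows(rows)
conn.close()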

zdavatz commented 4 years ago

> @zdavatz Fixed by using SELECT DISTINCT. The CSV should now have unique rows.

Looks great! Please keep the style, location, and name of the CSV and sqlite DB as they are. You could add a version to the filename, so that if you update the format of the CSV/sqlite DB you just do it under a new filename (version).

baryluk commented 4 years ago

@zdavatz Absolutely.

zdavatz commented 4 years ago

@baryluk supercool! This way we really are monitoring all the cantonal government websites, making sure that we get notified if and when something changes. See: https://github.com/openZH/covid_19/issues/115#issuecomment-602788983

baryluk commented 4 years ago

A Twitter bot would be nice. Let me try the mail notification first. If that works, we can extend it to open an issue on GitHub using the API.

Also, I want to merge my few other minor scripts into this repo, so it is easier for you to run them in case my server or internet connection is down, or something happens to me.
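
For illustration, a hypothetical sketch of the GitHub-issue variant, posting to the GitHub REST API issues endpoint; the repository name, token, and message below are placeholders:

import json
import urllib.request

# Open a new issue via the GitHub REST API (POST /repos/{owner}/{repo}/issues),
# authenticated with a personal access token. Token and text are placeholders.
def open_issue(repo, token, title, body):
    req = urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps({"title": title, "body": body}).encode("utf-8"),
        headers={
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github.v3+json",
        },
        method="POST",
    )
    urllib.request.urlopen(req)

if __name__ == "__main__":
    open_issue(
        "openZH/covid_19",
        "<personal-access-token>",  # placeholder
        "Scraper failed for canton SZ",
        "The SZ scraper returned FAILED in the latest run.",
    )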

zdavatz commented 4 years ago

@baryluk let's hope not, but yes, I agree. After each run, you could write a report file in HTML, for example. We do something similar here for other public domain data: http://pillbox.oddb.org/amiko_report_de.html (Swiss drug information required by law). After every run, which parses lots of sources, we create a report file.

tlorusso commented 4 years ago

I think we are at a point where we have quite a good overview of the updating / scraping process with:

https://www.web.statistik.zh.ch/covid19_dashboard/index.html#/ & https://www.functor.xyz/covid_19/scrapers/outputs/

Any reason to leave this issue open @metaodi @baryluk ?

metaodi commented 4 years ago

I agree we have a good overview. From my point of view, the only thing I'd like to change is the need to check all those places myself.

I get an error mail from GitHub for every failed run of the scraper. But you don't (I guess), so automating that would be my goal, no matter whether it's an email, a chat bot, or a new ticket on GitHub.

zdavatz commented 4 years ago

go for it.

baryluk commented 4 years ago

I think the dashboard is excellent, and provides the needed monitoring.

The only thing we still need is detecting optional data regressions from scrapers, i.e. the scraper for canton CC providing numbers of deaths or ICU cases one day, but not providing them (even if unchanged) the next day (for whatever reason: broken scraper, missing data on the original website, etc.). Right now, we just consume what we have, and possibly silently miss data that is there but was not scraped due to a format change. The only fields we strictly require in the current workflow are the day and the number of confirmed cases. We should at least emit some kind of warning when we know a number for X (deaths, or hospitalizations) should be there but was not scraped. I think this should be dealt with in a separate bug, though.
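
A minimal sketch of such a regression check, assuming a simple mapping of canton abbreviation to scraped fields (the record format is a placeholder, not the repo's actual data model):

# Warn when a field that a canton's scraper produced in the previous run
# is missing in the current run.
def find_regressions(previous, current):
    warnings = []
    for abbr, old_fields in previous.items():
        new_fields = current.get(abbr, {})
        for field, value in old_fields.items():
            if value is not None and new_fields.get(field) is None:
                warnings.append(f"{abbr}: field '{field}' was present before but is missing now")
    return warnings

if __name__ == "__main__":
    prev = {"GE": {"confirmed": 873, "deaths": 7}}
    curr = {"GE": {"confirmed": 873, "deaths": None}}
    for w in find_regressions(prev, curr):
        print("WARNING:", w)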

jb3-2 commented 4 years ago

I have the impression that with the Scraper Status board (https://github.com/openZH/covid_19/blob/master/scrapers/STATUS.md) we have a good overview of what's going on, in addition to the manual review by our colleagues. Do you agree if I close the issue?