openZH / covid_19

COVID19 case numbers of Cantons of Switzerland and Principality of Liechtenstein (FL). The data is updated at best once a day (times of collection and update may vary). Start with the README.
https://www.zh.ch/de/gesundheit/coronavirus/zahlen-fakten-covid-19.zhweb-noredirect.zhweb-cache.html?keywords=covid19&keyword=covid19#/
Creative Commons Attribution 4.0 International

Extract hospitalized and deaths for BS scraper #193

Closed baryluk closed 4 years ago

baryluk commented 4 years ago

There is some extra information in the daily bulletins now:

https://www.gd.bs.ch/nm/2020-tagesbulletin-coronavirus-466-bestaetigte-faelle-im-kanton-basel-stadt-gd.html

Das Gesundheitsdepartement Basel-Stadt meldet mit Stand Mittwoch, 25.
März 2020, 10 Uhr, insgesamt 466 positive Fälle von Personen mit Wohnsitz
im Kanton Basel-Stadt sowie drei weitere Todesfälle.

Mit Stand Mittwoch, 25. März 2020, 10 Uhr, liegen insgesamt 466 positive
Fälle von Personen mit Wohnsitz im Kanton Basel-Stadt vor. Dies sind 52
mehr als am Vortag. 128 Personen der 466 positiv Getesteten und somit
über ein Viertel sind wieder genesen. 58 erkrankte Baslerinnen und Basler
sind aktuell aufgrund einer Infektion mit Covid-19 (Coronavirus)
hospitalisiert.

Im Kanton Basel-Stadt werden nebst den Tests der Kantonsbewohnerinnen und
-bewohner auch Tests von Verdachtsfällen aus anderen Schweizer Kantonen
und dem grenznahen Ausland durchgeführt. Bisher sind die Tests von 773
Personen positiv ausgefallen (inklusive der 466 Basler Fälle).

I don't think adding this now would be reliable (it is free-form text, with some numbers written out as words), but some of it could be extracted, if possible.

This is a tracking bug.
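For reference, a minimal sketch of what digit-only extraction from the bulletin quoted above could look like. The patterns and field names are assumptions made for illustration, tied to the exact wording of that one bulletin; any rewording would break them, and numbers written as words are not handled.

```python
import re

# Sentences copied from the 25 March 2020 bulletin quoted above.
BULLETIN = (
    "Mit Stand Mittwoch, 25. März 2020, 10 Uhr, liegen insgesamt 466 positive "
    "Fälle von Personen mit Wohnsitz im Kanton Basel-Stadt vor. Dies sind 52 "
    "mehr als am Vortag. 128 Personen der 466 positiv Getesteten und somit "
    "über ein Viertel sind wieder genesen. 58 erkrankte Baslerinnen und Basler "
    "sind aktuell aufgrund einer Infektion mit Covid-19 (Coronavirus) "
    "hospitalisiert. Bisher sind die Tests von 773 Personen positiv "
    "ausgefallen (inklusive der 466 Basler Fälle)."
)

# Hypothetical field names and patterns; each depends on the current wording.
PATTERNS = {
    "confirmed_residents": r"insgesamt (\d+) positive Fälle",
    "recovered": r"(\d+) Personen der \d+ positiv Getesteten",
    "hospitalized_residents": r"(\d+) erkrankte Baslerinnen und Basler",
    "confirmed_all": r"Tests von (\d+) Personen positiv",
}


def extract_counts(text):
    """Return the fields whose pattern matched; unmatched fields are omitted."""
    counts = {}
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            counts[field] = int(match.group(1))
    return counts


counts = extract_counts(BULLETIN)
print(counts)
```

Omitting unmatched fields, instead of writing a zero, keeps a silent format change from turning into a wrong number in the CSV.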

jb3-2 commented 4 years ago

If I understand correctly, the only info in the text that is not present in the csv is the number of positive tests from people that live outside of Basel-Stadt. Do you agree? Do other Cantons report this number, too? Unsure how this can be tracked:

BTW @tlorusso @metaodi I'm working on getting the Gesundheitsdepartement BS to publish the data in a table so it can be scraped more easily.

baryluk commented 4 years ago

Today's version:

Das Gesundheitsdepartement Basel-Stadt meldet mit Stand Freitag, 27. März 2020, 10 Uhr, insgesamt 534 positive Fälle von Personen mit Wohnsitz im Kanton Basel-Stadt und einen weiteren Todesfall.

Mit Stand Freitag, 27. März 2020, 10 Uhr, liegen insgesamt 534 positive Fälle von Personen mit Wohnsitz im Kanton Basel-Stadt vor. Dies sind 29 mehr als am Vortag. 191 Personen der 534 positiv Getesteten und damit mehr als ein Drittel sind wieder genesen.

Im Kanton Basel-Stadt liegt ein weiterer Todesfall aufgrund einer Covid-19-Infektion vor. Die verstorbene Patientin gehörte wiederum zur Risikogruppe (älter als 65 und bestehende Vorerkrankung). Die Zahl der Todesfälle im Kanton Basel-Stadt beträgt nunmehr insgesamt dreizehn.

Im Kanton Basel-Stadt werden nebst den Tests der Kantonsbewohnerinnen und -bewohner auch Tests von Verdachtsfällen aus anderen Schweizer Kantonen und dem grenznahen Ausland durchgeführt. Bisher sind die Tests von 877 Personen positiv ausgefallen (inklusive der 534 Basler Fälle).

Aktuell befinden sich total 76 Personen aufgrund einer Covid-19-Infektion in Spitalpflege in einem baselstädtischen Spital. 57 davon sind Einwohnerinnen und Einwohner des Kantons Basel-Stadt. Insgesamt acht Personen benötigen Intensivpflege. Die anderen Patientinnen und Patienten befinden sich auf der normalen Station.

I am talking about these sentences:

Die Zahl der Todesfälle im Kanton Basel-Stadt beträgt nunmehr insgesamt dreizehn.

mit Wohnsitz im Kanton Basel-Stadt sowie drei weitere Todesfälle.

58 erkrankte Baslerinnen und Basler sind aktuell aufgrund einer Infektion mit Covid-19 (Coronavirus) hospitalisiert.

jb3-2 commented 4 years ago

Is it even possible to scrape these numbers from the text? That would definitely be helpful, but I suppose it's complex.

@andreasamsler @tlorusso Up to now I have only added the number of deceased people that lived in BS into the column ncumul_deceased in the sheet. Is that consistent with what other cantons do? Since the BS University Hospital also has cases from other cantons as well as probably people living in Germany or France, that should be the correct way, shouldn't it?

About the number of hospitalized people: I will go back through all the source links and check whether I have only added people living in Basel in the column ncumul_hosp. We don't want people who live outside of BS in the BS worksheet, because these people are probably (?) reported by their home canton. Or aren't they?

metaodi commented 4 years ago

@jb3-2 this last point is related to #187 and needs to be resolved.

tlorusso commented 4 years ago

When it comes to the number of hospitalized cases in the ncumul_hosp column, I really think we should take the number of patients who are currently hospitalized in a canton, independent of their residency. From a resources point of view, every patient counts. And, as far as I can judge, double reporting should not be too much of a risk here: I haven't yet seen other cantons besides BS that report figures this way. The reporting in other cantons seems to be mostly hospital-centered and not residency-centered. I guess in the case of BS this differentiation has to be made because the number of non-resident patients is higher than in other regions of Switzerland and might even include foreign patients. This could be similar for TI and GE, but I haven't seen any explicit statements or differentiations going in that direction there.

So my proposal is:

That being said, it might make sense to highlight in the README that definitions and the way cases are counted can differ among cantons, also because it might not be possible to get precise definitions from all cantons as quickly as we would need them.

wdyt @jb3-2 @metaodi @zukunft ?

baryluk commented 4 years ago

Daily bulletins from BS continue to be a pain to scrape:

Das Gesundheitsdepartement Basel-Stadt meldet mit Stand Samstag, 28. März 2020, 10 Uhr, insgesamt 573 positive Fälle von Personen mit Wohnsitz im Kanton Basel-Stadt.

Mit Stand Samstag, 28. März 2020, 10 Uhr, liegen insgesamt 573 positive Fälle von Personen mit Wohnsitz im Kanton Basel-Stadt vor. Dies sind 39 mehr als am Vortag. Der Kanton Basel-Stadt verzeichnet unverändert dreizehn Todesfälle. 211 Personen der 573 positiv Getesteten und damit mehr als ein Drittel sind wieder genesen.

Technically we could capture this dreizehn and so on, and do something with it, but it will probably break by Monday.

EDIT: Actually, it is impossible. The format changes too frequently, and sometimes the daily bulletin reports only a change from the previous bulletin, not an absolute number. :(
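To make the fragility concrete, here is a rough sketch of capturing spelled-out numbers such as "dreizehn" from the sentences quoted in this thread. The word list, the patterns, and the function names are all assumptions for illustration. Note that the result has to say whether a sentence gives a running total or only a change since the previous bulletin, which is exactly the problem described above.

```python
import re

# Assumed, incomplete mapping of German number words to integers.
GERMAN_NUMBERS = {
    "ein": 1, "eine": 1, "einen": 1, "zwei": 2, "drei": 3, "vier": 4,
    "fünf": 5, "sechs": 6, "sieben": 7, "acht": 8, "neun": 9, "zehn": 10,
    "elf": 11, "zwölf": 12, "dreizehn": 13, "vierzehn": 14, "fünfzehn": 15,
}

# Hypothetical patterns, tried in order. "total" is a cumulative count,
# "delta" is only the change since the previous bulletin.
DEATH_PATTERNS = [
    ("total", r"Todesfälle\b.*\binsgesamt (\w+)"),
    ("total", r"unverändert (\w+) Todesfälle"),
    ("delta", r"(\w+) weiteren? Todesf"),
]


def to_int(token):
    """Convert digits or a known German number word to int, else None."""
    token = token.lower()
    if token.isdigit():
        return int(token)
    return GERMAN_NUMBERS.get(token)


def parse_deaths(sentence):
    """Return ("total", n) or ("delta", n), or None if nothing is recognised."""
    for kind, pattern in DEATH_PATTERNS:
        match = re.search(pattern, sentence)
        if match:
            value = to_int(match.group(1))
            if value is not None:
                return kind, value
    return None


print(parse_deaths("Der Kanton Basel-Stadt verzeichnet unverändert dreizehn Todesfälle."))
```

A "delta" result can only be turned into a cumulative value if the previous bulletin was parsed correctly too, so one missed bulletin would corrupt every later value.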

baryluk commented 4 years ago

@tlorusso I agree. I think it is irrelevant where the person is from (different canton, different country, etc.). The virus doesn't care. We are not tracking the number of residents of canton X who are infected or hospitalized, but the number of infected or hospitalized people in canton X.

baryluk commented 4 years ago

@jb3-2 Do you have personal contact with the people who manage this data or the website? A table, or at least the numbers (with the date and time of the report) shown directly on https://www.coronavirus.bs.ch/, would help both us and the residents of Basel quite a bit.

zukunft commented 4 years ago

For "ncumul_hosp" and "ncumul_hosp_resident" this makes sense to me. For "deceased" I would assume that this is also hospital-centered. Based on last week's reactions, asking a short question about how this is handled does not seem to be a problem (yet). Based on the answers, we can add "ncumul_positive_foreign_canton" if needed. I can start asking and post the answers here.

jb3-2 commented 4 years ago

@baryluk No I don't have personal contact yet - I will continue to update the table manually every day until all relevant numbers can be scraped somehow.

@zukunft A new column "ncumul_positive_foreign_canton" is a good idea, but I would rather name it "ncumul_positive_non_resident" or so, because in BS we probably have people from France and Germany, not only from other cantons. What do you think?

zukunft commented 4 years ago

@jb3-2 Agreed, let's name it "ncumul_positive_non_resident"

abieler commented 4 years ago

Hi guys. I'll try some machine learning / NLP on those texts to see if we could parse them that way. Will have an update on my attempts by tomorrow. Cheers.

tlorusso commented 4 years ago

@jb3-2 @baryluk @metaodi @zukunft We've received feedback from the department of health of Basel-Stadt. They suggest that, to stay consistent, we report either total confirmed and total hospitalized, OR total confirmed residents and total hospitalized residents.

I'd say we go with totals and the additional columns for the Basel file only: ncumul_confirmed_non_resident, ninst_hosp_non_resident.

jb3-2 commented 4 years ago

OK I am starting to do this right now in a branch.

zukunft commented 4 years ago

ZG just confirmed that the "ncumul_conf" values are the residents. E.g. if a person resident in ZG is in hospital in TI, the case would be included in the ZG numbers. So the new columns make sense to avoid double counting. Same for FL. I will add more once I receive the answers in #238.

jb3-2 commented 4 years ago

I have added the two columns to the canton BS file, corrected the values, and added new values from the BS press releases, see this branch: https://github.com/openZH/covid_19/tree/Enhance-BS-data---issue-193

@metaodi Apparently the script that merges all canton files into the CH file does not yet work correctly when new columns are added: the CH file was inconsistent for a short time, and the BS data was wrong. I see that you have rolled back my changes (removed the two new columns) so that the CH file works again. Can I somehow help make sure nothing breaks while still keeping the new columns in the BS file?
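One way a merge step could tolerate canton-specific columns is to build the output header as the union of all input headers and leave missing cells empty. The sketch below is not the repository's merge script (which is not shown in this thread); the function name, the simplified column layout, and the sample values are made up for illustration.

```python
import csv
import io


def merge_csv_texts(csv_texts):
    """Merge CSV documents whose headers differ, keeping every column."""
    rows = []
    fieldnames = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        # Collect columns in first-seen order so shared columns stay in front.
        for name in reader.fieldnames or []:
            if name not in fieldnames:
                fieldnames.append(name)
        rows.extend(reader)
    out = io.StringIO()
    # restval fills the cells of columns a given canton file does not have.
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Toy inputs with invented values; only the BS file has the extra column.
zh_csv = "date,canton,ncumul_conf\n2020-04-01,ZH,100\n"
bs_csv = ("date,canton,ncumul_conf,ncumul_confirmed_non_resident\n"
          "2020-04-01,BS,50,20\n")

merged = merge_csv_texts([zh_csv, bs_csv])
print(merged)
```

With this approach, consumers of the merged file see an empty cell for cantons that do not report the extra columns, instead of shifted or dropped values.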

abieler commented 4 years ago

Hi all, not sure this is needed but managed to build a parser for the BS daily bulletin, including cases of "vierzehn" etc. happy to provide the code/model if you feel it could be helpful.

also happy to help on other fronts if someone wants to point me somewhere specific.

cheers and thanks for your work!

baryluk commented 4 years ago

@abieler Could you share the code? I don't think we will use much of it, but it would be nice to see what you've got. Thanks.

abieler commented 4 years ago

@baryluk you can take a look here: https://github.com/abieler/bs-bulletin-parser

If you think it could be useful, I'd be happy to convert it into a proper MR here.

cheers

jb3-2 commented 4 years ago

@abieler Wow that looks pretty concise, thank you very much! Have you checked whether it returns the same numbers for BS that I manually entered today? I'd be very happy if you could create a PR so that I have less manual work to do... One complication: There are two new columns at the end of the data file - they are created by subtraction of numbers delivered in the press release. Would that be possible to implement using your code base? https://github.com/openZH/covid_19/blob/master/fallzahlen_kanton_total_csv/COVID19_Fallzahlen_Kanton_BS_total.csv

abieler commented 4 years ago

@jb3-2 Added the two extra columns. The numbers for today do check out (it missed numcul_deceased though). These guys come up with some creative writing...

NUMCUL_CONF_RESIDENTS         : 718
DAILY_CONF                    : 27
NUMCUL_RELEASED               : 350
NUMCUL_HOSP                   : 119
NUMCUL_DECEASED               : 92
NUMCUL_CONF                   : 1127
NUMCUL_HOSP_RESIDENTS         : 95
NUMCUL_ICU                    : 17
NCUMUL_CONFIRMED_NON_RESIDENT : 409
NINST_HOSP_NON_RESIDENT       : 24

abieler commented 4 years ago

PR will have to wait for tomorrow though. my day job is keeping me busy for the rest of the day.. :)

jb3-2 commented 4 years ago

Wow, that is awesome!! Thank you so much, greatly appreciated. I hope I can learn a lot from your code.

baryluk commented 4 years ago

@abieler Pretty cool. Would be nice to integrate it. If you could push it into scrapers/scrape_bs/, with main driver still in executable file scrapers/scrape_bs.sh (with proper #! in the first line, just like now), that would be great.

The technical details of the NLP thing are above my head right now, but if it works then it works :)

jb3-2 commented 4 years ago

Yay, scraper runs pretty well now! Some slight problems remain:

See https://www.coronavirus.bs.ch/nm/2020-tagesbulletin-coronavirus-771-bestaetigte-faelle-im-kanton-basel-stadt-gd.html

I compared the scraper commit https://github.com/openZH/covid_19/commit/6b039956d564888eb226493d051b47f5c76e5f68 with my manual correction https://github.com/openZH/covid_19/commit/98db9d7a6976c4c200551ebad48adc6dbef7d3c1.

Thanks for all your efforts - amazing work!

abieler commented 4 years ago

@jb3-2 @baryluk I just submitted the PR https://github.com/openZH/covid_19/pull/440

happy to make changes if you guys deem necessary. (Especially what to print out as report etc. or the suggested scrapers/scrape_bs/ structure instead of my scrapers/utils/bs @baryluk ?)

The model got an update and should work better now. There is still an issue with NUMCUL_DECEASED: the sentencizer fails on some reports and as a result the number is not extracted. Also, I think I have a naming inconsistency compared to what you already have: NUMCUL_CONF_RESIDENTS in this scraper is what I think you guys have as NUMCUL_CONF. If you confirm, I'll change it accordingly on my end.

Right now the NLP model sits in the utils/nlp folder. This is probably not the best place for it, as it significantly increases the size of your repository (+3.8MB). Do you have a separate place to put larger chunks of data, such as GCS or S3?

cheers andre

baryluk commented 4 years ago

@abieler Looking into it now, and testing.

baryluk commented 4 years ago

BTW. https://github.com/openZH/covid_19/commit/7a91c50ac3b11f1f93d00f73770b5dc468993d71 added extraction of hospitalized and ICU on 2020-04-04. Nice. I somehow missed it, as it didn't reference this bug. PR https://github.com/openZH/covid_19/pull/422

Also, as of yesterday it is rather easy to extract the deceased numbers from the text, as they are now written as digits. I have code for this here: https://github.com/baryluk/covid_19/commit/3d3af25c5ee5a2bc79ba787f57fb232ceddbbf18

jb3-2 commented 4 years ago

The scraper now picks up much more data than before, but there's a wrong number for ncumul_hosp, see https://github.com/openZH/covid_19/commit/1b3a4846db272196e83a2ab1a347ab53430fd1a6 (I had to correct that manually). Also, the last two numbers in the CSV are still not filled in by the scraper. They need to be calculated from numbers that can be scraped, see my comment here. Any help greatly appreciated, thank you!
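The two derived columns are plain differences of scraped values. A minimal sketch, using the figures from the parser output posted earlier in this thread (1127 confirmed in total, 718 of them residents; 119 hospitalized, 95 of them residents); the function and parameter names are hypothetical:

```python
def derive_non_resident(conf_all, conf_residents, hosp_all, hosp_residents):
    """Compute the two BS-only columns from four scraped totals."""
    values = {
        "ncumul_confirmed_non_resident": conf_all - conf_residents,
        "ninst_hosp_non_resident": hosp_all - hosp_residents,
    }
    # A negative difference means the scraper mixed up totals and residents.
    negative = [name for name, value in values.items() if value < 0]
    if negative:
        raise ValueError("negative derived value for: " + ", ".join(negative))
    return values


derived = derive_non_resident(conf_all=1127, conf_residents=718,
                              hosp_all=119, hosp_residents=95)
print(derived)
```

The sanity check is there because a swapped pair of numbers is the most likely scraping mistake, and it would otherwise end up in the CSV unnoticed.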

baryluk commented 4 years ago

@jb3-2 Thanks for your feedback and spotting the issue. Will try to develop a fix quickly.

baryluk commented 4 years ago

@jb3-2 Your previous comments were very useful for fixing it! I had missed some of them before.

Does this make more sense for today:

$ ./scrape_bs.py 
BS
Downloading: https://www.gd.bs.ch/
Downloading: https://www.gd.bs.ch//nm/2020-tagesbulletin-coronavirus-834-bestaetigte-faelle-im-kanton-basel-stadt-gd.html
Scraped at: 2020-04-08T12:20:45.178344+02:00
Date and time: 8. April 2020, 10 Uhr
Confirmed cases: 834                    # ncumul_conf
Recovered: 535                          # ncumul_released
Hospitalized: 99                        # ncumul_hosp
ICU: 14                                 # ncumul_ICU
Deaths: 31                              # ncumul_deceased
Confirmed cases (residents): 834
Confirmed cases (non-residents): 459    # ncumul_confirmed_non_resident
Confirmed cases (all): 1293
Hospitalized (non-residents): 16        # ninst_hosp_non_resident
Hospitalized (residents): 83
$

Is this correct? Please double check me.

I noticed that ncumul_conf now only tracks residents in this repo? So it should be 834 in the CSV?

That is worrying, because ncumul_confirmed_non_resident (and ninst_hosp_non_resident) columns are not actually documented in the README.md.

Does one need to sum all the ncumul_conf values and ninst_hosp_non_resident to actually get the full number of confirmed cases at any given time?

Sorry, if this was asked somewhere else, I probably missed it.

But due to the split, people might actually be getting the wrong picture if they don't know about these columns!

jb3-2 commented 4 years ago

Wow this looks perfect, thank you so much!

@baryluk Can you please create a PR so your scraper can go live? Again, thanks for all your hard work, greatly appreciated!

baryluk commented 4 years ago

@jb3-2

ncumul_confirmed_non_resident and ninst_hosp_non_resident are only present in the BS data file. We wanted to add all infos that are contained in the press releases to the csv. Do you think they need more explanation?

I think yes, they should be explained in README.md.

* You can add `ncumul_confirmed_non_resident` and `ncumul_conf` to get the total number of positive cases that are tested in BS but some of them live in another canton or country, and **these cases are reported in the other canton** or country. So we don't want to double-count cases, that's why we report only positive cases that reside in the canton. [...]

This is all very messy.

How does one reliably compute the current total number of confirmed cases in Switzerland using cantonal data?

@baryluk Can you please create a PR so your scraper can go live?

I wish I could, but I don't understand the logic of these statistics yet.