baryluk closed this issue 4 years ago
If I understand correctly, the only info in the text that is not present in the csv is the number of positive tests from people that live outside of Basel-Stadt. Do you agree? Do other Cantons report this number, too? Unsure how this can be tracked:
BTW @tlorusso @metaodi I'm working on getting the Gesundheitsdepartement BS to publish the data in a table so it can be scraped more easily.
Today's version (translated from the German bulletin):
The Basel-Stadt Department of Health (Gesundheitsdepartement Basel-Stadt) reports, as of Friday, 27 March 2020, 10:00, a total of 534 positive cases among persons residing in the canton of Basel-Stadt, and one further death.
As of Friday, 27 March 2020, 10:00, there are a total of 534 positive cases among persons residing in the canton of Basel-Stadt. This is 29 more than the previous day. 191 of the 534 people who tested positive, more than a third, have recovered.
There has been one further death due to a Covid-19 infection in the canton of Basel-Stadt. The deceased patient, a woman, again belonged to the risk group (older than 65 with a pre-existing condition). The number of deaths in the canton of Basel-Stadt now totals thirteen (written out as "dreizehn" in the German original).
In addition to tests of canton residents, the canton of Basel-Stadt also carries out tests of suspected cases from other Swiss cantons and from neighbouring areas abroad. So far, the tests of 877 persons have come back positive (including the 534 Basel cases).
Currently a total of 76 persons are in hospital care in a Basel-Stadt hospital due to a Covid-19 infection. 57 of them are residents of the canton of Basel-Stadt. A total of eight persons require intensive care. The other patients are in the normal ward.
I am talking about the sentence:
"The number of deaths in the canton of Basel-Stadt now totals thirteen." (German original: "Die Zahl der Todesfälle im Kanton Basel-Stadt beträgt nunmehr insgesamt dreizehn.")
"residing (mit Wohnsitz) in the canton of Basel-Stadt, as well as three further deaths."
"58 Basel residents who have fallen ill are currently hospitalized due to an infection with Covid-19 (coronavirus)."
Is it even possible to scrape these numbers from the text? That would definitely be helpful, but I suppose it's complex.
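For the figures that appear as digits, a regular-expression pass is one option. The following is a minimal sketch, assuming the exact German wording of the 27 March bulletin quoted above; the pattern table and `extract_counts` are hypothetical names for illustration, not code from this repository, and any change in wording would make a pattern miss.

```python
import re

# German original of the 27 March 2020 bulletin paragraph quoted in this thread.
BULLETIN = (
    "Mit Stand Freitag, 27. März 2020, 10 Uhr, liegen insgesamt 534 positive "
    "Fälle von Personen mit Wohnsitz im Kanton Basel-Stadt vor. "
    "Dies sind 29 mehr als am Vortag. "
    "191 Personen der 534 positiv Getesteten und damit mehr als ein Drittel "
    "sind wieder genesen."
)

# Hypothetical patterns: each matches one phrasing seen in this bulletin only.
PATTERNS = {
    "ncumul_conf": re.compile(r"insgesamt (\d+) positive Fälle"),
    "daily_conf": re.compile(r"Dies sind (\d+) mehr als am Vortag"),
    "ncumul_released": re.compile(r"(\d+) Personen der \d+ positiv Getesteten"),
}


def extract_counts(text):
    """Return the fields whose pattern matched; unmatched fields are omitted."""
    counts = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            counts[field] = int(match.group(1))
    return counts


print(extract_counts(BULLETIN))
```

Omitting unmatched fields, rather than guessing a value, keeps it visible when a bulletin has changed its wording and a number needs to be entered manually.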
@andreasamsler @tlorusso Up to now I have only added the number of deceased people that lived in BS into the column ncumul_deceased in the sheet. Is that consistent with what other cantons do? Since the BS University Hospital also has cases from other cantons as well as probably people living in Germany or France, that should be the correct way, shouldn't it?
About the number of hospitalized people: I will go back through all the source links and check whether I have only added people living in Basel to the column ncumul_hosp. We don't want people who live outside of BS in the BS worksheet, because probably (?) these people are reported by their home canton - or aren't they?
@jb3-2 this last point is related to #187 and needs to be resolved.
When it comes to the number of hospitalized cases in the `ncumul_hosp` column, I really think we should take the number of patients who are currently hospitalized in a canton, independent of their residency. From a resources point of view every patient counts. And, as far as I can judge, double-reporting should not be too much of a risk here; I haven't yet seen other cantons besides BS that report figures this way. The reporting in other cantons seems to be mostly hospital-centered and not residency-centered. I guess in the case of BS this differentiation has to be made because the number of non-resident patients is higher than in other regions of Switzerland and might even include foreign patients. This could be similar for TI and GE, but I haven't seen any explicit statements or differentiations going in that direction there.
So my proposal is:
* total hospitalized -> `ncumul_hosp`
* additional column for BS residents in hospitals, in the BS file only (after the source column) -> `ncumul_hosp_resident`
* deceased -> here I'm undecided. Should the number report patients who died in hospitals in BS, or residents of BS who fell ill and died? I assume the reporting is hospital-centered in most cantons, which might be a reason to stick to the first definition.
* confirmed cases -> leave it as it is for BS (just residents); here the risk of double counting might indeed be higher.
We could add the `ncumul_positive_foreign_canton` column for confirmed cases of non-residents that you've suggested, @jb3-2? Just to the BS file, after the source column. It might not be the most important information, but we could include it for the sake of completeness. The alternative, to keep things simpler, would be to skip this info.
That being said, it might make sense to highlight in the Readme that definitions and the way cases are counted can differ among cantons. This is also because it might not be possible to get precise definitions from all cantons as quickly as we would need them.
wdyt @jb3-2 @metaodi @zukunft ?
Daily bulletins from BS continue to be a pain to scrape:
The Basel-Stadt Department of Health reports, as of Saturday, 28 March 2020, 10:00, a total of 573 positive cases among persons residing in the canton of Basel-Stadt.
As of Saturday, 28 March 2020, 10:00, there are a total of 573 positive cases among persons residing in the canton of Basel-Stadt. This is 39 more than the previous day. The canton of Basel-Stadt still records thirteen deaths (written out as "dreizehn" in the German original: "verzeichnet unverändert dreizehn Todesfälle"). 211 of the 573 people who tested positive, more than a third, have recovered.
Technically we could capture this "dreizehn" and so on, and do something with it, but it will probably break by Monday.
EDIT: Actually, it is impossible. The format changes too frequently, and sometimes the daily bulletin reports only a change from the previous bulletin, not an absolute number. :(
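To make the difficulty concrete, here is a minimal sketch of what capturing a spelled-out number could look like. It assumes only the two German phrasings quoted in this thread; the lookup table, the patterns, and the function names are hypothetical, and the table stops at twenty. It returns `None` as soon as the wording differs, which is exactly the fragility described above.

```python
import re

# Hypothetical lookup table for zero to twenty; compound numbers such as
# "einundzwanzig" are deliberately not handled in this sketch.
_WORDS = [
    "null", "eins", "zwei", "drei", "vier", "fünf", "sechs", "sieben",
    "acht", "neun", "zehn", "elf", "zwölf", "dreizehn", "vierzehn",
    "fünfzehn", "sechzehn", "siebzehn", "achtzehn", "neunzehn", "zwanzig",
]
GERMAN_NUMBER_WORDS = {word: value for value, word in enumerate(_WORDS)}

# One pattern per phrasing seen in the 27 and 28 March bulletins.
DEATH_PATTERNS = [
    re.compile(r"Zahl der Todesfälle[^.]*? insgesamt (\w+)"),
    re.compile(r"verzeichnet unverändert (\w+) Todesfälle"),
]


def parse_count(token):
    """Convert digits or a known German number word to int, else None."""
    token = token.lower()
    if token.isdecimal():
        return int(token)
    return GERMAN_NUMBER_WORDS.get(token)


def extract_deaths(text):
    """Return the cumulative death count, or None if no phrasing matched."""
    for pattern in DEATH_PATTERNS:
        match = pattern.search(text)
        if match:
            value = parse_count(match.group(1))
            if value is not None:
                return value
    return None


print(extract_deaths(
    "Die Zahl der Todesfälle im Kanton Basel-Stadt beträgt nunmehr insgesamt dreizehn."
))
```

A bulletin that reports only a change ("one further death") would still need the previous day's total, which this sketch does not attempt.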
@tlorusso I agree. I think it is irrelevant where the person is from (different canton, different country, etc.). The virus doesn't care. We are not tracking the number of residents of canton X who are infected or hospitalized, but the number of infected or hospitalized people in canton X.
@jb3-2 Do you have personal contact with the people who manage this data or the website? A table, or at least having the numbers (with the date and time of the report) directly on https://www.coronavirus.bs.ch/ , would help both us and the residents of Basel quite a bit.
For "ncumul_hosp" and "ncumul_hosp_resident" this makes sense to me. For "deceased" I would assume that this is also hospital-centered. Based on last week's reactions, asking a short question about how this is handled does not seem to be a problem (yet). Based on the answers, we can add "ncumul_positive_foreign_canton" if needed. I can start asking and post the answers here.
@baryluk No I don't have personal contact yet - I will continue to update the table manually every day until all relevant numbers can be scraped somehow.
@zukunft A new column "ncumul_positive_foreign_canton" is a good idea, but I would rather name it "ncumul_positive_non_resident" or so, because in BS we probably have people from France and Germany, not only from other cantons. What do you think?
@jb3-2 Agreed, let's name it "ncumul_positive_non_resident"
hi guys. i ll try some machine learning / nlp on those texts to see if we could parse them that way. will have an update on my attempts by tomorrow. cheers.
@jb3-2 @baryluk @metaodi @zukunft We've received feedback from the department of health of Basel-Stadt. They suggest we either report total confirmed / total hospitalized OR total confirmed residents and total hospitalized residents, to stay consistent.
I'd say we go with totals and the additional columns for the Basel file only: `ncumul_confirmed_non_resident`, `ninst_hosp_non_resident`.
OK I am starting to do this right now in a branch.
ZG just confirmed that "ncumul_conf" counts residents. E.g. if a person resident in ZG is in hospital in TI, the case would be included in the ZG numbers. So the new columns make sense to avoid double counting. Same for FL. I will add more once I receive the answers in #238.
I have added the two columns to the canton BS file, corrected the values, and added new values from the BS press releases, see this branch: https://github.com/openZH/covid_19/tree/Enhance-BS-data---issue-193
@metaodi Apparently the script that merges all canton files into the CH file does not work yet correctly when new columns are added, the CH file was inconsistent for a short amount of time, and BS data was wrong. I see that you have rolled back my changes (removed the two new columns) so that the CH file works again. Can I somehow help making sure nothing breaks while still keeping the new columns in the BS file?
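One way a merge step can tolerate canton-specific columns is to project every row onto a fixed shared schema. This is only a sketch under assumptions: it is not the repository's actual merge script, the column name `canton` and the `COMMON_COLUMNS` list are hypothetical, and the `XX` row is a placeholder rather than real data.

```python
import csv
import io

# Hypothetical shared schema; the real CH file has more columns.
COMMON_COLUMNS = ("date", "canton", "ncumul_conf", "ncumul_hosp")


def merge_canton_csvs(csv_texts, columns=COMMON_COLUMNS):
    """Concatenate canton CSVs, keeping only the shared columns.

    Extra columns (e.g. the BS-only ones) are dropped, and shared columns
    missing from a canton file are written as empty strings.
    """
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(columns), lineterminator="\n")
    writer.writeheader()
    for text in csv_texts:
        for row in csv.DictReader(io.StringIO(text)):
            writer.writerow({col: row.get(col, "") for col in columns})
    return out.getvalue()


# BS values taken from the 27 March bulletin quoted in this thread.
BS_CSV = (
    "date,canton,ncumul_conf,ncumul_hosp,"
    "ncumul_confirmed_non_resident,ninst_hosp_non_resident\n"
    "2020-03-27,BS,534,76,343,19\n"
)
# Placeholder canton with fewer columns; the value is not real data.
XX_CSV = "date,canton,ncumul_conf\n2020-03-27,XX,1\n"

print(merge_canton_csvs([BS_CSV, XX_CSV]))
```

With this approach the BS file can keep its two extra columns while the merged file stays rectangular.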
Hi all, not sure this is needed but managed to build a parser for the BS daily bulletin, including cases of "vierzehn" etc. happy to provide the code/model if you feel it could be helpful.
also happy to help on other fronts if someone wants to point me somewhere specific.
cheers and thanks for your work!
@abieler Could you share the code? I don't think we will use much of it, but it would be nice to see what you got. Thanks.
@baryluk you can take a look here: https://github.com/abieler/bs-bulletin-parser
if you think it could be useful I d be happy to convert to a proper MR here.
cheers
@abieler Wow that looks pretty concise, thank you very much! Have you checked whether it returns the same numbers for BS that I manually entered today? I'd be very happy if you could create a PR so that I have less manual work to do... One complication: There are two new columns at the end of the data file - they are created by subtraction of numbers delivered in the press release. Would that be possible to implement using your code base? https://github.com/openZH/covid_19/blob/master/fallzahlen_kanton_total_csv/COVID19_Fallzahlen_Kanton_BS_total.csv
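The subtraction behind the two extra columns can be sketched as follows. This assumes the bulletin provides both the overall and the residents-only figures; the function name and argument names are hypothetical, and only the two output column names come from this thread.

```python
def derive_non_resident_columns(conf_all, conf_residents, hosp_all, hosp_residents):
    """Derive the two BS-only columns by subtracting residents from totals."""
    derived = {
        "ncumul_confirmed_non_resident": conf_all - conf_residents,
        "ninst_hosp_non_resident": hosp_all - hosp_residents,
    }
    for name, value in derived.items():
        # A negative result means the scraped inputs were mixed up.
        if value < 0:
            raise ValueError(f"{name} would be negative: {value}")
    return derived


# Figures from the 27 March bulletin: 877 positive tests overall, 534 of
# them residents; 76 hospitalized overall, 57 of them residents.
print(derive_non_resident_columns(877, 534, 76, 57))
```

Raising on a negative value is a cheap sanity check against the scraper swapping the resident and total figures.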
@jb3-2 added the two extra columns, the numbers for today do check out (it missed `numcul_deceased` though). these guys come up with some creative writing...
NUMCUL_CONF_RESIDENTS : 718
DAILY_CONF : 27
NUMCUL_RELEASED : 350
NUMCUL_HOSP : 119
NUMCUL_DECEASED : 92
NUMCUL_CONF : 1127
NUMCUL_HOSP_RESIDENTS : 95
NUMCUL_ICU : 17
NCUMUL_CONFIRMED_NON_RESIDENT : 409
NINST_HOSP_NON_RESIDENT : 24
PR will have to wait for tomorrow though. my day job is keeping me busy for the rest of the day.. :)
Wow, that is awesome!! Thank you so much, greatly appreciated. I hope I can learn a lot from your code.
@abieler Pretty cool. Would be nice to integrate it. If you could push it into `scrapers/scrape_bs/`, with the main driver still in the executable file `scrapers/scrape_bs.sh` (with a proper `#!` in the first line, just like now), that would be great.
The technical details of the nlp thing are above my head right now, but if it works then it works :)
Yay, scraper runs pretty well now! Some slight problems remain:
I compared the scraper commit https://github.com/openZH/covid_19/commit/6b039956d564888eb226493d051b47f5c76e5f68 with my manual correction https://github.com/openZH/covid_19/commit/98db9d7a6976c4c200551ebad48adc6dbef7d3c1.
Thanks for all your efforts - amazing work!
@jb3-2 @baryluk I just submitted the PR https://github.com/openZH/covid_19/pull/440
happy to make changes if you guys deem necessary. (Especially what to print out as report etc., or the suggested `scrapers/scrape_bs/` structure instead of my `scrapers/utils/bs`, @baryluk?)
The model got an update and should work better now.
There is still an issue with NUMCUL_DECEASED, the sentencizer fails on some reports and as a result the number is not extracted.
Also I think I have a naming inconsistency when compared to what you already have: `NUMCUL_CONF_RESIDENTS` in this scraper is what I think you guys have as `NUMCUL_CONF`. If you confirm, I'll change accordingly on my end.
Right now the nlp model sits in the `utils/nlp` folder; this is probably not the best place for it, as it significantly increases the size of your repository (+3.8MB). Do you have a separate place to put larger chunks of data, such as GCS or S3?
cheers andre
@abieler Looking into it now, and testing.
BTW. https://github.com/openZH/covid_19/commit/7a91c50ac3b11f1f93d00f73770b5dc468993d71 added extraction of hospitalized and ICU on 2020-04-04. Nice. I somehow missed it, as it didn't reference this bug. PR https://github.com/openZH/covid_19/pull/422
Also, as of yesterday it is rather easy to extract the deceased numbers from the text, as they are now written as digits. I have code for this here: https://github.com/baryluk/covid_19/commit/3d3af25c5ee5a2bc79ba787f57fb232ceddbbf18
The scraper now picks up much more data than before, but there's a wrong number for ncumul_hosp, see https://github.com/openZH/covid_19/commit/1b3a4846db272196e83a2ab1a347ab53430fd1a6 (I had to manually correct that). Also, the last two numbers in the csv are still not filled in by the scraper. They need to be calculated from numbers that can be scraped, see my comment here. Any help greatly appreciated, thank you!
@jb3-2 Thanks for your feedback and spotting the issue. Will try to develop a fix quickly.
@jb3-2 Your previous comments were very useful for fixing it! I did miss some of them before.
Does this make more sense for today:
$ ./scrape_bs.py
BS
Downloading: https://www.gd.bs.ch/
Downloading: https://www.gd.bs.ch//nm/2020-tagesbulletin-coronavirus-834-bestaetigte-faelle-im-kanton-basel-stadt-gd.html
Scraped at: 2020-04-08T12:20:45.178344+02:00
Date and time: 8. April 2020, 10 Uhr
Confirmed cases: 834 # ncumul_conf
Recovered: 535 # ncumul_released
Hospitalized: 99 # ncumul_hosp
ICU: 14 # ncumul_ICU
Deaths: 31 # ncumul_deceased
Confirmed cases (residents): 834
Confirmed cases (non-residents): 459 # ncumul_confirmed_non_resident
Confirmed cases (all): 1293
Hospitalized (non-residents): 16 # ninst_hosp_non_resident
Hospitalized (residents): 83
$
Is this correct? Please double check me.
I noticed that `ncumul_conf` now only tracks residents in this repo? So it should be 834 in the CSV?
That is worrying, because the `ncumul_confirmed_non_resident` (and `ninst_hosp_non_resident`) columns are not actually documented in the `README.md`.
Does one need to sum all `ncumul_conf` and `ninst_hosp_non_resident` to actually get the full number of confirmed cases at any given instant?
Sorry if this was asked somewhere else, I probably missed it.
But due to the split, people might actually be getting the wrong picture if they don't know about these columns!
Wow this looks perfect, thank you so much!
`ncumul_conf` only tracks residents, as noted in the readme.md:
`ncumul_confirmed_non_resident` and `ninst_hosp_non_resident` are only present in the BS data file. We wanted to add all infos that are contained in the press releases to the csv. Do you think they need more explanation?
You can add `ncumul_confirmed_non_resident` and `ncumul_conf` to get the total number of positive cases that are tested in BS, but some of them live in another canton or country, and these cases are reported in the other canton or country. So we don't want to double-count cases; that's why we report only positive cases that reside in the canton. For the number of people in hospitals it's different, because here we want an overview of how many patients there are vs. hospital capacity, so the canton of residence does not matter too much. I hope my explanations clear it up a bit...?
@baryluk Can you please create a PR so your scraper can go live? Again, thanks for all your hard work, greatly appreciated!
@jb3-2
ncumul_confirmed_non_resident and ninst_hosp_non_resident are only present in the BS data file. We wanted to add all infos that are contained in the press releases to the csv. Do you think they need more explanation?
I think yes, they should be explained in `README.md`.
* You can add `ncumul_confirmed_non_resident` and `ncumul_conf` to get the total number of positive cases that are tested in BS but some of them live in another canton or country, and **these cases are reported in the other canton** or country. So we don't want to double-count cases, that's why we report only positive cases that reside in the canton. [...]
This is all very messy.
How does one reliably compute the current total number of confirmed cases in Switzerland using the cantonal data?
@baryluk Can please create a PR so your scraper can go live?
I wish I could, but I don't understand the logic of these statistics yet.
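Under the residents-only definition described above, one reading is that the national figure is the sum of each canton's most recent `ncumul_conf`, with `ncumul_confirmed_non_resident` left out so nobody is counted twice. A minimal sketch of that reading follows; it assumes every canton really reports residents only (so far confirmed in this thread only for BS, ZG and FL), the function name and the tuple layout are hypothetical, and the `XX` rows are placeholders rather than real data.

```python
def national_confirmed_total(rows):
    """Sum the latest non-missing ncumul_conf of every canton.

    rows: iterable of (iso_date, canton, ncumul_conf) tuples, where
    ncumul_conf is an int or None when a canton has not reported yet.
    """
    latest = {}
    for date, canton, ncumul_conf in rows:
        if ncumul_conf is None:
            continue  # fall back to the canton's previous reported value
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        if canton not in latest or date > latest[canton][0]:
            latest[canton] = (date, ncumul_conf)
    return sum(value for _, value in latest.values())


rows = [
    ("2020-03-27", "BS", 534),   # from the 27 March bulletin
    ("2020-03-28", "BS", 573),   # from the 28 March bulletin
    ("2020-03-27", "XX", 10),    # placeholder canton, not real data
    ("2020-03-28", "XX", None),  # placeholder: no report yet for this day
]
print(national_confirmed_total(rows))
```

If some canton turns out to count by hospital rather than by residence, this sum would double-count, which is why the per-canton definitions need to be documented.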
There is some extra information in the daily bulletins now:
https://www.gd.bs.ch/nm/2020-tagesbulletin-coronavirus-466-bestaetigte-faelle-im-kanton-basel-stadt-gd.html
I don't think adding it now would be reliable (free-form text, including numbers as words), but some of it could be done, if possible.
This is a tracking bug.