Closed konradkalemba closed 4 years ago
For me, scraping the official data is a good idea. But does it make sense to scrape it just to recreate the same map? I would stay with the official MZ Twitter account.
I like the city-level data more; it's more readable.
You would also need to prepare the same map as the one on gov.pl, or geocode locations by place name.
There are cities specified in some cases; in others there are "powiaty".
Sometimes a powiat has the same name as a city; these are "miasta na prawach powiatu" (cities with powiat rights). See https://pl.wikipedia.org/wiki/Lista_powiat%C3%B3w_w_Polsce or the TERYT database (http://eteryt.stat.gov.pl/eTeryt/rejestr_teryt/udostepnianie_danych/baza_teryt/uzytkownicy_indywidualni/wyszukiwanie/wyszukiwanie.aspx?contrast=default)
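A minimal sketch of how scraped location names could be disambiguated, assuming a lookup set built from the TERYT database; the `CITY_POWIATS` set below is a hypothetical, incomplete example:

```python
# Hypothetical, abbreviated set of "miasta na prawach powiatu";
# the full list would come from the TERYT database linked above.
CITY_POWIATS = {"Warszawa", "Zielona Góra", "Gdynia"}

def classify(name: str) -> str:
    """Classify a scraped location name as a land powiat,
    a city with powiat rights, or an ordinary city."""
    if name.startswith("powiat "):
        return "powiat"
    if name in CITY_POWIATS:
        return "miasto na prawach powiatu"
    return "city"

print(classify("powiat poznański"))  # powiat
print(classify("Zielona Góra"))      # miasto na prawach powiatu
```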
@mhajder There is also the option of staying with manually updated data, but we would have to create a team responsible for updating it, because right now I'm not available all the time. With more people, delays in data updates would be shorter.
@mulawamichal I see - that's good, we wouldn't have to deal with showing "powiat" on map.
Both approaches involve trade-offs:
- With manual updates someone would always have to be online, which isn't a problem with a big team; the problem is assembling such a team.
- With automatic scraping we would lose the source of individual cases, and of course implementing the scraper takes some time.
@konradkalemba I can help with adding data.
Implementing such a scraper is very simple; a basic Python script run from cron would do. That said, page scraping is not very ethical, and it generates a lot of traffic.
@mhajder I know it isn't the best way ethics-wise, but it wouldn't generate much traffic. Running the script every 5 minutes wouldn't hurt the server very much.
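For reference, a 5-minute schedule in cron could look like the entry below; the script path and log location are hypothetical:

```shell
# Run the scraper every 5 minutes and append output to a log file.
# /opt/covid/scraper.py and /var/log/covid-scraper.log are placeholder paths.
*/5 * * * * /usr/bin/python3 /opt/covid/scraper.py >> /var/log/covid-scraper.log 2>&1
```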
Another problem with this data source is that I'm not sure whether it's updated regularly.
Okay, for the time being we are staying with the manually updated data. The official MZ website was outdated for at least an hour after the latest confirmation.
Probably changes are made only during working hours 😄
There is another problem: MZ's Twitter no longer specifies the cities where new cases are, only voivodeships...
geoportal now has "koronawirus" layer: https://mapy.geoportal.gov.pl/imap/Imgp_2.html?locale=en&gui=new&sessionID=4955220
But it is extremely difficult to scrape it.
For gov.pl, all you need is:

```python
import json

import requests
from bs4 import BeautifulSoup

# Fetch the official case list page and pull the embedded JSON
# out of the element with id "registerData".
response = requests.get('https://www.gov.pl/web/koronawirus/wykaz-zarazen-koronawirusem-sars-cov-2')
soup = BeautifulSoup(response.text, 'html.parser')
json_data = json.loads(soup.find(id='registerData').text)
print(json_data['parsedData'])
```
@mulawamichal @mhajder Guys, we have a big problem. Their official Twitter account now lists only voivodeships, as I wrote above; I thought their website might be more precise... but they have limited both the map and the table to voivodeships as well.
@konradkalemba Gov.pl also now only provides voivodeships.
@mhajder Yes... I think we have to do the same, because tracking down where every new case is would be very time-consuming.
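If we keep our existing city-level data, rolling it up to voivodeship totals could look like the sketch below; the `CITY_TO_VOIVODESHIP` mapping and the sample case data are hypothetical:

```python
from collections import Counter

# Hypothetical, abbreviated mapping; a real one would cover
# every city/powiat that appears in our historical data.
CITY_TO_VOIVODESHIP = {
    "Zielona Góra": "lubuskie",
    "Warszawa": "mazowieckie",
}

# Sample (city, new cases) records as they might come from our data.
cases = [("Zielona Góra", 1), ("Warszawa", 3), ("Warszawa", 2)]

# Sum city-level counts into per-voivodeship totals.
totals = Counter()
for city, count in cases:
    totals[CITY_TO_VOIVODESHIP[city]] += count

print(dict(totals))  # {'lubuskie': 1, 'mazowieckie': 5}
```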
Also correct the case of the patient who recovered; as written on Twitter, this was the first patient, so it is from Zielona Góra.
Hi all!
Currently the data is updated manually from the MZ Twitter account. In the long run, however, this approach is not sustainable.
I found the official website - https://www.gov.pl/web/koronawirus/wykaz-zarazen-koronawirusem-sars-cov-2 - from which we could fetch the data automatically. Their data is a bit inconsistent though: there are cities specified in some cases, in others there are "powiaty".
There is one more source we scrape the data from - https://docs.google.com/spreadsheets/d/1ierEhD6gcq51HAm433knjnVwey4ZE5DCnu1bW7PRG3E/htmlview?usp=sharing&sle=true
Any thoughts?