kforeman / covid19-scraper

MIT License

Create parser for https://mhlw-gis.maps.arcgis.com/apps/opsdashboard/index.html#/0c5d0502bbb54f9a8dddebca003631b8 #54

Open szelenka opened 4 years ago

szelenka commented 4 years ago

Create a web scraper for url:

https://mhlw-gis.maps.arcgis.com/apps/opsdashboard/index.html#/0c5d0502bbb54f9a8dddebca003631b8

This Issue is to create a parser to run daily via Prefect.

| Country | Organization | data_type | nid |
| --- | --- | --- | --- |
| Japan | Ministry of Health, Labor and Welfare | curation | |

Preliminary data available

dixitaayush8 commented 4 years ago

I'll do this.

dixitaayush8 commented 4 years ago

This is a dynamic JavaScript site: the page takes a while to load and render its data, and the div tags keep changing until loading finishes. The requests library only fetches the initial HTML and does not execute JavaScript, so for this page it returned just the "loading" divs.

To solve this, I inspected the XHR calls made as the page loaded to determine where the site queries its data from. I found the endpoint and ran the corresponding query to obtain all the case data, then verified the returned data against what the dashboard displays. This is the URL the site queries, so I'm scraping it: https://services8.arcgis.com/JdxivnCyd1rvJTrY/ArcGIS/rest/services/covid19_list_csv_EnglishView/FeatureServer/0/query?where=1%3D1&objectIds=&time=&geometry=&geometryType=esriGeometryEnvelope&inSR=&spatialRel=esriSpatialRelIntersects&resultType=none&distance=0.0&units=esriSRUnit_Meter&returnGeodetic=false&outFields=&returnGeometry=true&featureEncoding=esriDefault&multipatchOption=xyFootprint&maxAllowableOffset=&geometryPrecision=&outSR=&datumTransformation=&applyVCSProjection=false&returnIdsOnly=true&returnUniqueIdsOnly=false&returnCountOnly=false&returnExtentOnly=false&returnQueryGeometry=false&returnDistinctValues=false&cacheHint=false&orderByFields=&groupByFieldsForStatistics=&outStatistics=&having=&resultOffset=&resultRecordCount=&returnZ=false&returnM=false&returnExceededLimitFeatures=true&quantizationParameters=&sqlFormat=standard&f=html&token=.
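For illustration, here is a minimal sketch of querying that same FeatureServer layer for JSON instead of HTML. It is not the parser that was written for this issue. It assumes the layer accepts the standard ArcGIS REST query parameters `outFields=*`, `f=json`, `resultOffset`, and `resultRecordCount`, and that the response carries a `features` list plus an `exceededTransferLimit` flag when more pages remain. The helper names `build_query_url`, `extract_rows`, and `fetch_all_rows` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint path as posted in the comment above (without its query string).
QUERY_URL = (
    "https://services8.arcgis.com/JdxivnCyd1rvJTrY/ArcGIS/rest/services/"
    "covid19_list_csv_EnglishView/FeatureServer/0/query"
)


def build_query_url(offset=0, page_size=1000):
    """Build a query URL that asks for every field of every record as JSON."""
    params = {
        "where": "1=1",            # match all records
        "outFields": "*",          # assumption: all fields are wanted
        "returnGeometry": "false",
        "resultOffset": offset,    # assumption: the layer supports pagination
        "resultRecordCount": page_size,
        "f": "json",
    }
    return QUERY_URL + "?" + urlencode(params)


def extract_rows(payload):
    """Return the attribute dict of each feature in a decoded query response."""
    return [feature["attributes"] for feature in payload.get("features", [])]


def fetch_all_rows(page_size=1000):
    """Page through the layer until the server stops reporting more records."""
    rows, offset = [], 0
    while True:
        with urlopen(build_query_url(offset, page_size), timeout=30) as resp:
            payload = json.load(resp)
        page = extract_rows(payload)
        rows.extend(page)
        if not page or not payload.get("exceededTransferLimit"):
            return rows
        offset += len(page)
```

Calling `fetch_all_rows()` performs the network requests; `build_query_url` and `extract_rows` can be exercised offline against a saved response.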

I can now use BeautifulSoup and the requests library to obtain the updated number of total cases and the cases by subnational unit, age, region, sex, and date from that query URL. 👍
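As a rough sketch of the breakdowns mentioned above, the snippet below counts case records per value of a chosen field. It assumes each case has already been parsed into a dict. The field names `Prefecture` and `Sex`, the sample rows, and the helper `count_by` are hypothetical and would need to be replaced with the layer's actual field names.

```python
from collections import Counter


def count_by(rows, field):
    """Count rows per distinct value of `field`; missing or None values count as 'unknown'."""
    return Counter(
        row.get(field) if row.get(field) is not None else "unknown"
        for row in rows
    )


# Hypothetical sample records; real field names come from the service.
rows = [
    {"Prefecture": "Tokyo", "Sex": "F"},
    {"Prefecture": "Tokyo", "Sex": "M"},
    {"Prefecture": "Osaka", "Sex": None},
]

total_cases = len(rows)                      # one record per case
by_prefecture = count_by(rows, "Prefecture")
by_sex = count_by(rows, "Sex")
```

The same helper can be reused for age, region, or date by passing a different field name.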