mattkerlogue / google-covid-mobility-scrape

Script for scraping Google's COVID19 Community Mobility Reports [ARCHIVED]
MIT License

Invoke error for national/regional URL input #5

Open matt-dray opened 4 years ago

matt-dray commented 4 years ago

Problem

You can, for example, pass a national URL to get_subnational_data() and no error is raised. The function still extracts data, but the region column is filled with Mobility Report en.pdf, because that variable is populated from a fixed str_split() index.

Example

Passing the GB PDF to get_subregion_data().

get_subregion_data("https://www.gstatic.com/covid19/mobility/2020-04-05_GB_Mobility_Report_en.pdf")
## A tibble: 900 x 6
#   date       country region                 location      entity         value
#   <chr>      <chr>   <chr>                  <chr>         <chr>          <dbl>
# 1 2020-04-05 GB      Mobility Report en.pdf Aberdeen City retail_recr   -0.84 
# ...

Solution

Detect whether the input is the path to a national or a regional file, and raise an error if it does not match what the function expects. This could be based on the number of str_split() elements in the file name, but that depends on the URL format staying consistent.

library(stringr)  # provides str_split()
length(str_split("2020-04-05_US_Alabama_Mobility_Report_en.pdf", "_")[[1]])  # 6 elements
length(str_split("2020-04-05_GB_Mobility_Report_en.pdf", "_")[[1]])  # 5 elements

Or perhaps there's an element in the PDFs themselves that can help identify whether it's national or subnational.
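
A rough sketch of the filename-based check, in case it helps. The names detect_report_level() and check_subnational_url() are hypothetical and not part of the repo. It uses base R strsplit() rather than str_split() so the snippet has no package dependencies. Instead of counting elements, it looks at where the "Mobility" token falls, on the assumption that a multi-word region name would also be separated by underscores and so change the element count.

```r
# Hypothetical helper (not in the repo): classify a Mobility Report URL as
# "national" or "subnational" from its file name alone.
detect_report_level <- function(url) {
  # e.g. "2020-04-05_US_Alabama_Mobility_Report_en.pdf"
  parts <- strsplit(basename(url), "_", fixed = TRUE)[[1]]

  # Find the fixed "Mobility" token instead of counting elements, so that
  # region names spanning several elements (an assumption) are tolerated.
  mobility_pos <- match("Mobility", parts)
  if (is.na(mobility_pos) || mobility_pos < 3) {
    stop("Unrecognised Mobility Report file name: ", basename(url),
         call. = FALSE)
  }

  # date, country, "Mobility" => national; anything in between => a region.
  if (mobility_pos == 3) "national" else "subnational"
}

# Hypothetical guard that a subnational extraction function could call first.
check_subnational_url <- function(url) {
  if (detect_report_level(url) != "subnational") {
    stop("Expected a subnational report URL but got a national one: ", url,
         call. = FALSE)
  }
  invisible(url)
}
```

With this in place, passing the GB URL from the example above to check_subnational_url() would stop with an error instead of silently returning a tibble with a wrong region column. It still relies on the URL format staying consistent, so it shares the caveat noted above.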

Risk

Minimal. Perhaps only a problem if a third party uses the function incorrectly.