The goal of scrappy is to provide simple functions to scrape data from different websites for academic purposes.
You can install the released version of scrappy from CRAN with:
```r
install.packages("scrappy")
```
And the development version from GitHub with:
```r
# install.packages("devtools")
devtools::install_github("villegar/scrappy")
```
NOTE: To run the following examples on your computer, you need to download and install Mozilla Firefox (https://www.mozilla.org/en-GB/firefox/new/). Alternatively, you can replace the value of `browser` in the call to `RSelenium::rsDriver`.
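As a minimal sketch of that alternative, the session below is started with Chrome instead of Firefox. It assumes Google Chrome is installed locally and that a matching driver can be set up by `RSelenium::rsDriver`; the port number is simply the one used in the other examples.

```r
# Assumption: Google Chrome is installed on this machine.
# "chrome" is one of the values accepted by the `browser` argument.
rD <- RSelenium::rsDriver(browser = "chrome", port = 4549L, verbose = FALSE)

# ... call any scrappy function here, passing rD$client as the client ...

# Stop server
rD$server$stop()
```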
Extract weather data from the Network for Environment and Weather Applications (NEWA) at Cornell University (website: http://newa.cornell.edu):
```r
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4549L, verbose = FALSE)
# Call scrappy
out <- scrappy::newa_nrcc(
  client = rD$client,
  year = 2020,
  month = 12, # December
  station = "gbe", # Geneva (Bejo) station
  save_file = FALSE # Don't save output to a CSV file
)
# Stop server
rD$server$stop()
```
Partial output from the previous example:
Date/Time | Air Temp (℉) | Precip (inches) | Leaf Wetness (minutes) | RH (%) | Wind Spd (mph) | Wind Dir (degrees) | Solar Rad (langleys) | Dewpoint (℉) | Station |
---|---|---|---|---|---|---|---|---|---|
12/31/2020 23:00 EST | 33.1 | 0 | 0 | 82 | 2.8 | 264 | 0 | 28 | gbe |
12/31/2020 22:00 EST | 33.0 | 0 | 0 | 80 | 3.3 | 250 | 0 | 28 | gbe |
12/31/2020 21:00 EST | 32.8 | 0 | 0 | 81 | 2.6 | 261 | 0 | 28 | gbe |
12/31/2020 20:00 EST | 32.5 | 0 | 0 | 84 | 1.7 | 277 | 0 | 28 | gbe |
12/31/2020 19:00 EST | 32.9 | 0 | 0 | 81 | 2.1 | 279 | 0 | 28 | gbe |
12/31/2020 18:00 EST | 33.3 | 0 | 0 | 79 | 3.0 | 272 | 0 | 28 | gbe |
12/31/2020 17:00 EST | 33.5 | 0 | 0 | 78 | 3.9 | 274 | 1 | 27 | gbe |
12/31/2020 16:00 EST | 34.1 | 0 | 0 | 74 | 4.9 | 272 | 7 | 27 | gbe |
12/31/2020 15:00 EST | 33.8 | 0 | 0 | 72 | 7.1 | 277 | 8 | 26 | gbe |
12/31/2020 14:00 EST | 34.4 | 0 | 0 | 70 | 7.9 | 276 | 13 | 26 | gbe |
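The timestamps and temperatures above are in US conventions, so a common next step is to parse the date-time strings and convert Fahrenheit to Celsius. The sketch below does this on a small hand-built data frame that copies three rows of the table; the column names `date_time` and `air_temp_f` are illustrative assumptions and may differ from the names scrappy returns.

```r
# Hypothetical data frame mimicking three rows of the table above;
# the column names are assumptions, not necessarily those used by scrappy.
obs <- data.frame(
  date_time = c(
    "12/31/2020 23:00 EST",
    "12/31/2020 22:00 EST",
    "12/31/2020 21:00 EST"
  ),
  air_temp_f = c(33.1, 33.0, 32.8),
  stringsAsFactors = FALSE
)

# Drop the trailing time zone label, then parse at a fixed UTC-5 offset
# ("Etc/GMT+5" is the POSIX name for UTC-5, i.e. Eastern Standard Time)
obs$date_time <- as.POSIXct(
  sub(" EST$", "", obs$date_time),
  format = "%m/%d/%Y %H:%M",
  tz = "Etc/GMT+5"
)

# Convert air temperature from Fahrenheit to Celsius
obs$air_temp_c <- round((obs$air_temp_f - 32) * 5 / 9, 2)
```

A fixed offset is used here because the table labels every record as EST; data spanning daylight saving time may need a different choice.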
Extract the reviews for Sefton Park in Liverpool (only the 20 most recent):
```r
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4549L, verbose = FALSE)
# Call scrappy
out <- scrappy::google_maps(
  client = rD$client,
  name = "Sefton Park",
  max_reviews = 20
)
# Stop server
rD$server$stop()
```
Output after removing the original authors' names and the URLs of their profiles:
id | author | author_url | comment | rating | locality | total_reviews | date_relative | date_absolute | date_downloaded |
---|---|---|---|---|---|---|---|---|---|
ChZDSUhNMG9nS0VJQ0FnSURWdG9XSVJBEAE | Author 1 | | NA | 5 | | 2 | 5 days ago | 2023-12-13 15:17:39 GMT | 2023-12-18 15:17:39 |
ChZDSUhNMG9nS0VJQ0FnSURWcHZhNkJBEAE | Author 2 | | We had a baby shower in the cricket club which was great. Lovely park and good place to walk around in. | 5 | | 26 | 5 days ago | 2023-12-13 15:17:39 GMT | 2023-12-18 15:17:39 |
ChZDSUhNMG9nS0VJQ0FnSUNWNzVhYUtREAE | Author 3 | | Great place to take the kids | 5 | Local Guide | 22 | a week ago | 2023-12-11 15:17:39 GMT | 2023-12-18 15:17:39 |
ChZDSUhNMG9nS0VJQ0FnSUR5OTZ1RkJ3EAE | Author 4 | | This is such a lovely place to walk and chill out by the lakes. | 5 | Local Guide | 381 | a week ago | 2023-12-11 15:17:39 GMT | 2023-12-18 15:17:39 |
ChZDSUhNMG9nS0VJQ0FnSUNWN2VuT2VREAE | Author 5 | | NA | 5 | | 9 | a week ago | 2023-12-11 15:17:39 GMT | 2023-12-18 15:17:39 |
ChdDSUhNMG9nS0VJQ0FnSUNWaHJXTGdRRRAB | Author 6 | | Very clean. Lovely area. | 5 | | 26 | a week ago | 2023-12-11 15:17:39 GMT | 2023-12-18 15:17:39 |
ChdDSUhNMG9nS0VJQ0FnSUNWb3J5TGhBRRAB | Author 7 | | NA | 5 | Local Guide | 104 | a week ago | 2023-12-11 15:17:39 GMT | 2023-12-18 15:17:39 |
ChdDSUhNMG9nS0VJQ0FnSURsLS1UaHR3RRAB | Author 8 | | NA | 5 | | 40 | 2 weeks ago | 2023-12-04 15:17:39 GMT | 2023-12-18 15:17:39 |
ChdDSUhNMG9nS0VJQ0FnSURsM1kzWndnRRAB | Author 9 | | Beautiful histoical place and natural environment. | 5 | Local Guide | 21 | 2 weeks ago | 2023-12-04 15:17:39 GMT | 2023-12-18 15:17:39 |
ChZDSUhNMG9nS0VJQ0FnSURsaHNXbEhnEAE | Author 10 | | A park for everyone | 5 | Local Guide | 34 | 2 weeks ago | 2023-12-04 15:17:39 GMT | 2023-12-18 15:17:39 |
ChdDSUhNMG9nS0VJQ0FnSURsbW9LYWxnRRAB | Author 11 | | Beautiful park very big and lots of space | 5 | Local Guide | 10 | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChZDSUhNMG9nS0VJQ0FnSURsd3RydUdBEAE | Author 12 | | NA | 5 | | 8 | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChdDSUhNMG9nS0VJQ0FnSURsdktUeS13RRAB | Author 13 | | NA | 5 | Local Guide | 31 | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChZDSUhNMG9nS0VJQ0FnSURsdExxUmZBEAE | Author 14 | | NA | 4 | NA | NA | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChdDSUhNMG9nS0VJQ0FnSURsdUpxc2tRRRAB | Author 15 | | NA | 4 | | 1 | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChdDSUhNMG9nS0VJQ0FnSUNGbmNYRmh3RRAB | Author 16 | | Lovely place for a walk | 5 | Local Guide | 159 | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChdDSUhNMG9nS0VJQ0FnSUNscDd6VWxBRRAB | Author 17 | | NA | 5 | | 22 | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChdDSUhNMG9nS0VJQ0FnSUNsbV9QNGt3RRAB | Author 18 | | NA | 5 | Local Guide | 152 | 2 weeks ago | 2023-12-04 15:17:43 GMT | 2023-12-18 15:17:43 |
ChdDSUhNMG9nS0VJQ0FnSUNsNjZXVGdRRRAB | Author 19 | | Never gets old! | 5 | Local Guide | 101 | 3 weeks ago | 2023-11-27 15:17:43 GMT | 2023-12-18 15:17:43 |
ChdDSUhNMG9nS0VJQ0FnSUNsNmVLNm1RRRAB | Author 20 | | NA | 3 | NA | NA | 3 weeks ago | 2023-11-27 15:17:43 GMT | 2023-12-18 15:17:43 |
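Many reviews carry a rating but no text, so it can help to separate the two before analysis. The sketch below works on a small hand-built data frame that copies four rows of the table and reuses the `author`, `comment` and `rating` column names from its header; it assumes that missing comments are stored as `NA` rather than as the string "NA".

```r
# Hypothetical subset of the output above (rows for Authors 1, 3, 6 and 14);
# assumes missing comments are real NA values.
reviews <- data.frame(
  author = c("Author 1", "Author 3", "Author 6", "Author 14"),
  comment = c(NA, "Great place to take the kids", "Very clean. Lovely area.", NA),
  rating = c(5, 5, 5, 4),
  stringsAsFactors = FALSE
)

# Keep only the reviews that include a written comment
with_text <- reviews[!is.na(reviews$comment), ]

# Summarise the ratings across all reviews, with or without text
mean_rating <- mean(reviews$rating)
rating_counts <- table(reviews$rating)
```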
Retrieve the GP practices near a given postcode:
```r
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4549L, verbose = FALSE)
# Retrieve GP practices near L69 3GL
# (Waterhouse building, University of Liverpool)
out <- scrappy::find_a_gp(rD$client, postcode = "L69 3GL")
# Stop server
rD$server$stop()
```
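Each example stops the Selenium server as its last step, so an error during scraping would leave the server running and the port occupied. One possible way to guard against this, sketched here with base R's `tryCatch()`, is to move the cleanup into a `finally` clause. This is a suggestion rather than part of scrappy itself.

```r
# Create RSelenium session
rD <- RSelenium::rsDriver(browser = "firefox", port = 4549L, verbose = FALSE)

out <- tryCatch(
  scrappy::find_a_gp(rD$client, postcode = "L69 3GL"),
  finally = {
    # Runs whether or not the scraping call succeeded
    rD$client$close()  # close the browser window
    rD$server$stop()   # stop the Selenium server and free the port
  }
)
```

If the scraping call fails, the error is still raised after the cleanup has run, so the failure is not hidden.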