r-public / room_utils

Some R programs to help in the room
Apache License 2.0

Scrape the "all rooms" page #6

Open mrdwab opened 8 years ago

mrdwab commented 8 years ago

There should be a function to scrape the "all rooms" page, with at least the "active" and "people" sort options (http://chat.stackoverflow.com/?tab=all&sort=active and http://chat.stackoverflow.com/?tab=all&sort=people), and return a data.frame of the relevant URLs. This would make the package more generally useful.

@alistaire47, you seem to know what's up when it comes to scraping ;-)

I'm guessing it's something along the lines of starting with:

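library(rvest)  # provides read_html(), html_node(s)() and re-exports %>%

# the_url_i_want is a placeholder for one of the "all rooms" URLs above
the_url_i_want %>%
  read_html() %>%
  html_node("#roomlist") %>%
  html_nodes("h3") %>%
  html_nodes("a")
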
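
A minimal sketch of how that starting point could grow into the requested function, assuming the `#roomlist` / `h3` / `a` structure guessed above really matches the page (it has not been verified here). The names `rooms_url()` and `extract_rooms()` are hypothetical, and the HTML fragment is made up so the example runs offline.

```r
library(rvest)

# Hypothetical helper: build the "all rooms" URL for a given sort option.
rooms_url <- function(sort = c("active", "people")) {
  paste0("http://chat.stackoverflow.com/?tab=all&sort=", match.arg(sort))
}

# Hypothetical helper: pull room names and links out of a parsed page.
# Assumes the links sit under #roomlist > h3 > a, as guessed in the issue.
extract_rooms <- function(page, base_url = "http://chat.stackoverflow.com") {
  links <- page %>%
    html_node("#roomlist") %>%
    html_nodes("h3") %>%
    html_nodes("a")
  data.frame(
    name = html_text(links),
    # Resolve relative hrefs against the site root
    url = xml2::url_absolute(html_attr(links, "href"), base_url),
    stringsAsFactors = FALSE
  )
}

# Offline check with a made-up fragment
page <- read_html(
  '<div id="roomlist"><h3><a href="/rooms/1/example-room">Example Room</a></h3></div>'
)
room_df <- extract_rooms(page)

# Live use (needs network access):
# extract_rooms(read_html(rooms_url("people")))
```

Splitting URL building from extraction keeps the parsing step testable without hitting the chat site.
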
alistaire47 commented 8 years ago

Check it out: https://github.com/alistaire47/room_utils/blob/master/rooms.R

I tried to get the roxygen comments roughly right, but please double-check them before we integrate it; I'm still pretty new to package development.

Also, I realized that despite prefixing all the non-base functions with ::, the scraping scripts still won't run without attaching a package that provides the pipe (unless we do something like magrittr::`%>%`, but that seems absurd), so I just put library(magrittr) at the top. I'm not sure if there's a standard way to deal with that issue, but I'm sure somebody has encountered it before.

romunov commented 8 years ago

Of course you can. You need to specify the imports and exports in the roxygen part of your script. See my examples here: https://github.com/romunov/zvau/blob/7135636db9d5a7436a35121dbbd26fd5c1396660/R/writeINEST.R
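
A minimal sketch of what such a roxygen block might look like for the pipe, assuming the package is documented with roxygen2. The function `clean_room_names()` is a hypothetical helper invented for illustration; it is not taken from the linked file.

```r
# In a package, roxygen2 turns the @importFrom tag below into
# importFrom(magrittr, "%>%") in NAMESPACE, so no library() call is needed.

#' Trim and lower-case room names (hypothetical helper)
#'
#' @param x Character vector of room names.
#' @return Character vector, trimmed and lower-cased.
#' @importFrom magrittr %>%
#' @export
clean_room_names <- function(x) {
  x %>% trimws() %>% tolower()
}

# In a standalone script there is no NAMESPACE, so bind the pipe once
# instead of attaching all of magrittr.
`%>%` <- magrittr::`%>%`

cleaned <- clean_room_names(c("  R-Public ", "Python"))
```
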

alistaire47 commented 8 years ago

@romunov Oh perfect, thanks! I updated the script linked above.

Also, here's a little add-on function, which is useful but slow because it scrapes everything every time. (Maybe rooms() could be cached and only called if there's no match or it's demanded by another parameter, but I'm not sure if anybody would use the function repeatedly anyway.)

find_room <- function(room_name, exact = FALSE) {
    # For an exact match, anchor the name between a slash and the end of the string
    pattern <- if (exact) paste0("/", room_name, "$") else room_name
    grep(pattern, rooms(), value = TRUE, ignore.case = TRUE)
}

Documented: https://github.com/alistaire47/room_utils/blob/master/find_room.R
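
The caching idea mentioned above could be sketched as follows, as one possible design rather than the linked implementation. Here `rooms()` is replaced by a stand-in that returns made-up example.com URLs and counts its calls; `cached_rooms()` and `find_room_cached()` are hypothetical names.

```r
# Stand-in for the real rooms(), which scrapes the chat site.
# It counts calls so the effect of caching is visible.
scrape_count <- 0
rooms <- function() {
  scrape_count <<- scrape_count + 1
  c("https://chat.example.com/rooms/1/r-public",
    "https://chat.example.com/rooms/2/python")
}

# Session-level cache kept in its own environment
.room_cache <- new.env(parent = emptyenv())

cached_rooms <- function(refresh = FALSE) {
  if (refresh || is.null(.room_cache$urls)) {
    .room_cache$urls <- rooms()
  }
  .room_cache$urls
}

find_room_cached <- function(room_name, exact = FALSE) {
  pattern <- if (exact) paste0("/", room_name, "$") else room_name
  hits <- grep(pattern, cached_rooms(), value = TRUE, ignore.case = TRUE)
  if (length(hits) == 0) {
    # No match: the cache may be stale, so scrape once more and retry
    hits <- grep(pattern, cached_rooms(refresh = TRUE),
                 value = TRUE, ignore.case = TRUE)
  }
  hits
}

find_room_cached("python")
find_room_cached("r-public", exact = TRUE)
```

Both lookups are served from a single scrape; only a miss triggers a second one.
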

romunov commented 8 years ago

If you feel this overhead is too much, consider saving the data to an external file and checking for its existence (and timestamp) before scraping all the rooms again.
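
A minimal sketch of that file-based approach, under stated assumptions: `rooms()` is again a stand-in returning made-up URLs, and the function name `cached_rooms_file()`, the `.rds` format and the one-hour default age are all illustrative choices, not part of the package.

```r
# Stand-in for the real rooms(); counts calls to show when scraping happens.
scrape_count <- 0
rooms <- function() {
  scrape_count <<- scrape_count + 1
  c("https://chat.example.com/rooms/1/r-public",
    "https://chat.example.com/rooms/2/python")
}

cached_rooms_file <- function(cache_file = file.path(tempdir(), "rooms.rds"),
                              max_age_secs = 3600) {
  # The cache is usable only if the file exists and is recent enough
  fresh <- file.exists(cache_file) &&
    as.numeric(difftime(Sys.time(), file.mtime(cache_file),
                        units = "secs")) < max_age_secs
  if (fresh) {
    return(readRDS(cache_file))
  }
  urls <- rooms()
  saveRDS(urls, cache_file)
  urls
}

cache_file <- tempfile(fileext = ".rds")
first  <- cached_rooms_file(cache_file)  # scrapes and writes the file
second <- cached_rooms_file(cache_file)  # reads the file, no scraping
```
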

I would also suggest you commit the code to the R package folder in a different branch. Once everyone likes the functionality (and it compiles OK), that branch can be merged seamlessly into the main branch.