There are several ways to select elements when scraping: XPath, CSS selectors, and regular expressions.
To make some elements easier to reach, I wrote a function that is used like the dplyr helper functions. It combines the features of three of them: starts_with(), contains() and ends_with().
Previously I didn't know how to use regular expressions in web scraping and had no idea about selectors. I have more or less learned them now and can reach the elements without the function I wrote. However, beginners like me still have to research and learn how to reach the elements.
I would like to hear your opinions: would it make sense to add a function like this to the rvest package as a new feature, to make elements easier to reach?
# Packages
library(rvest)
library(dplyr)
# Function
html_nodes_regex <- function(html, node_name, attr, regex_type = c("equal", "startswith", "contains", "endswith")){
  #https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes
  #https://medium.com/yonder-techblog/css-regex-attribute-selectors-98075b7f4726

  # Checks
  if(missing(node_name)){stop("`node_name` cannot be missing!")}
  if(missing(attr)){stop("`attr` cannot be missing!")}
  if(missing(regex_type)){stop("`regex_type` cannot be missing!")}
  if(!is.character(node_name)){stop("The class of `node_name` has to be character!")}
  if(!is.character(attr)){stop("The class of `attr` has to be character!")}
  if(!is.character(regex_type)){stop("The class of `regex_type` has to be character!")}
  # Exactly one value is allowed, and it must be a known type
  if(length(regex_type) != 1 || !regex_type %in% c("equal", "startswith", "contains", "endswith")){
    stop("`regex_type` has to be one of them: `equal`, `startswith`, `contains` or `endswith`!")
  }

  # Regex Type: the operator that goes before "=" in the attribute selector
  regex_type_check <- switch(regex_type,
    equal = "",
    startswith = "^",
    contains = "*",
    endswith = "$",
    stop("Unknown `regex_type`! Type must be `equal`, `startswith`, `contains` or `endswith`", call. = FALSE)
  )

  # Selector Query, e.g. [id^="results"]; `node_name` is the attribute value to match.
  # The value is quoted so that values with spaces or special characters still parse.
  query <- paste0("[", attr, regex_type_check, "=\"", node_name, "\"]")

  # Selecting Elements
  html %>% rvest::html_nodes(query)
}
# Reading the HTML page of the Premier League
url <- "https://fbref.com/en/comps/9/Premier-League-Stats"
page <- rvest::read_html(url)
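For comparison, here is a minimal sketch of the plain CSS attribute selectors that the function builds, written out by hand. The inline HTML and its id values are made up for illustration (they are not taken from the fbref page), so the example runs without a network connection.

```r
library(rvest)

# Hypothetical inline document; the ids are invented for this example
doc <- read_html('<html><body>
  <table id="results_overall"></table>
  <table id="results_home_away"></table>
  <table id="stats_squads"></table>
</body></html>')

# [attr^="x"] = starts with, [attr*="x"] = contains, [attr$="x"] = ends with
starts   <- html_nodes(doc, '[id^="results"]')
contains <- html_nodes(doc, '[id*="home"]')
ends     <- html_nodes(doc, '[id$="squads"]')

html_attr(starts, "id")    # "results_overall" "results_home_away"
html_attr(contains, "id")  # "results_home_away"
html_attr(ends, "id")      # "stats_squads"
```

With the function above, a call such as html_nodes_regex(doc, "results", "id", "startswith") should produce a selector of the same form as the first one.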
Best regards, Ekrem.