tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

Finding elements/nodes in regex way #412

Open EkremBayar opened 2 months ago

EkremBayar commented 2 months ago

There are multiple ways to select elements: XPath, CSS selectors, and regular expressions.
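
As a minimal sketch of those three approaches, the snippet below selects the same elements in each way. The HTML fragment is hypothetical and only loosely modelled on the ids that appear later in this post; the regular-expression variant is just one possible way to do it, filtering the `id` values in R after selecting every element that has an `id`.

```r
library(rvest)

# Hypothetical HTML fragment, used only for illustration
doc <- read_html('
  <div id="all_stats_squads_standard"></div>
  <div id="all_stats_squads_keeper"></div>
  <table id="stats_squads_standard_for"></table>
')

# 1. CSS attribute selector: id starts with "all_stats_squads_"
css_hits <- html_nodes(doc, "[id^='all_stats_squads_']")

# 2. XPath 1.0: the same condition expressed with starts-with()
xpath_hits <- html_nodes(doc, xpath = "//*[starts-with(@id, 'all_stats_squads_')]")

# 3. Regular expression: take every element that has an id, then filter in R
with_id <- html_nodes(doc, "[id]")
regex_hits <- with_id[grepl("^all_stats_squads_", html_attr(with_id, "id"))]

# All three approaches return the same two <div> elements
html_attr(css_hits, "id")
html_attr(xpath_hits, "id")
html_attr(regex_hits, "id")
```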

To make some elements easier to reach, I've written a function that is used like the dplyr helper functions. It combines the features of three of them: starts_with(), contains() and ends_with().

Previously I didn't know how to use regular expressions in web scraping and had no idea about selectors. I've learned the basics now and can reach the elements without the function I wrote. However, beginners like me still have to research and learn how to reach the elements.

I'd like to hear your opinions: would it make sense to add a function like this to the rvest package as a new feature, so that elements are easier to reach?

# Packages
library(rvest)
library(dplyr)

# Function
html_nodes_regex <- function(html, node_name, attr, regex_type = c("equal", "startswith", "contains", "endswith")){

  #https://developer.mozilla.org/en-US/docs/Web/CSS/Pseudo-classes
  #https://medium.com/yonder-techblog/css-regex-attribute-selectors-98075b7f4726

  # Checks
  if(missing(node_name)){stop("`node_name` cannot be missing!")}
  if(missing(attr)){stop("`attr` cannot be missing!")}
  if(missing(regex_type)){stop("`regex_type` cannot be missing!")}
  if(!is.character(node_name)){stop("The class of `node_name` has to be character!")}
  if(!is.character(attr)){stop("The class of `attr` has to be character!")}
  if(!is.character(regex_type)){stop("The class of `regex_type` has to be character!")}
  # `regex_type` must be a single value taken from the allowed set
  if(length(regex_type) != 1 || !regex_type %in% c("equal","startswith", "contains", "endswith")){
    stop("`regex_type` has to be one of: `equal`, `startswith`, `contains` or `endswith`!")
  }

  # Regex Type
  regex_type_check <- switch(regex_type,
                       equal = "",
                       startswith = "^",
                       contains = "*",
                       endswith = "$",
                       stop("Unknown `regex_type`! Type must be `equal`, `startswith`, `contains` or `endswith`", call. = FALSE)
  )

  # Selector Query (the value is quoted so that spaces, dots, etc. stay valid in the selector)
  query <- paste0("[", attr, regex_type_check, "='", node_name, "']")

  # Selecting Elements
  html %>% rvest::html_nodes(query)

}

# Reading the HTML page of the Premier League
url <- "https://fbref.com/en/comps/9/Premier-League-Stats"
page <- rvest::read_html(url)
# Starts with
page %>% html_nodes_regex(node_name = "all_stats_squads_", attr = "id", regex_type = "startswith")
{xml_nodeset (11)}
 [1] <div id="all_stats_squads_standard" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats ...
 [2] <div id="all_stats_squads_keeper" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_s ...
 [3] <div id="all_stats_squads_keeper_adv" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="sta ...
 [4] <div id="all_stats_squads_shooting" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats ...
 [5] <div id="all_stats_squads_passing" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_ ...
 [6] <div id="all_stats_squads_passing_types" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id=" ...
 [7] <div id="all_stats_squads_gca" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_squa ...
 [8] <div id="all_stats_squads_defense" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_ ...
 [9] <div id="all_stats_squads_possession" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="sta ...
[10] <div id="all_stats_squads_playing_time" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="s ...
[11] <div id="all_stats_squads_misc" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats_squ ...
# Contains
page %>% html_nodes_regex(node_name = "squads_standar", attr = "id", regex_type = "contains")
{xml_nodeset (13)}
 [1] <div id="all_stats_squads_standard" class="table_wrapper tabbed">\n\t\n\t\t<span class="section_anchor" id="stats ...
 [2] <span class="section_anchor" id="stats_squads_standard_link" data-label="Squad Standard Stats"></span>
 [3] <div class="section_heading assoc_stats_squads_standard_for" id="stats_squads_standard_for_sh">\n  <span class="s ...
 [4] <span class="section_anchor" id="stats_squads_standard_for_link" data-label="Squad Standard Stats" data-no-inpage ...
 [5] <div class="section_heading hidden assoc_stats_squads_standard_against" id="stats_squads_standard_against_sh">\n  ...
 [6] <span class="section_anchor" id="stats_squads_standard_against_link" data-label="Squad Standard Stats" data-no-in ...
 [7] <div id="switcher_stats_squads_standard">\n\n\t<div class="table_container tabbed current" id="div_stats_squads_s ...
 [8] <div class="table_container tabbed current" id="div_stats_squads_standard_for">\n\t\t\n\t\t<table class="stats_ta ...
 [9] <table class="stats_table sortable min_width" id="stats_squads_standard_for" data-cols-to-freeze=",1">\n<caption> ...
[10] <div class="footer no_hide_long" id="tfooter_stats_squads_standard_for">\n\t\t\n\t\t<small>Totals may not be comp ...
[11] <div class="table_container tabbed" id="div_stats_squads_standard_against">\n\t\t\n\t\t<table class="stats_table  ...
[12] <table class="stats_table sortable min_width" id="stats_squads_standard_against" data-cols-to-freeze=",1">\n<capt ...
[13] <div class="footer no_hide_long" id="tfooter_stats_squads_standard_against">\n\t\t\n\t\t<small>Totals may not be  ...
# Ends with
page %>% html_nodes_regex(node_name = "_for", attr = "id", regex_type = "endswith")
{xml_nodeset (33)}
 [1] <div class="table_container tabbed current" id="div_stats_squads_standard_for">\n\t\t\n\t\t<table class="stats_ta ...
 [2] <table class="stats_table sortable min_width" id="stats_squads_standard_for" data-cols-to-freeze=",1">\n<caption> ...
 [3] <div class="footer no_hide_long" id="tfooter_stats_squads_standard_for">\n\t\t\n\t\t<small>Totals may not be comp ...
 [4] <div class="table_container tabbed current" id="div_stats_squads_keeper_for">\n\t\t\n\t\t<table class="stats_tabl ...
 [5] <table class="stats_table sortable min_width" id="stats_squads_keeper_for" data-cols-to-freeze=",1">\n<caption>Sq ...
 [6] <div class="footer no_hide_long" id="tfooter_stats_squads_keeper_for">\n\t\t\n\t\t<small>Totals may not be comple ...
 [7] <div class="table_container tabbed current" id="div_stats_squads_keeper_adv_for">\n\t\t\n\t\t<table class="stats_ ...
 [8] <table class="stats_table sortable min_width" id="stats_squads_keeper_adv_for" data-cols-to-freeze=",1">\n<captio ...
 [9] <div class="footer no_hide_long" id="tfooter_stats_squads_keeper_adv_for">\n\t\t\n\t\t<small>Totals may not be co ...
[10] <div class="table_container tabbed current" id="div_stats_squads_shooting_for">\n\t\t\n\t\t<table class="stats_ta ...
[11] <table class="stats_table sortable min_width" id="stats_squads_shooting_for" data-cols-to-freeze=",1">\n<caption> ...
[12] <div class="footer no_hide_long" id="tfooter_stats_squads_shooting_for">\n\t\t\n\t\t<small>Totals may not be comp ...
[13] <div class="table_container tabbed current" id="div_stats_squads_passing_for">\n\t\t\n\t\t<table class="stats_tab ...
[14] <table class="stats_table sortable min_width" id="stats_squads_passing_for" data-cols-to-freeze=",1">\n<caption>S ...
[15] <div class="footer no_hide_long" id="tfooter_stats_squads_passing_for">\n\t\t\n\t\t<small>Totals may not be compl ...
[16] <div class="table_container tabbed current" id="div_stats_squads_passing_types_for">\n\t\t\n\t\t<table class="sta ...
[17] <table class="stats_table sortable min_width" id="stats_squads_passing_types_for" data-cols-to-freeze=",1">\n<cap ...
[18] <div class="footer no_hide_long" id="tfooter_stats_squads_passing_types_for">\n\t\t\n\t\t<small>Totals may not be ...
[19] <div class="table_container tabbed current" id="div_stats_squads_gca_for">\n\t\t\n\t\t<table class="stats_table s ...
[20] <table class="stats_table sortable min_width" id="stats_squads_gca_for" data-cols-to-freeze=",1">\n<caption>Squad ...
...
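
For anyone who wants to try this without downloading the page, here is a small offline sketch of the four CSS attribute selectors that correspond to the `regex_type` values. The HTML fragment and the names `doc`, `selectors` and `matched_ids` are made up for illustration; the selectors are written out by hand rather than produced by the function above.

```r
library(rvest)

# Hypothetical HTML fragment, loosely modelled on the ids shown above
doc <- read_html('
  <div id="all_stats_squads_standard"></div>
  <span id="stats_squads_standard_link"></span>
  <table id="stats_squads_standard_for"></table>
  <table id="stats_squads_keeper_for"></table>
')

# One CSS attribute selector per `regex_type` value
selectors <- c(
  equal      = "[id='stats_squads_keeper_for']",  # exact match
  startswith = "[id^='all_stats_squads_']",       # prefix match
  contains   = "[id*='squads_standar']",          # substring match
  endswith   = "[id$='_for']"                     # suffix match
)

# Collect the ids matched by each selector
matched_ids <- lapply(selectors, function(css) {
  html_attr(html_nodes(doc, css), "id")
})

matched_ids
```

Strictly speaking these are CSS attribute selectors rather than regular expressions, so the function is a thin wrapper around syntax that `html_nodes()` already accepts.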

Best regards, Ekrem.