tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

html_table could expose .name_repair argument, to pass through to as_tibble() #340

Open rhalbersma opened 2 years ago

rhalbersma commented 2 years ago

When parsing HTML tables, non-unique column names frequently appear, e.g. when the header spans multiple rows and a cell in the first header row spans multiple columns.

It would be nice if html_table could expose .name_repair as an argument to pass through to as_tibble. As it stands, the implementation hard-codes .name_repair = "minimal" in its call to as_tibble, so users have to add an extra as_tibble(.name_repair = "unique") step to pipelines that parse more complicated HTML tables.
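For illustration, a minimal sketch of that workaround. The inline table built with minimal_html() is made-up data standing in for a real page, and the sketch assumes rvest >= 1.0.0 plus tibble are installed:

```r
library(rvest)
library(tibble)

# Made-up inline table with a duplicated header, standing in for a real page
page <- minimal_html('
  <table>
    <tr><th>Name</th><th>Height</th><th>Height</th></tr>
    <tr><td>A</td><td>12.1</td><td>3.70</td></tr>
  </table>')

# html_table() keeps the duplicated "Height" name as-is
tbl <- page %>%
  html_element("table") %>%
  html_table()
names(tbl)

# The extra step currently needed to get unique column names
fixed <- as_tibble(tbl, .name_repair = "unique")
names(fixed)
```

If html_table() accepted .name_repair directly, the last step could be folded into the html_table() call.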

Such an extension would be in line with the recommendation from https://www.tidyverse.org/blog/2018/11/tibble-2.0.0-pre-announce/:

Packages that are in the business of making tibbles may even want to expose the .name_repair argument and pass it through to tibble() or as_tibble(). For example, this is the approach planned for readxl, which reads rectangular data out of Excel workbooks.

djvill commented 1 year ago

I've run into this issue too, and I have a reprex based on a Wiki page:

library(rvest)
mich <- "https://en.wikipedia.org/w/index.php?title=List_of_tallest_buildings_by_county_in_Michigan&oldid=1089312494"
michTable <- read_html(mich) %>% 
  html_element(".wikitable")
html_elements(michTable, "th")
#> {xml_nodeset (11)}
#>  [1] <th rowspan="2">County</th>
#>  [2] <th rowspan="2">City</th>
#>  [3] <th rowspan="2" style="width: 22%;">Building</th>
#>  [4] <th rowspan="2">Image</th>
#>  [5] <th colspan="2">Height</th>
#>  [6] <th rowspan="2">Floors</th>
#>  [7] <th rowspan="2">Year<sup class="reference" id="ref_note02^"><a href="#en ...
#>  [8] <th rowspan="2">Primary purpose</th>
#>  [9] <th rowspan="2">Previous<br>names\n</th>
#> [10] <th>(ft)</th>
#> [11] <th>(m)\n</th>
michDF <- html_table(michTable)
head(michDF[,1:6])
#> # A tibble: 6 x 6
#>   County         City        Building                        Image Height Height
#>   <chr>          <chr>       <chr>                           <chr> <chr>  <chr> 
#> 1 County         City        Building                        "Ima~ "(ft)" "(m)" 
#> 2 Alcona County  Harrisville Alcona County Building[1]       ""    "12.1~ "3.70"
#> 3 Alger County   Munising    Alger County Courthouse[2]      ""    "24.2~ "7.40"
#> 4 Allegan County Allegan     Allegan County Building[3]      ""    "24.2~ "7.40"
#> 5 Alpena County  Alpena      Northland Area Federal Credit ~ ""    "48.5~ "14.8~
#> 6 Antrim County  Bellaire    Antrim County Courthouse[5]     ""    ""     ""
tryCatch(select(michDF, Height), 
         error = function(e) e)
#> <simpleError in select(michDF, Height): could not find function "select">

Created on 2022-09-11 by the reprex package (v2.0.1)

As @rhalbersma indicated, this is triggered by a header in which several th elements have a rowspan > 1 and one has a colspan > 1. Ideally, the name repair would default to treating the second-row th elements as suffixes to "Height", giving us the unique column names "Height_(ft)" and "Height_(m)".
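Until something like that exists, the suffixing can be approximated after the fact. The following is only a sketch: merge_header_row() is a hypothetical helper, not part of rvest, and it assumes the second header row ends up as the first data row of the tibble, as in the output above. The input is toy data shaped like that output:

```r
library(tibble)

# Hypothetical helper (not an rvest function): fold the first data row,
# which holds the second header row, into the column names.
merge_header_row <- function(df, sep = "_") {
  top <- names(df)
  # First value of every column, as character
  sub <- vapply(df, function(col) as.character(col[1]), character(1),
                USE.NAMES = FALSE)
  # Keep the top-level name where the sub-header is missing, empty,
  # or just repeats it (rowspan cells); otherwise append it as a suffix
  keep_top <- is.na(sub) | sub == "" | sub == top
  new_names <- ifelse(keep_top, top, paste(top, sub, sep = sep))
  out <- df[-1, , drop = FALSE]
  names(out) <- new_names
  # Fall back to "unique" repair in case duplicates remain
  as_tibble(out, .name_repair = "unique")
}

# Toy data mimicking the shape of the html_table() result
raw <- tibble(
  County = c("County", "Alcona County"),
  Height = c("(ft)", "12.1"),
  Height = c("(m)", "3.70"),
  .name_repair = "minimal"
)

clean <- merge_header_row(raw)
names(clean)
```

This only handles a two-row header; deeper headers would need the same step applied once per extra header row.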