tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

Support multiple <th> rows #286

Closed geotheory closed 3 years ago

geotheory commented 4 years ago

Current support is for a single header row, so where a table has an explicit multi-row header (defined with <th> tags), the function has no way to handle it. The tables in this page are such an example. Would it be desirable to support an option for squashing multiple header rows?

hadley commented 3 years ago

I think this problem is a bit too general for rvest to tackle. Figuring out exactly how to represent this sort of data in R is an open question.

dansharkey commented 1 year ago

Having come across a similar problem today, I'm wondering whether there is any proposal to implement a solution to this issue. Perhaps a nested-list approach might work well? Particularly with the new unnest_wider and unnest_longer functions, this might be conducive to a successful workflow. Alternatively, is there an rvest method that I am unfamiliar with that provides access to both of the headers? A naive approach might be to concatenate the "first" header with the "second" header (in the example given, this would produce column names such as "Legislative election - Last"), which would be easier to process.

If we can decide on an approach that would work nicely (either list-based or concatenated header rows), I might be able to donate some time to implement a solution.
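
For illustration, the concatenated-header idea could be sketched roughly as follows. This is only a sketch and not an rvest feature: `squash_headers()`, its `n_header` and `sep` arguments, and the sample table are all hypothetical. It assumes rvest >= 1.0.0, where `html_table()` accepts `convert` and repeats the text of cells that span several rows or columns.

```r
library(rvest)

# Hypothetical table with two <th> header rows, loosely modelled on the
# "Legislative election - Last" example mentioned in this thread.
page <- minimal_html('
<table>
  <tr><th rowspan="2">Party</th><th colspan="2">Legislative election</th></tr>
  <tr><th>Last</th><th>Next</th></tr>
  <tr><td>A</td><td>2019</td><td>2024</td></tr>
  <tr><td>B</td><td>2020</td><td>2025</td></tr>
</table>')

# Hypothetical helper: read every row as data, then collapse the first
# `n_header` rows of each column into a single column name.
squash_headers <- function(table_node, n_header = 2, sep = " - ") {
  # header = FALSE keeps the <th> rows as ordinary rows;
  # convert = FALSE keeps everything as character for now.
  raw <- as.data.frame(
    html_table(table_node, header = FALSE, convert = FALSE),
    stringsAsFactors = FALSE
  )
  head_rows <- raw[seq_len(n_header), , drop = FALSE]
  body <- raw[-seq_len(n_header), , drop = FALSE]

  # Drop empty and repeated labels (a rowspan cell shows up in every row
  # it covers), then join whatever is left with `sep`.
  names(body) <- vapply(head_rows, function(col) {
    parts <- unique(col[!is.na(col) & col != ""])
    paste(parts, collapse = sep)
  }, character(1), USE.NAMES = FALSE)

  rownames(body) <- NULL
  # Guess column types only after the header rows have been removed.
  utils::type.convert(body, as.is = TRUE)
}

result <- squash_headers(html_element(page, "table"))
print(result)
```

On this sample the column names should come out as "Party", "Legislative election - Last" and "Legislative election - Next". The number of header rows has to be supplied by the caller, which is part of why a general solution is hard: nothing in the markup guarantees that every <th> row is a header row, or that headers are limited to <th> cells.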