tidyverse / rvest

Simple web scraping for R
https://rvest.tidyverse.org

html_table() fails when rowspan and/or colspan attributes are blank #322

Closed epiben closed 3 years ago

epiben commented 3 years ago

Parsing with html_table() fails when at least one cell has an empty value for rowspan and/or colspan.

In each group of four examples below, only the first (the one with no blank attributes) currently works. Two small changes in table_fill() (lines 172-173) are enough to resolve the problem. As per the Contributing page, I didn't open a pull request directly but wanted to raise this as a normal issue first. My proposal can be seen in my fork of rvest: https://github.com/epiben/rvest/commit/8f1a78281cb8a75ad7c839aa81dc9c87b3012ae2.

I thought it would be best to handle this rather specific kind of "default" value where it's needed, rather than further upstream, such as in html_attr().
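To illustrate the idea of defaulting blank span attributes, here is a minimal sketch. Note that `span_or_default()` is a hypothetical helper written for this example: it is not an rvest function and not the code in the linked commit. It assumes that a blank, missing, or non-numeric rowspan/colspan should fall back to 1, which is the default span of a table cell in HTML.

```r
# Hypothetical helper (not part of rvest): convert rowspan/colspan attribute
# values to integers, falling back to `default` when a value is blank,
# missing, or cannot be parsed as a number.
span_or_default <- function(x, default = 1L) {
  # as.integer() yields NA for "", NA, and non-numeric strings;
  # suppressWarnings() silences the coercion warning for the latter.
  out <- suppressWarnings(as.integer(x))
  out[is.na(out)] <- default
  out
}

print(span_or_default(c("2", "", NA, "abc")))
# [1] 2 1 1 1
```

Handling the fallback at the point where the spans are consumed keeps html_attr() returning the raw attribute value unchanged for every other caller.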

library(rvest) # whether installed from CRAN or the GitHub repo

# Single-cell tables
minimal_html("<table><tr><td>Cell1</td></tr></table>") %>% 
    html_table()

minimal_html("<table><tr><td rowspan=''>Cell1</td></tr></table>") %>% 
    html_table()

minimal_html("<table><tr><td colspan=''>Cell1</td></tr></table>") %>% 
    html_table()

minimal_html("<table><tr><td colspan='' rowspan=''>Cell1</td></tr></table>") %>% 
    html_table()

# Single-row, multi-column tables
minimal_html("<table><tr><td>Cell1</td><td>Cell2</td></tr></table>") %>% 
    html_table()

minimal_html("<table><tr><td rowspan=''>Cell1</td><td rowspan=''>Cell2</td></tr></table>") %>% 
    html_table()

minimal_html("<table><tr><td colspan=''>Cell1</td><td colspan=''>Cell2</td></tr></table>") %>% 
    html_table()

minimal_html("<table><tr><td colspan='' rowspan=''>Cell1</td><td colspan='' rowspan=''>Cell2</td></tr></table>") %>% 
    html_table()

# Multi-row, multi-column tables
minimal_html("
    <table>
        <tr><td>Cell1</td><td>Cell2</td></tr>
        <tr><td colspan=2>Cell3</td></tr>
    </table>") %>% 
    html_table()

minimal_html("
    <table>
        <tr><td rowspan=''>Cell1</td><td rowspan=''>Cell2</td></tr>
        <tr><td rowspan='' colspan=2>Cell3</td></tr>
    </table>") %>% 
    html_table()

minimal_html("
    <table>
        <tr><td colspan=''>Cell1</td><td colspan=''>Cell2</td></tr>
        <tr><td colspan=2>Cell3</td></tr>
    </table>") %>% 
    html_table()

minimal_html("
    <table>
        <tr><td rowspan='' colspan=''>Cell1</td><td rowspan='' colspan=''>Cell2</td></tr>
        <tr><td rowspan='' colspan=2>Cell3</td></tr>
    </table>") %>% 
    html_table()

hadley commented 3 years ago

Please submit a PR 😀

epiben commented 3 years ago

Done! 😀