r-lib / xml2

Bindings to libxml2
https://xml2.r-lib.org/

url_absolute fails with spaces in url #401

Open jonthegeek opened 1 year ago

xml_attr(x, "href") returns URLs exactly as they appear in the source, so un-encoded URLs stay un-encoded, and those URLs then fail in url_absolute().

url <- "/filename with spaces.pdf" 
xml2::url_absolute(
  url,
  base = "https://example.com/"
)
#> [1] NA
xml2::url_absolute(
  utils::URLencode(url),
  base = "https://example.com/"
)
#> [1] "https://example.com/filename%20with%20spaces.pdf"

Created on 2023-08-23 with reprex v2.0.2

url_absolute() fails when the URL contains spaces and silently returns NA. It should at least warn the user, though it might be preferable for it to encode the URL itself.

This is where I found it in the wild:

base_url <- "https://www.copyright.gov/fair-use/fair-index.html"

pdf_urls <-
  rvest::read_html(base_url) |> 
  rvest::html_element("table") |> 
  rvest::html_elements("tr>td:first-of-type>a:first-of-type") |>
  rvest::html_attr("href")

pdf_urls[[10]] |> 
  rvest::url_absolute(base_url)
#> [1] NA

pdf_urls[[10]] |> 
  utils::URLencode() |> 
  rvest::url_absolute(base_url)
#> [1] "https://www.copyright.gov/fair-use/summaries/ONeil%20v.%20Ratajkowski%20No.%2019%20CIV.%209769%20(S.D.N.Y.%202021).pdf"

Created on 2023-08-23 with reprex v2.0.2
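
Until url_absolute() handles this itself, one possible stopgap is a small wrapper that percent-encodes each URL before resolving it and warns if the result is still NA. This is only a sketch: url_absolute_encoded() is a hypothetical helper name, not part of xml2 or rvest, and it assumes that utils::URLencode() with its default arguments is an acceptable encoding step for the URLs at hand.

```r
# Hypothetical helper (not part of xml2 or rvest): encode, resolve, then warn
# about any input that still cannot be resolved.
url_absolute_encoded <- function(x, base) {
  x <- as.character(x)
  out <- rep(NA_character_, length(x))
  ok <- !is.na(x)

  if (any(ok)) {
    # With its defaults, URLencode() leaves reserved characters such as "/"
    # alone and returns strings that already contain a %XX escape unchanged.
    encoded <- vapply(x[ok], utils::URLencode, character(1), USE.NAMES = FALSE)
    out[ok] <- xml2::url_absolute(encoded, base = base)
  }

  # Surface failures instead of letting NA pass through silently.
  failed <- ok & is.na(out)
  if (any(failed)) {
    warning(
      sprintf("Could not resolve %d URL(s): %s",
              sum(failed), paste(x[failed], collapse = ", ")),
      call. = FALSE
    )
  }
  out
}

# Same input as the first reprex above.
url_absolute_encoded("/filename with spaces.pdf", base = "https://example.com/")
```

Because URLencode() skips any string that already contains a %XX escape, a URL that mixes escapes with literal spaces is passed through unchanged and would presumably still fail, but the wrapper would at least warn about it.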