Open jonthegeek opened 1 year ago
`xml_attr(x, "href")` returns un-encoded URLs if that's how they appear in the source, but then those URLs fail in `url_absolute()`.
    url <- "/filename with spaces.pdf"
    xml2::url_absolute(
      url,
      base = "https://example.com/"
    )
    #> [1] NA
    xml2::url_absolute(
      utils::URLencode(url),
      base = "https://example.com/"
    )
    #> [1] "https://example.com/filename%20with%20spaces.pdf"
Created on 2023-08-23 with reprex v2.0.2
`url_absolute()` gets confused if the URL contains spaces, and silently returns `NA`. This should at least warn the user, but it might be preferable to deal with it directly.
This is where I found it in the wild:
    base_url <- "https://www.copyright.gov/fair-use/fair-index.html"
    pdf_urls <- rvest::read_html(base_url) |>
      rvest::html_element("table") |>
      rvest::html_elements("tr>td:first-of-type>a:first-of-type") |>
      rvest::html_attr("href")
    pdf_urls[[10]] |>
      rvest::url_absolute(base_url)
    #> [1] NA
    pdf_urls[[10]] |>
      utils::URLencode() |>
      rvest::url_absolute(base_url)
    #> [1] "https://www.copyright.gov/fair-use/summaries/ONeil%20v.%20Ratajkowski%20No.%2019%20CIV.%209769%20(S.D.N.Y.%202021).pdf"
Created on 2023-08-23 with reprex v2.0.2
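
Until this is handled upstream, a user-side wrapper could cover both suggestions above: warn when a URL needs encoding, then encode it before resolving. The following is only a rough sketch, not a proposed patch. The names `encode_unsafe_urls()` and `url_absolute_encoded()` are hypothetical and exist in neither xml2 nor rvest. It also assumes `utils::URLencode()` with its default arguments is an acceptable encoder. With those defaults, a URL that already contains a `%XX` sequence is returned unchanged, so a URL mixing encoded and raw characters would still slip through.

```r
# Hypothetical helper (not part of xml2 or rvest): percent-encode URLs that
# utils::URLencode() would change, and warn so the fix is not silent.
encode_unsafe_urls <- function(urls) {
  encoded <- urls
  ok <- !is.na(urls)
  # URLencode() errors on NA input, so only the non-missing entries are encoded.
  encoded[ok] <- vapply(
    urls[ok],
    utils::URLencode,
    character(1),
    USE.NAMES = FALSE
  )
  changed <- ok & encoded != urls
  if (any(changed)) {
    warning(
      sum(changed),
      " URL(s) contained unencoded characters and were percent-encoded.",
      call. = FALSE
    )
  }
  encoded
}

# Hypothetical wrapper: encode first, then resolve against the base URL.
url_absolute_encoded <- function(urls, base) {
  xml2::url_absolute(encode_unsafe_urls(urls), base = base)
}
```

Calling `url_absolute_encoded("/filename with spaces.pdf", base = "https://example.com/")` should then take the same path as the second call in the first reprex, with a warning in place of the silent `NA`.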