hrbrmstr / htmlunit

🕸🧰☕️Tools to Scrape Dynamic Web Content via the 'HtmlUnit' Java Library
Apache License 2.0

error message: Specified cast is not valid; An unhandled exception was generated during the execution of the current web request. #5

Open werkstattcodes opened 4 years ago

werkstattcodes commented 4 years ago

First, many thanks for (yet another) very helpful package!

I am trying to use the htmlunit package to scrape results from https://juris.ohchr.org/Search/Documents

(the site doesn’t show any search results unless you select at least one search option, e.g. a treaty).

While browsing the site/reading the results works (most of the time), I get an error message when trying to retrieve the results with the htmlunit package.

Is this error caused purely by the website, in the sense that it is not properly set up, or is there something in htmlunit that triggers it? If the latter, is there any way to work around it with htmlunit?

If you think this would be better asked on SO, let me know.

Thanks again!

library(htmlunit)
#> Loading required package: rJava
#> Loading required package: htmlunitjars
#> Loading required package: rvest
#> Loading required package: xml2
library(tidyverse)
#> Warning: package 'dplyr' was built under R version 3.6.2

my_site2 <- "http://juris.ohchr.org/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
#my_site2 <- "https://juris.ohchr.org/search/results"

js_pg2 <- htmlunit::hu_read_html(my_site2)
js_pg2
#> {html_document}
#> <html>
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body bgcolor="white">\r\n    <span>\r\n      <h1>\r\n        Server Erro ...
html_nodes(js_pg2, "td")
#> {xml_nodeset (2)}
#> [1] <td>\r\n              <code>\r\n                \n\nAn unhandled exceptio ...
#> [2] <td>\r\n              <code>\r\n                <pre>\r\n                 ...

Created on 2020-03-05 by the reprex package (v0.3.0)

hrbrmstr commented 4 years ago

Well met! And #ty for kicking the tires on {htmlunit} and taking the time to file an issue!

I was able to reproduce that website error with the code sample but I'm fairly certain it's not {htmlunit}.

The URL here:

my_site2 <- "http://juris.ohchr.org/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"

won't work on its own because it lacks the context of the original search, which the site submits behind the scenes as a POST request.
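
Before parsing a response, it can help to check whether the site returned an error page instead of results. A minimal sketch in base R, assuming the page text has already been extracted (for example with `rvest::html_text()`). `is_server_error_page()` is a hypothetical helper, not part of {htmlunit} or {httr}, and the message it looks for is the one quoted in this issue's title:

```r
# Hypothetical helper (not part of {htmlunit} or {httr}): returns TRUE when
# the page text contains the error message quoted in this issue's title.
is_server_error_page <- function(page_text) {
  any(grepl("An unhandled exception", page_text, fixed = TRUE))
}

# Sample inputs: the error message from the issue title, and a string shaped
# like the results summary that a successful search page carries.
error_text <- "An unhandled exception was generated during the execution of the current web request."
ok_text <- "61 results found page 1 of 7"

is_server_error_page(error_text)
is_server_error_page(ok_text)
```

With the objects from the original report, the check would be called as `is_server_error_page(html_text(js_pg2))`.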

Thankfully, that site is basic enough that you don't need a JavaScript-enabled context to get the results.

I'm on kid duty this morning so I can't expand on the following right now, but this is the idiom you can use for searching & fetching results on that site. Drop a comment reply if any of it needs further clarification:

library(httr)
library(rvest)
library(stringi)
library(tidyverse)

# make the initial search request

httr::POST(
  url = "https://juris.ohchr.org/search/results",
  encode = "form",
  body = list(
    Keyword = "", 
    SearchOperatorType = "0",
    Symbol = "", 
    AdoptionOfViewYear = "2019", 
    EndAdoptionOfViewYear = "2020"
  )
) -> res

# get the first table

pg <- httr::content(res)

html_node(pg, "table.results") %>% 
  html_table() %>% 
  as_tibble() %>% 
  janitor::clean_names() -> tbl1

tbl1
## # A tibble: 10 x 9
##    display_name treaties countries symbols date_of_adoptio… issues articles
##    <chr>        <chr>    <chr>     <chr>   <chr>            <chr>  <chr>   
##  1 A.B.         CRC      Spain     CRC/C/… 07 Feb 2020      "admi… CRC-12C…
##  2 N.R.         CRC      Paraguay  CRC/C/… 03 Feb 2020      "admi… CRC-10-…
##  3 Natalia Cio… CEDAW    Republic… CEDAW/… 04 Nov 2019      "disc… 11(1)(E…
##  4 El Hasnaoui… CESCR    Spain     E/C.12… 22 Oct 2019      "hous… CESCR-1…
##  5 López Albán… CESCR    Spain     E/C.12… 11 Oct 2019      "admi… CESCR-1…
##  6 S. S. R.     CESCR    Spain     E/C.12… 11 Oct 2019      "admi… CESCR-1…
##  7 M. L. B.     CESCR    Luxembou… E/C.12… 11 Oct 2019      "admi… CESCR-8…
##  8 M. T. et al  CESCR    Spain     E/C.12… 11 Oct 2019      ""     CESCR-1…
##  9 M. P. y otr… CESCR    Spain     E/C.12… 11 Oct 2019      "hous… CESCR-1…
## 10 Z. P. y otr… CESCR    Spain     E/C.12… 11 Oct 2019      "hous… CESCR-1…
## # … with 2 more variables: communications <chr>, type_of_decisions <chr>

# find how many pages of content we have

html_nodes(pg, "section.content") %>% 
  html_text() %>% 
  stri_match_all_regex("([[:digit:]]+) results found page ([[:digit:]]+) of ([[:digit:]]+)") %>% 
  unlist() %>% 
  .[-1] %>% 
  as.integer() %>% 
  set_names(c("total", "cur_pg", "last_pg")) %>% 
  as.list() -> results_info

str(results_info, 1)
## List of 3
##  $ total  : int 61
##  $ cur_pg : int 1
##  $ last_pg: int 7

# retrieve the link for the next page and make it generic so we can generate a list of them

html_node(pg, "ul.pagination > li > a[href]") %>% 
  html_attr("href") -> results_pattern

results_pattern
## [1] "/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"

results_pattern <- stri_replace_first_regex(results_pattern, "/([[:digit:]]+)\\?", "/%s?")

remaining_urls <- paste0("https://juris.ohchr.org", sprintf(results_pattern, 2:results_info$last_pg))

remaining_urls
## [1] "https://juris.ohchr.org/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [2] "https://juris.ohchr.org/search/results/3?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [3] "https://juris.ohchr.org/search/results/4?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [4] "https://juris.ohchr.org/search/results/5?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [5] "https://juris.ohchr.org/search/results/6?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [6] "https://juris.ohchr.org/search/results/7?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"

# take the page 1 table, then repeat the same idiom for the remaining pages,
# this time with GET requests (since that's how the site does it), and bind the results together

bind_rows(
  tbl1,
  map_df(remaining_urls, ~{

    res <- httr::GET(url = .x)

    pg <- httr::content(res)

    html_node(pg, "table.results") %>% 
      html_table() %>% 
      as_tibble() %>% 
      janitor::clean_names()

  })
) -> all_pages

all_pages
## # A tibble: 61 x 9
##    display_name treaties countries symbols date_of_adoptio… issues articles
##    <chr>        <chr>    <chr>     <chr>   <chr>            <chr>  <chr>   
##  1 A.B.         CRC      Spain     CRC/C/… 07 Feb 2020      "admi… CRC-12C…
##  2 N.R.         CRC      Paraguay  CRC/C/… 03 Feb 2020      "admi… CRC-10-…
##  3 Natalia Cio… CEDAW    Republic… CEDAW/… 04 Nov 2019      "disc… 11(1)(E…
##  4 El Hasnaoui… CESCR    Spain     E/C.12… 22 Oct 2019      "hous… CESCR-1…
##  5 López Albán… CESCR    Spain     E/C.12… 11 Oct 2019      "admi… CESCR-1…
##  6 S. S. R.     CESCR    Spain     E/C.12… 11 Oct 2019      "admi… CESCR-1…
##  7 M. L. B.     CESCR    Luxembou… E/C.12… 11 Oct 2019      "admi… CESCR-8…
##  8 M. T. et al  CESCR    Spain     E/C.12… 11 Oct 2019      ""     CESCR-1…
##  9 M. P. y otr… CESCR    Spain     E/C.12… 11 Oct 2019      "hous… CESCR-1…
## 10 Z. P. y otr… CESCR    Spain     E/C.12… 11 Oct 2019      "hous… CESCR-1…
## # … with 51 more rows, and 2 more variables: communications <chr>,
## #   type_of_decisions <chr>
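
The two string-handling steps in the walkthrough above (reading the page count and turning the "next page" link into a template) can be tried offline. This is a minimal sketch in base R rather than {stringi}. The `summary_text` value is an assumption reconstructed from the regular expression and the `str()` output above, not text copied from the live site:

```r
# Assumed summary text, reconstructed from the regex and output in the thread.
summary_text <- "61 results found page 1 of 7"

# Capture the three integers: total results, current page, last page.
pattern <- "([[:digit:]]+) results found page ([[:digit:]]+) of ([[:digit:]]+)"
m <- regmatches(summary_text, regexec(pattern, summary_text))[[1]]

# m[1] is the full match; the remaining elements are the capture groups.
results_info <- as.list(
  setNames(as.integer(m[-1]), c("total", "cur_pg", "last_pg"))
)

# The "next page" href shown in the thread, with its page number replaced
# by a %s placeholder so it can be reused for every page.
next_href <- "/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
url_template <- sub("/([[:digit:]]+)\\?", "/%s?", next_href)

# Pages 2..last_pg. seq_len(n)[-1] is empty when there is only one page,
# whereas 2:1 would count down and yield pages 2 and 1.
remaining_pages <- seq_len(results_info$last_pg)[-1]
remaining_urls <- sprintf(
  paste0("https://juris.ohchr.org", url_template),
  remaining_pages
)

remaining_urls
```

Building the full URL inside `sprintf()` means an empty page vector gives an empty result, so a one-page search produces no extra requests.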

hrbrmstr commented 4 years ago

I simplified this a bit:

https://cinc.rud.is/web/packages/unjuris/

remotes::install_git("https://git.rud.is/hrbrmstr/unjuris.git")

Then:

library(tibble) # for pretty printing

(xdf <- juris_search(year_start = 2019, year_end = 2020))
## # A tibble: 61 x 10
##    display_name  treaties countries  symbols date_of_adoptio… issues articles communications type_of_decisio… detail_url
##    <chr>         <chr>    <chr>      <chr>   <chr>            <chr>  <chr>    <chr>          <chr>            <chr>     
##  1 A.B.          CRC      Spain      CRC/C/… 07 Feb 2020      "admi… CRC-12C… 024/2017       Adoption of vie… https://j…
##  2 N.R.          CRC      Paraguay   CRC/C/… 03 Feb 2020      "admi… CRC-10-… 030/2017       Adoption of vie… https://j…
##  3 Natalia Ciob… CEDAW    Republic … CEDAW/… 04 Nov 2019      "disc… 11(1)(E… 104/2016       Adoption of vie… https://j…
##  4 El Hasnaoui … CESCR    Spain      E/C.12… 22 Oct 2019      "hous… CESCR-1… 060/2018       Discontinuance … https://j…
##  5 López Albán … CESCR    Spain      E/C.12… 11 Oct 2019      "admi… CESCR-1… 037/2018       Adoption of vie… https://j…
##  6 S. S. R.      CESCR    Spain      E/C.12… 11 Oct 2019      "admi… CESCR-1… 051/2018       Inadmissibility… https://j…
##  7 M. L. B.      CESCR    Luxembourg E/C.12… 11 Oct 2019      "admi… CESCR-8… 020/2017       Inadmissibility… https://j…
##  8 M. T. et al   CESCR    Spain      E/C.12… 11 Oct 2019      ""     CESCR-1… 110/2019       Discontinuance … https://j…
##  9 M. P. y otros CESCR    Spain      E/C.12… 11 Oct 2019      "hous… CESCR-1… 096/2019       Discontinuance … https://j…
## 10 Z. P. y otros CESCR    Spain      E/C.12… 11 Oct 2019      "hous… CESCR-1… 043/2018       Discontinuance … https://j…
## # … with 51 more rows

get_details(xdf$detail_url[10])
## # A tibble: 6 x 5
##   language doc                         docx                        pdf                        html                      
##   <chr>    <chr>                       <chr>                       <chr>                      <chr>                     
## 1 English  http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 2 Français http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 3 Español  http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 4 العربية  http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 5 中文     http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 6 русский  http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…

get_details(2606)
## # A tibble: 3 x 5
##   language doc                         docx                        pdf                        html                      
##   <chr>    <chr>                       <chr>                       <chr>                      <chr>                     
## 1 English  http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 2 Español  http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 3 中文     http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…

werkstattcodes commented 4 years ago

I am looking for the right emoji.... I mean this is just great.

Although I managed to download the data in the meantime via a detour (a somewhat clumsy rvest approach), your package will be tremendously helpful.

Much of what is happening in your code is (still) beyond my understanding, but I'll definitely dig into it and, if possible, will try (!) to contribute.

hrbrmstr commented 4 years ago

Glad to help! When I get some cycles, I'll annotate what's going on in the code, since you'll likely be able to apply the idiom to other "scraping" tasks once the process is a bit clearer.
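
As a rough illustration of how the idiom carries over to other paginated sites, the "take page 1, fetch the rest, bind everything" step can be separated from the fetching itself. This is a sketch under stated assumptions, not the code from the {unjuris} package. `collect_pages()` and `fake_fetch()` are hypothetical names, and the example.org URLs are placeholders so that it runs without network access:

```r
# Hypothetical helper: `fetch_page` is any function that takes one URL and
# returns a data frame with the same columns as `first_page`.
collect_pages <- function(first_page, urls, fetch_page) {
  rest <- lapply(urls, fetch_page)           # one data frame per remaining page
  do.call(rbind, c(list(first_page), rest))  # stack page 1 and the rest
}

# Stand-in fetcher so the sketch runs offline. A real one would wrap the
# GET / html_node / html_table steps shown earlier in the thread.
fake_fetch <- function(url) {
  data.frame(display_name = paste("row from", url), stringsAsFactors = FALSE)
}

page1 <- data.frame(display_name = "row from page 1", stringsAsFactors = FALSE)
urls <- c("https://example.org/results/2", "https://example.org/results/3")

all_pages <- collect_pages(page1, urls, fake_fetch)
all_pages
```

Passing the fetcher in as an argument keeps the binding logic testable on its own, and the same function works unchanged if a site needs a different request or parsing step.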