Open werkstattcodes opened 4 years ago
Well met! And, #ty for kicking the tires on {htmlunit} and taking the time to file an issue!
I was able to reproduce that website error with the code sample but I'm fairly certain it's not {htmlunit}.
The URL here:
my_site2 <- "http://juris.ohchr.org/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
won't work because it doesn't have the context of the original search which gets submitted behind the scenes as a POST request.
Thankfully, that site is basic enough that you don't need a javascript-enabled context to get the results.
I'm on kid duty this morning so I can't expand on the following right now, but this is the idiom you can use for searching & fetching results on that site. Drop a comment reply if any of it needs further clarification:
library(httr)
library(rvest)
library(stringi)
library(tidyverse)
# make the initial search request
httr::POST(
  url = "https://juris.ohchr.org/search/results",
  encode = "form",
  body = list(
    Keyword = "",
    SearchOperatorType = "0",
    Symbol = "",
    AdoptionOfViewYear = "2019",
    EndAdoptionOfViewYear = "2020"
  )
) -> res
# get the first table
pg <- httr::content(res)
html_node(pg, "table.results") %>%
  html_table() %>%
  as_tibble() %>%
  janitor::clean_names() -> tbl1
tbl1
## # A tibble: 10 x 9
## display_name treaties countries symbols date_of_adoptio… issues articles
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 A.B. CRC Spain CRC/C/… 07 Feb 2020 "admi… CRC-12C…
## 2 N.R. CRC Paraguay CRC/C/… 03 Feb 2020 "admi… CRC-10-…
## 3 Natalia Cio… CEDAW Republic… CEDAW/… 04 Nov 2019 "disc… 11(1)(E…
## 4 El Hasnaoui… CESCR Spain E/C.12… 22 Oct 2019 "hous… CESCR-1…
## 5 López Albán… CESCR Spain E/C.12… 11 Oct 2019 "admi… CESCR-1…
## 6 S. S. R. CESCR Spain E/C.12… 11 Oct 2019 "admi… CESCR-1…
## 7 M. L. B. CESCR Luxembou… E/C.12… 11 Oct 2019 "admi… CESCR-8…
## 8 M. T. et al CESCR Spain E/C.12… 11 Oct 2019 "" CESCR-1…
## 9 M. P. y otr… CESCR Spain E/C.12… 11 Oct 2019 "hous… CESCR-1…
## 10 Z. P. y otr… CESCR Spain E/C.12… 11 Oct 2019 "hous… CESCR-1…
## # … with 2 more variables: communications <chr>, type_of_decisions <chr>
# find how many pages of content we have
html_nodes(pg, "section.content") %>%
  html_text() %>%
  stri_match_all_regex("([[:digit:]]+) results found page ([[:digit:]]+) of ([[:digit:]]+)") %>%
  unlist() %>%
  .[-1] %>%
  as.integer() %>%
  set_names(c("total", "cur_pg", "last_pg")) %>%
  as.list() -> results_info
str(results_info, 1)
## List of 3
## $ total : int 61
## $ cur_pg : int 1
## $ last_pg: int 7
# retrieve the link for the next page and make it generic so we can generate a list of them
html_node(pg, "ul.pagination > li > a[href]") %>%
  html_attr("href") -> results_pattern
results_pattern
## [1] "/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
results_pattern <- stri_replace_first_regex(results_pattern, "/([[:digit:]]+)\\?", "/%s?")
remaining_urls <- paste0("https://juris.ohchr.org", sprintf(results_pattern, 2:results_info$last_pg))
remaining_urls
## [1] "https://juris.ohchr.org/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [2] "https://juris.ohchr.org/search/results/3?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [3] "https://juris.ohchr.org/search/results/4?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [4] "https://juris.ohchr.org/search/results/5?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [5] "https://juris.ohchr.org/search/results/6?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
## [6] "https://juris.ohchr.org/search/results/7?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"
# take the page 1 table, then do the same idiom, this time with a GET request
# since that's how the site does it and bind the results together
bind_rows(
  tbl1,
  map_df(remaining_urls, ~{
    res <- httr::GET(url = .x)
    pg <- httr::content(res)
    html_node(pg, "table.results") %>%
      html_table() %>%
      as_tibble() %>%
      janitor::clean_names()
  })
) -> all_pages
all_pages
## # A tibble: 61 x 9
## display_name treaties countries symbols date_of_adoptio… issues articles
## <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 A.B. CRC Spain CRC/C/… 07 Feb 2020 "admi… CRC-12C…
## 2 N.R. CRC Paraguay CRC/C/… 03 Feb 2020 "admi… CRC-10-…
## 3 Natalia Cio… CEDAW Republic… CEDAW/… 04 Nov 2019 "disc… 11(1)(E…
## 4 El Hasnaoui… CESCR Spain E/C.12… 22 Oct 2019 "hous… CESCR-1…
## 5 López Albán… CESCR Spain E/C.12… 11 Oct 2019 "admi… CESCR-1…
## 6 S. S. R. CESCR Spain E/C.12… 11 Oct 2019 "admi… CESCR-1…
## 7 M. L. B. CESCR Luxembou… E/C.12… 11 Oct 2019 "admi… CESCR-8…
## 8 M. T. et al CESCR Spain E/C.12… 11 Oct 2019 "" CESCR-1…
## 9 M. P. y otr… CESCR Spain E/C.12… 11 Oct 2019 "hous… CESCR-1…
## 10 Z. P. y otr… CESCR Spain E/C.12… 11 Oct 2019 "hous… CESCR-1…
## # … with 51 more rows, and 2 more variables: communications <chr>,
## # type_of_decisions <chr>
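The two bookkeeping steps above (parsing the page count and generating the remaining page URLs) can be tried without touching the site. The following is only a rough sketch under stated assumptions: the `status` string is a made-up stand-in for the text found in `section.content`, and base R's `regexec()`/`sub()` are swapped in for the stringi calls so it runs with no extra packages.

```r
# Offline sketch: no network access needed, base R only.

# Hypothetical status line, modelled on the text the regex above targets
status <- "61 results found page 1 of 7"
pattern <- "([[:digit:]]+) results found page ([[:digit:]]+) of ([[:digit:]]+)"

# regexec() + regmatches() return the full match followed by each capture group
m <- regmatches(status, regexec(pattern, status))[[1]]

# Drop the full match and name the three captured numbers
results_info <- as.list(
  setNames(as.integer(m[-1]), c("total", "cur_pg", "last_pg"))
)

# The "next page" href as printed in the output above
href <- "/search/results/2?typeOfDecisionFilter=0&countryFilter=0&treatyFilter=0"

# Swap the page number for a %s placeholder, then fill in pages 2..last_pg
template <- sub("/[[:digit:]]+\\?", "/%s?", href)
remaining_urls <- paste0(
  "https://juris.ohchr.org",
  sprintf(template, 2:results_info$last_pg)
)

length(remaining_urls)
```

Because `sprintf()` is vectorised over the page numbers, one template yields every remaining URL in a single call, which is what makes the later `map_df()` loop so short.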
I simplified this a bit:
https://cinc.rud.is/web/packages/unjuris/
remotes::install_git("https://git.rud.is/hrbrmstr/unjuris.git")
Then:
library(unjuris) # the package installed above
library(tibble) # for pretty printing
(xdf <- juris_search(year_start = 2019, year_end = 2020))
## # A tibble: 61 x 10
## display_name treaties countries symbols date_of_adoptio… issues articles communications type_of_decisio… detail_url
## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 A.B. CRC Spain CRC/C/… 07 Feb 2020 "admi… CRC-12C… 024/2017 Adoption of vie… https://j…
## 2 N.R. CRC Paraguay CRC/C/… 03 Feb 2020 "admi… CRC-10-… 030/2017 Adoption of vie… https://j…
## 3 Natalia Ciob… CEDAW Republic … CEDAW/… 04 Nov 2019 "disc… 11(1)(E… 104/2016 Adoption of vie… https://j…
## 4 El Hasnaoui … CESCR Spain E/C.12… 22 Oct 2019 "hous… CESCR-1… 060/2018 Discontinuance … https://j…
## 5 López Albán … CESCR Spain E/C.12… 11 Oct 2019 "admi… CESCR-1… 037/2018 Adoption of vie… https://j…
## 6 S. S. R. CESCR Spain E/C.12… 11 Oct 2019 "admi… CESCR-1… 051/2018 Inadmissibility… https://j…
## 7 M. L. B. CESCR Luxembourg E/C.12… 11 Oct 2019 "admi… CESCR-8… 020/2017 Inadmissibility… https://j…
## 8 M. T. et al CESCR Spain E/C.12… 11 Oct 2019 "" CESCR-1… 110/2019 Discontinuance … https://j…
## 9 M. P. y otros CESCR Spain E/C.12… 11 Oct 2019 "hous… CESCR-1… 096/2019 Discontinuance … https://j…
## 10 Z. P. y otros CESCR Spain E/C.12… 11 Oct 2019 "hous… CESCR-1… 043/2018 Discontinuance … https://j…
## # … with 51 more rows
get_details(xdf$detail_url[10])
## # A tibble: 6 x 5
## language doc docx pdf html
## <chr> <chr> <chr> <chr> <chr>
## 1 English http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 2 Français http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 3 Español http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 4 العربية http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 5 中文 http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 6 русский http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
get_details(2606)
## # A tibble: 3 x 5
## language doc docx pdf html
## <chr> <chr> <chr> <chr> <chr>
## 1 English http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 2 Español http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
## 3 中文 http://docstore.ohchr.org/… http://docstore.ohchr.org/… http://docstore.ohchr.org… http://docstore.ohchr.org…
I am looking for the right emoji.... I mean this is just great.
Although I managed to download the data in the meantime via a detour (a somewhat clumsy rvest approach (here)), your package will be tremendously helpful.
Much of what is happening in your code is (still) beyond my understanding, but I'll definitely dig into it and, if possible, will try (!) to contribute.
Glad to help! When I get some cycles I'll annotate what's going on in the code, since you'll likely be able to apply the idiom to other "scraping" tasks once the process is a bit clearer.
First, many thanks for (yet another) very helpful package!
I am trying to use the htmlunit package to scrape results from https://juris.ohchr.org/Search/Documents
(the site doesn't show any search results unless you select at least one search option, e.g. treaty).
While browsing the site/reading the results works (most of the time), I get an error message when trying to retrieve the results with the htmlunit package.
Is this error exclusively related to the website, in the sense that it is not properly set up, or is there something in htmlunit that triggers the error? If so, is there any way to circumvent this error with htmlunit?
If you think this would be better asked on SO, let me know.
Thanks again!
Created on 2020-03-05 by the reprex package (v0.3.0)