rstudio / chromote

Chrome Remote Interface for R
https://rstudio.github.io/chromote/

ERR_HTTP2_PROTOCOL_ERROR when trying to navigate to a website #166

Closed nclsbarreto closed 2 months ago

nclsbarreto commented 3 months ago

I am trying to learn to use chromote and am generally doing pretty well. But I have run into an issue with this website.

library(chromote)
url <- "https://health.usnews.com/best-hospitals/search"

tab <- ChromoteSession$new()

tab$Page$navigate("https://www.google.com")

tab$Page$navigate(url)

There is no problem navigating to Google, but when I try to navigate to usnews I get:

$errorText
[1] "net::ERR_HTTP2_PROTOCOL_ERROR"

Any help would be appreciated.

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19045)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] RPostgreSQL_0.7-5 tmap_3.3-4        odbc_1.4.2        logger_0.3.0      DBI_1.2.2         glue_1.7.0        httr2_1.0.0       jsonlite_1.8.4    xml2_1.3.3       
[10] chromote_0.2.0    openxlsx_4.2.5.2  dbplyr_2.4.0      rvest_1.0.4       lubridate_1.9.2   forcats_1.0.0     stringr_1.5.1     dplyr_1.1.2       purrr_1.0.1      
[19] readr_2.1.5       tidyr_1.3.0       tibble_3.2.1      ggplot2_3.5.0     tidyverse_2.0.0   pacman_0.5.1     

loaded via a namespace (and not attached):
 [1] sf_1.0-12           bit64_4.0.5         RColorBrewer_1.1-3  httr_1.4.7          tools_4.1.1         utf8_1.2.4          R6_2.5.1            KernSmooth_2.23-20 
 [9] colorspace_2.1-0    raster_3.6-20       withr_3.0.1         sp_1.6-0            tidyselect_1.2.1    processx_3.8.1      leaflet_2.2.1       curl_5.2.1         
[17] bit_4.0.5           compiler_4.1.1      leafem_0.2.3        cli_3.6.3           scales_1.3.0        classInt_0.4-9      proxy_0.4-27        rappdirs_0.3.3     
[25] digest_0.6.31       base64enc_0.1-3     dichromat_2.0-0.1   pkgconfig_2.0.3     htmltools_0.5.7     fastmap_1.1.1       htmlwidgets_1.6.4   rlang_1.1.4        
[33] rstudioapi_0.15.0   generics_0.1.3      crosstalk_1.2.1     zip_2.3.0           magrittr_2.0.3      Rcpp_1.0.10         munsell_0.5.0       fansi_1.0.6        
[41] abind_1.4-5         lifecycle_1.0.4     terra_1.7-23        stringi_1.7.12      leafsync_0.1.0      tmaptools_3.1-1     grid_4.1.1          blob_1.2.4         
[49] parallel_4.1.1      promises_1.2.1      lattice_0.20-44     stars_0.6-4         hms_1.1.3           ps_1.7.5            pillar_1.9.0        codetools_0.2-18   
[57] XML_3.99-0.14       BiocManager_1.30.22 vctrs_0.6.5         png_0.1-8           tzdb_0.4.0          gtable_0.3.4        lwgeom_0.2-11       e1071_1.7-13       
[65] later_1.3.2         class_7.3-19        viridisLite_0.4.2   websocket_1.4.1     units_0.8-1         timechange_0.2.0   
gadenbuie commented 2 months ago

I'm pretty certain that the website you're trying to open inspects the User-Agent string of the request, sees "HeadlessChrome" in that field, and blocks access. Clearly they are trying to discourage web scraping efforts.
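If that's the cause, one thing worth trying (a sketch, not something I've tested against this particular site) is overriding the User-Agent before navigating, via the Chrome DevTools Protocol method `Network.setUserAgentOverride` that chromote exposes on the session. The User-Agent string below is illustrative, not a recommendation:

library(chromote)

tab <- ChromoteSession$new()

# Replace the default headless User-Agent (which contains "HeadlessChrome")
# with an ordinary desktop Chrome string before navigating.
tab$Network$setUserAgentOverride(
  userAgent = paste0(
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ",
    "AppleWebKit/537.36 (KHTML, like Gecko) ",
    "Chrome/122.0.0.0 Safari/537.36"
  )
)

tab$Page$navigate("https://health.usnews.com/best-hospitals/search")

Note that sites doing more aggressive bot detection (TLS fingerprinting, JavaScript challenges, etc.) may still block the request even with a spoofed User-Agent.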

nclsbarreto commented 2 months ago

Fantastic. That is what I had concluded as well, but I'm not exactly a pro (particularly at HTML), so I wanted to confirm. Thank you.