salimk / Rcrawler

An R web crawler and scraper
http://www.sciencedirect.com/science/article/pii/S2352711017300110

Issues with encoding: texts in downloaded HTML file are garbled #36

Closed yusuzech closed 5 years ago

yusuzech commented 6 years ago

I tried the example in the tutorial and found that the text in the downloaded HTML file is garbled (I opened it in Chrome with UTF-8 encoding). [Screenshot: garbled text; the downloaded version is on the left and the online version is on the right.]

I tried switching the system locale from Chinese to English, but it still doesn't work.

The encoding appears to be recognized correctly:

Id  Url                     Stats     Level  OUT  IN  Http Resp  Content Type  Encoding  Accuracy
1   http://www.glofile.com  finished  0      13   1   200        text/html     UTF-8
#Doesn't work
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_China.936;LC_CTYPE=Chinese (Simplified)_China.936;LC_MONETARY=Chinese (Simplified)_China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_China.936"

#Also doesn't work
> Sys.setlocale("LC_ALL","English")
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
> devtools::session_info()
Session info ------------------------------------------------------------------------------------------------------------
 setting  value                       
 version  R version 3.4.4 (2018-03-15)
 system   x86_64, mingw32             
 ui       RStudio (1.1.383)           
 language (EN)                        
 collate  English_United States.1252  
 tz       America/Los_Angeles         
 date     2018-04-28                  

Packages ----------------------------------------------------------------------------------------------------------------
 package    * version    date       source                             
 base       * 3.4.4      2018-03-15 local                              
 clipr        0.4.0      2017-11-03 CRAN (R 3.4.2)                     
 codetools    0.2-15     2016-10-05 CRAN (R 3.4.4)                     
 compiler     3.4.4      2018-03-15 local                              
 curl         3.1        2017-12-12 CRAN (R 3.4.3)                     
 data.table   1.10.4-3   2017-10-27 CRAN (R 3.4.3)                     
 datasets   * 3.4.4      2018-03-15 local                              
 devtools     1.13.4     2017-11-09 CRAN (R 3.4.3)                     
 digest       0.6.14     2018-01-14 CRAN (R 3.4.3)                     
 doParallel   1.0.11     2017-09-28 CRAN (R 3.4.3)                     
 foreach      1.4.4      2017-12-12 CRAN (R 3.4.3)                     
 graphics   * 3.4.4      2018-03-15 local                              
 grDevices  * 3.4.4      2018-03-15 local                              
 httr         1.3.1      2017-08-20 CRAN (R 3.4.1)                     
 iterators    1.0.9      2017-12-12 CRAN (R 3.4.3)                     
 magrittr     1.5        2014-11-22 CRAN (R 3.4.1)                     
 memoise      1.1.0      2017-04-21 CRAN (R 3.4.1)                     
 methods    * 3.4.4      2018-03-15 local                              
 parallel     3.4.4      2018-03-15 local                              
 purrr        0.2.4      2017-10-18 CRAN (R 3.4.2)                     
 R6           2.2.2      2017-06-17 CRAN (R 3.4.1)                     
 Rcpp         0.12.15    2018-01-20 CRAN (R 3.4.3)                     
 Rcrawler   * 0.1.7-0    2017-11-01 CRAN (R 3.4.4)                     
 rlang        0.1.6      2017-12-21 CRAN (R 3.4.3)                     
 rstudioapi   0.7.0-9000 2018-01-17 Github (rstudio/rstudioapi@109e593)
 selectr      0.3-1      2016-12-19 CRAN (R 3.4.1)                     
 stats      * 3.4.4      2018-03-15 local                              
 stringi      1.1.6      2017-11-17 CRAN (R 3.4.2)                     
 stringr      1.2.0      2017-02-18 CRAN (R 3.4.2)                     
 tools        3.4.4      2018-03-15 local                              
 utils      * 3.4.4      2018-03-15 local                              
 withr        2.1.1      2017-12-19 CRAN (R 3.4.3)                     
 XML          3.98-1.9   2017-06-19 CRAN (R 3.4.1)                     
 xml2         1.1.1      2017-01-24 CRAN (R 3.4.1)                     
 yaml         2.1.16     2017-12-12 CRAN (R 3.4.3)

salimk commented 5 years ago

Hello, thank you for reporting this issue. We have fixed the encoding of saved HTML files; the fix will be available on CRAN in the next few days.
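
For readers who hit the same symptom with their own code, here is a minimal base R sketch of one way to write and read HTML as UTF-8 independently of the session locale. It is not the actual Rcrawler fix, which this thread does not show. The helpers `save_utf8`, `read_utf8`, and `is_utf8_file` are hypothetical names introduced only for illustration.

```r
# Hypothetical helpers, not part of Rcrawler.

# Write text to disk as UTF-8 regardless of the current locale.
save_utf8 <- function(text, path) {
  con <- file(path, open = "wb")   # binary mode: no end-of-line translation
  on.exit(close(con))
  # enc2utf8() converts the strings to UTF-8; useBytes = TRUE stops
  # writeLines() from re-encoding them into the native encoding
  # (for example CP936 or CP1252 on Windows).
  writeLines(enc2utf8(text), con, useBytes = TRUE)
}

# Read the file back and mark the strings as UTF-8.
read_utf8 <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# Check that the raw bytes on disk form valid UTF-8.
is_utf8_file <- function(path) {
  bytes <- readBin(path, what = "raw", n = file.size(path))
  validUTF8(rawToChar(bytes))
}

# Sample page with non-ASCII text (made-up content for the example).
html <- "<html><head><meta charset=\"UTF-8\"></head><body>\u4e2d\u6587 caf\u00e9</body></html>"
path <- tempfile(fileext = ".html")

save_utf8(html, path)
round_trip <- read_utf8(path)
```

If `is_utf8_file(path)` returns `FALSE` for a saved page, the bytes were probably converted to the native encoding somewhere before or during the write, which would explain garbled text when the browser decodes the file as UTF-8.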

salimk commented 5 years ago

Rcrawler v0.1.9 has been released with many new features. Subscribe to our mailing list to stay updated: http://eepurl.com/dMv_7s