lgnbhl / BFS

🇨🇭Search and Download Data from the Swiss Federal Statistical Office
https://lgnbhl.github.io/BFS
GNU General Public License v3.0

Error in pxR::read.px #3

Closed bttomio closed 3 years ago

bttomio commented 4 years ago

Hi,

Thanks a lot for your package. It's really useful!

I'm having trouble in downloading data. Here is my code:

meta_en_ind <- bfs_get_metadata("en") %>%
  bfs_search("production")
print(meta_en_ind)
df_ind <- bfs_get_dataset(url_px = meta_en_ind$url_px[1], language = "en")

This is the error message:

trying URL 'https://www.bfs.admin.ch/bfsstatic/dam/assets/13967917/master'
Content type 'application/octet-stream' length unknown
downloaded 542 KB

Error in pxR::read.px(file.path(tempfile_path), na.strings = c("\".\"",  : 
  The input file is malformed: data and varnames length differ
In addition: Warning message:
In scan(filename, what = "character", sep = "\n", quiet = TRUE,  :
  invalid input found on input connection 'C:\Users\Bruno\AppData\Local\Temp\Rtmp6zrwjd/bfs_data_13967917_en.px'

Could you please help me out with this issue?

Many thanks,

Bruno

Session info:

R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] magrittr_1.5 BFS_0.2.5 OECD_0.2.4 IMFData_0.2.0 anytime_0.3.8 openxlsx_4.1.5 forcats_0.5.0
[8] stringr_1.4.0 purrr_0.3.4 readr_1.3.1 tidyr_1.1.1 ggplot2_3.3.2 tidyverse_1.3.0 tibble_3.0.3
[15] quantmod_0.4.17 TTR_0.24.0 Quandl_2.10.0 xts_0.12-0 zoo_1.8-8 lubridate_1.7.9 dplyr_1.0.2

loaded via a namespace (and not attached):
[1] Rcpp_1.0.5 lattice_0.20-41 prettyunits_1.1.1 assertthat_0.2.1 utf8_1.1.4 pxR_0.42.4
[7] R6_2.4.1 cellranger_1.1.0 plyr_1.8.6 backports_1.1.9 reprex_0.3.0 rsdmx_0.5-14
[13] httr_1.4.2 pillar_1.4.6 progress_1.2.2 rlang_0.4.7 curl_4.3 readxl_1.3.1
[19] rstudioapi_0.11 blob_1.2.1 pins_0.4.3 selectr_0.4-2 RCurl_1.98-1.2 munsell_0.5.0
[25] broom_0.7.0 compiler_4.0.2 modelr_0.1.8 janitor_2.0.1 pkgconfig_2.0.3 tidyselect_1.1.0
[31] XML_3.99-0.5 fansi_0.4.1 crayon_1.3.4 dbplyr_1.4.4 withr_2.2.0 rappdirs_0.3.1
[37] bitops_1.0-6 grid_4.0.2 jsonlite_1.7.0 gtable_0.3.0 lifecycle_0.2.0 DBI_1.1.0
[43] scales_1.1.1 zip_2.1.0 cli_2.0.2 stringi_1.4.6 reshape2_1.4.4 fs_1.5.0
[49] snakecase_0.11.0 xml2_1.3.2 filelock_1.0.2 ellipsis_0.3.1 generics_0.0.2 vctrs_0.3.2
[55] RJSONIO_1.3-1.4 tools_4.0.2 glue_1.4.1 hms_0.5.3 yaml_2.2.1 colorspace_1.4-1
[61] rvest_0.3.6 haven_2.3.1

lgnbhl commented 4 years ago

Hi Bruno,

Thank you for raising this issue. And sorry for the late answer.

The error is produced by the fact that the function read.px() from the R package {pxR}, a dependency of {BFS}, fails to read the specific PX file you selected in your code. As the function bfs_get_dataset() works well with other files, the problem may come from the internal structure of the PX file you selected.

But surprisingly your code works fine on my Mac. This is strange. It seems to be a bug specific to Windows. I will investigate more and hopefully come back with a solution.

Best, Félix

bttomio commented 4 years ago

Thank you for your reply, Félix!

My guess is that it's related to the slash. This "C:\Users\Bruno\AppData\Local\Temp\Rtmp6zrwjd/bfs_data_13967917_en.px" should be "C:/Users/Bruno/AppData/Local/Temp/Rtmp6zrwjd/bfs_data_13967917_en.px" to work correctly in Windows.
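One way to test this guess is to check whether R treats the two spellings of the path as the same file. The sketch below uses only base R, and the file name `bfs_demo.px` is a placeholder for illustration, not a file the package creates. If both checks return TRUE on Windows, the mixed separators are probably not the cause.

```r
# Hypothetical file name, used only for this check.
path_mixed <- file.path(tempdir(), "bfs_demo.px")
writeLines("demo", path_mixed)

# Rewrite the path with forward slashes only
# (the winslash argument only has an effect on Windows).
path_forward <- normalizePath(path_mixed, winslash = "/", mustWork = TRUE)

# Both spellings should point at the same existing file.
file.exists(path_mixed)
file.exists(path_forward)
```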

lgnbhl commented 4 years ago

Okay so I finally took some time to dig further into this bug.

This issue comes from the fileEncoding argument of the scan() function used inside the pxR::read.px() function.

I fixed the issue by forcing the encoding to be "latin1" and pushed the new package version to GitHub. Please let me know if this code now works for you:
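The general mechanism can be illustrated in base R. This is a minimal sketch, not the code inside {pxR} or {BFS}: it assumes the file holds Latin-1 bytes, and the sample string is made up. A byte such as 0xFC ("ü" in Latin-1) is not valid UTF-8, so the source encoding has to be declared before the text can be processed reliably.

```r
# Write a one-line file containing the Latin-1 byte 0xFC ("ü").
tmp <- tempfile(fileext = ".px")
writeBin(c(charToRaw("Z"), as.raw(0xFC), charToRaw("rich\n")), tmp)

# Read the bytes without declaring an encoding.
raw_lines <- readLines(tmp, warn = FALSE)
validUTF8(raw_lines)    # the lone Latin-1 byte is not valid UTF-8

# Declare the source encoding and convert to UTF-8.
fixed_lines <- iconv(raw_lines, from = "latin1", to = "UTF-8")
validUTF8(fixed_lines)
```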

devtools::install_github("lgnbhl/BFS")
library(BFS)
library(magrittr)

meta_en_ind <- bfs_get_metadata("en") %>%
  bfs_search("production")
print(meta_en_ind)
df_ind <- bfs_get_dataset(url_px = meta_en_ind$url_px[1], language = "en")

Once again, a big thanks for sharing this bug with me!

bttomio commented 4 years ago

Hi Félix! Thanks a lot for your reply.

I'm still getting an error with the last line of the code. Could you please check it out?

devtools::install_github("lgnbhl/BFS")
#> Skipping install of 'BFS' from a github remote, the SHA1 (42f9be53) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(BFS)
library(magrittr)

meta_en_ind <- bfs_get_metadata("en") %>%
  bfs_search("production")
print(meta_en_ind)
#> # A tibble: 3 x 6
#>   title         observation_peri~ published  source   url_bfs        url_px     
#>   <chr>         <chr>             <chr>      <chr>    <chr>          <chr>      
#> 1 Secondary Se~ 1.10.2011-30.6.2~ 20.08.2020 Federal~ https://www.b~ https://ww~
#> 2 Secondary Se~ 1.1.1999-30.6.20~ 20.08.2020 Federal~ https://www.b~ https://ww~
#> 3 Secondary Se~ 1999-2019         25.05.2020 Federal~ https://www.b~ https://ww~
df_ind <- bfs_get_dataset(url_px = meta_en_ind$url_px[1], language = "en")
#> Warning in scan(filename, what = "character", sep = "\n",
#> quiet = TRUE, : invalid input found on input connection 'C:
#> \Users\Bruno\AppData\Local\Temp\RtmpOsARHl/bfs_data_13967917_en.px'
#> Error in pxR::read.px(file.path(tempfile_path), encoding = "latin1", na.strings = c("\".\"", : The input file is malformed: data and varnames length differ

Created on 2020-10-01 by the reprex package (v0.3.0)

lgnbhl commented 3 years ago

Hi Bruno,

The bug has been fixed in the latest CRAN version of BFS (now 0.3.0). The R code above now works on my Windows machine.

The fix was kindly shared by Fachstelle Statistik Kanton Zug, i.e. @statzg.

Please let me know if it works for you so I can close this issue :).

Best, Félix

bttomio commented 3 years ago

Hi Félix,

Thanks a lot for your reply. Glad that you could find a solution with the help of @statzg. It's working now on Windows. Nevertheless, the data is in German. I've also tried to run the code on Linux (Ubuntu), where it is not working at all. Here is some feedback:

After typing df_ind <- bfs_get_dataset(url_px = meta_en_ind$url_px[1], language = "en"), I'm getting this error message: Failed to translate name. If I repeat the command, it works. Nonetheless, it's not considering the language option. As you can see, it's in German:

    > df_ind
    # A tibble: 32,472 x 6
       month   branch        variable   indices_changes adjustment  value
       <fct>   <fct>         <fct>      <fct>           <fct>       <dbl>
     1 2010M10 B-E Industrie Produktion Indizes         Unbereinigt 101. 
     2 2010M11 B-E Industrie Produktion Indizes         Unbereinigt 108. 
     3 2010M12 B-E Industrie Produktion Indizes         Unbereinigt 105. 
     4 2011M01 B-E Industrie Produktion Indizes         Unbereinigt  89.5
     5 2011M02 B-E Industrie Produktion Indizes         Unbereinigt  95.8
     6 2011M03 B-E Industrie Produktion Indizes         Unbereinigt 101. 
     7 2011M04 B-E Industrie Produktion Indizes         Unbereinigt  90.3
     8 2011M05 B-E Industrie Produktion Indizes         Unbereinigt 104. 
     9 2011M06 B-E Industrie Produktion Indizes         Unbereinigt  94.3
    10 2011M07 B-E Industrie Produktion Indizes         Unbereinigt  93.2
    # ... with 32,462 more rows

Therefore, it's working, but not as expected.

Here is the information for this session on Windows:

    R version 4.0.3 (2020-10-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 10 x64 (build 19042)

    Matrix products: default

    locale:
    [1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
    [4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

    other attached packages:
    [1] magrittr_2.0.1 BFS_0.3.0     

    loaded via a namespace (and not attached):
     [1] ggrepel_0.9.1     Rcpp_1.0.6        lubridate_1.7.10  lattice_0.20-41   prettyunits_1.1.1 ps_1.6.0          zoo_1.8-9        
     [8] assertthat_0.2.1  rprojroot_2.0.2   utf8_1.2.1        R6_2.5.0          plyr_1.8.6        pxR_0.42.4        backports_1.2.1  
    [15] httr_1.4.2        ggplot2_3.3.3     pillar_1.5.1      rlang_0.4.10      progress_1.2.2    curl_4.3          rstudioapi_0.13  
    [22] callr_3.5.1       pins_0.4.5        desc_1.3.0        devtools_2.3.2    selectr_0.4-2     stringr_1.4.0     munsell_0.5.0    
    [29] anytime_0.3.9     compiler_4.0.3    janitor_2.1.0     pkgconfig_2.0.3   pkgbuild_1.2.0    tidyselect_1.1.0  tibble_3.1.0     
    [36] fansi_0.4.2       crayon_1.4.1      dplyr_1.0.5       withr_2.4.1       rappdirs_0.3.3    grid_4.0.3        jsonlite_1.7.2   
    [43] gtable_0.3.0      lifecycle_1.0.0   DBI_1.1.1         scales_1.1.1      cli_2.3.1         stringi_1.5.3     cachem_1.0.4     
    [50] reshape2_1.4.4    fs_1.5.0          remotes_2.2.0     testthat_3.0.2    snakecase_0.11.0  xml2_1.3.2        filelock_1.0.2   
    [57] ellipsis_0.3.1    xts_0.12.1        generics_0.1.0    vctrs_0.3.6       cowplot_1.1.1     tidyRSS_2.0.3     tools_4.0.3      
    [64] RJSONIO_1.3-1.4   glue_1.4.2        purrr_0.3.4       hms_1.0.0         yaml_2.2.1        processx_3.5.0    pkgload_1.2.0    
    [71] fastmap_1.1.0     colorspace_2.0-0  sessioninfo_1.1.1 rvest_1.0.0       memoise_2.0.0     usethis_2.0.1

On Linux, this is the error message after df_ind <- bfs_get_dataset(url_px = meta_en_ind$url_px[1], language = "en"):

    trying URL 'https://www.bfs.admin.ch/bfsstatic/dam/assets/16044446/master'
    downloaded 574 KB

    Error in gsub("\"......\"", "\"....\"", x, fixed = TRUE) : 
      input string 16 is invalid in this locale
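This message usually means that a string function received bytes that are not valid in the session's encoding (here, a UTF-8 locale). Below is a minimal sketch of the failure and one possible workaround, using base R only. The sample string is invented, and this is not necessarily the fix that was applied in the package.

```r
# "Zürich" in Latin-1, wrapped in double quotes; byte 0xFC is invalid as UTF-8.
x <- rawToChar(as.raw(c(0x22, 0x5A, 0xFC, 0x72, 0x69, 0x63, 0x68, 0x22)))

# In a UTF-8 locale this call may fail with an "invalid in this locale" error;
# tryCatch() captures the message so the script keeps running either way.
attempt <- tryCatch(
  gsub("\"", "'", x, fixed = TRUE),
  error = function(e) conditionMessage(e)
)

# Converting to UTF-8 first makes the replacement safe regardless of locale.
x_utf8 <- iconv(x, from = "latin1", to = "UTF-8")
cleaned <- gsub("\"", "'", x_utf8, fixed = TRUE)
```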

Here is the session info for this case:

  R version 4.0.4 (2021-02-15)
  Platform: x86_64-pc-linux-gnu (64-bit)
  Running under: Ubuntu 20.04.2 LTS

  Matrix products: default
  BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
  LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

  locale:
    [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
  [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

  attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base     

  other attached packages:
    [1] BFS_0.3.0      magrittr_2.0.1

  loaded via a namespace (and not attached):
    [1] tidyRSS_2.0.3     httr_1.4.2        pkgload_1.2.0     jsonlite_1.7.2    viridisLite_0.3.0 assertthat_0.2.1  selectr_0.4-2    
  [8] yaml_2.2.1        remotes_2.2.0     progress_1.2.2    ggrepel_0.9.1     sessioninfo_1.1.1 pillar_1.5.1      backports_1.2.1  
  [15] lattice_0.20-41   glue_1.4.2        digest_0.6.27     rvest_1.0.0       snakecase_0.11.0  colorspace_2.0-0  plyr_1.8.6       
  [22] cowplot_1.1.1     htmltools_0.5.1.1 pkgconfig_2.0.3   devtools_2.3.2    purrr_0.3.4       scales_1.1.1      webshot_0.5.2    
  [29] processx_3.5.0    svglite_2.0.0     tibble_3.1.0      generics_0.1.0    ggplot2_3.3.3     usethis_2.0.1     ellipsis_0.3.1   
  [36] cachem_1.0.4      withr_2.4.1       janitor_2.1.0     cli_2.3.1         RJSONIO_1.3-1.4   crayon_1.4.1      memoise_2.0.0    
  [43] evaluate_0.14     ps_1.6.0          fs_1.5.0          fansi_0.4.2       anytime_0.3.9     xts_0.12.1        xml2_1.3.2       
  [50] pkgbuild_1.2.0    pins_0.4.5        tools_4.0.4       prettyunits_1.1.1 hms_1.0.0         lifecycle_1.0.0   stringr_1.4.0    
  [57] munsell_0.5.0     pxR_0.42.4        callr_3.5.1       kableExtra_1.3.4  compiler_4.0.4    systemfonts_1.0.1 rlang_0.4.10     
  [64] grid_4.0.4        rstudioapi_0.13   rappdirs_0.3.3    rmarkdown_2.7     testthat_3.0.2    gtable_0.3.0      DBI_1.1.1        
  [71] curl_4.3          reshape2_1.4.4    R6_2.5.0          gridExtra_2.3     zoo_1.8-9         lubridate_1.7.10  knitr_1.31       
  [78] dplyr_1.0.5       fastmap_1.1.0     utf8_1.2.1        filelock_1.0.2    rprojroot_2.0.2   desc_1.3.0        stringi_1.5.3    
  [85] Rcpp_1.0.6        vctrs_0.3.6       tidyselect_1.1.0  xfun_0.22

Thanks a lot for your package, and sorry for bringing up another issue. Please let me know if I can help you somehow.

Best,

Bruno

lgnbhl commented 3 years ago

Hi @bttomio,

I just merged the fix proposed by @zambujo. Could you please let me know if the package is now working on Ubuntu?

Please note that the function bfs_get_dataset() is now only providing data in the German language.

Best, Félix

bttomio commented 3 years ago

Hi Félix!

Thanks a lot for this update. It's cool to see your package progressing. I'm looking forward to the next updates, notably the ability to extract data in English or French.

With the current version, the error message is gone on Ubuntu.

Best, Bruno

PS: here is my session information just for the record:

R version 4.0.5 (2021-03-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.2 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
 [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_2.0.1 BFS_0.3.0     

loaded via a namespace (and not attached):
 [1] tidyRSS_2.0.3     httr_1.4.2        pkgload_1.2.1     jsonlite_1.7.2    viridisLite_0.4.0 assertthat_0.2.1  selectr_0.4-2     yaml_2.2.1       
 [9] remotes_2.3.0     progress_1.2.2    ggrepel_0.9.1     sessioninfo_1.1.1 pillar_1.6.0      backports_1.2.1   lattice_0.20-44   glue_1.4.2       
[17] digest_0.6.27     rvest_1.0.0       snakecase_0.11.0  colorspace_2.0-1  cowplot_1.1.1     htmltools_0.5.1.1 plyr_1.8.6        pkgconfig_2.0.3  
[25] devtools_2.4.1    purrr_0.3.4       scales_1.1.1      webshot_0.5.2     processx_3.5.2    svglite_2.0.0     tibble_3.1.1      generics_0.1.0   
[33] ggplot2_3.3.3     usethis_2.0.1     ellipsis_0.3.2    cachem_1.0.4      withr_2.4.2       janitor_2.1.0     cli_2.5.0         RJSONIO_1.3-1.4  
[41] crayon_1.4.1      memoise_2.0.0     evaluate_0.14     ps_1.6.0          fs_1.5.0          fansi_0.4.2       anytime_0.3.9     xts_0.12.1       
[49] xml2_1.3.2        pkgbuild_1.2.0    pins_0.4.5        tools_4.0.5       prettyunits_1.1.1 hms_1.0.0         lifecycle_1.0.0   stringr_1.4.0    
[57] munsell_0.5.0     pxR_0.42.4        callr_3.7.0       kableExtra_1.3.4  compiler_4.0.5    systemfonts_1.0.1 rlang_0.4.11      grid_4.0.5       
[65] rstudioapi_0.13   rappdirs_0.3.3    rmarkdown_2.8     testthat_3.0.2    gtable_0.3.0      DBI_1.1.1         curl_4.3.1        reshape2_1.4.4   
[73] R6_2.5.0          zoo_1.8-9         lubridate_1.7.10  knitr_1.33        dplyr_1.0.6       fastmap_1.1.0     utf8_1.2.1        filelock_1.0.2   
[81] rprojroot_2.0.2   desc_1.3.0        stringi_1.5.3     Rcpp_1.0.6        vctrs_0.3.8       tidyselect_1.1.1  xfun_0.22