lgnbhl / BFS

πŸ‡¨πŸ‡­ Search and Download Data from the Swiss Federal Statistical Office
https://lgnbhl.github.io/BFS
GNU General Public License v3.0

pxweb_advanced_get Too Many Requests (RFC 6585) (HTTP 429) error #7

Open elliotbeck opened 1 year ago

elliotbeck commented 1 year ago

Dear FΓ©lix

Thanks a lot for the nice package you provide! I came across the following issue when running a rather large query:

> BFS::bfs_get_data(number_bfs = "px-x-1003020000_103", language = "de")
|=========================================================                            |  67%
Error in pxweb_advanced_get(url = url, query = query, verbose = verbose) : 
  Too Many Requests (RFC 6585) (HTTP 429).

Maybe this could be resolved by increasing the batch size or adding a delay?

Best, Elliot

lgnbhl commented 1 year ago

Dear Elliot,

Thank you very much for using my package and reporting this issue!

The "BFS" package is using under the hood the R package {pxweb} to query the BFS API. After a quick look I haven't found any option to increase the batch size or add a delay. I will investigate more. Feel free to share in this issue any discovery or suggestion from your side.

Another solution is to reduce the size of the dataset using the query argument of bfs_get_data(). I discovered another bug in my code when using the query argument and just pushed the fix to GitHub, so the following R code works only with the dev version of the BFS package. I will push this fix to CRAN soon.

Please let me know if this works for you.

# Install dev version
devtools::install_github("lgnbhl/BFS")
library(BFS)

# choose a BFS number and language
number_bfs <- "px-x-1003020000_103"
language <- "en"

# create the BFS api url
pxweb_api_url <- paste0("https://www.pxweb.bfs.admin.ch/api/v1/", 
                        language, "/", number_bfs, "/", number_bfs, ".px")

# Get BFS table metadata using {pxweb}
px_meta <- pxweb::pxweb_get(pxweb_api_url)

# list variables items
str(px_meta$variables)

# Manually create BFS query dimensions
# Use `code` and `values` elements in `px_meta$variables`
# Use "*" to select all
dims <- list("Jahr" = c("2020", "2021"),
             "Monat" = c("YYYY"),
             "Indikator" = c("*"))

# Query BFS data with specific dimensions
BFS::bfs_get_data(
  number_bfs = number_bfs,
  language = language,
  query = dims
  )
# A tibble: 4 Γ— 4
  Year  Month             Indicator    Hotel…¹
  <chr> <chr>             <chr>          <dbl>
1 2020  Total of the year Arrivals      1.07e7
2 2020  Total of the year Overnight s…  2.37e7
3 2021  Total of the year Arrivals      1.37e7
4 2021  Total of the year Overnight s…  2.96e7
# … with abbreviated variable name
#   ¹​`Hotel sector: arrivals and overnight stays of open establishments`

Best, Felix

philipp-baumann commented 1 year ago

Short question @lgnbhl: I get this message too, but I guess the BFS has purposely added limits on the API for security reasons. Is there another solution to get around the API limits in a clever way, apart from using VPN switchers and other networking magic?

lgnbhl commented 1 year ago

@philipp-baumann no, I am not aware of any other solution to get around the API limits.

lgnbhl commented 1 year ago

Hi @elliotbeck and @philipp-baumann

I ran the R code shared in this issue again and it now works just fine for me.

Is the following R code still throwing an error for you?

BFS::bfs_get_data(number_bfs = "px-x-1003020000_103", language = "de")

Maybe they have changed something in the BFS API or in the {pxweb} R package since this issue was submitted...

By the way, the new version of the BFS package (for now only available on GitHub) provides a new function to download any file locally by BFS number (or asset number). For a large PX file, this speeds up the R code a lot.

devtools::install_github("lgnbhl/BFS")

BFS::bfs_download_asset(
  number_bfs = "px-x-1003020000_103", #number_asset also possible
  destfile = "px-x-1003020000_103.px"
)

library(pxR) # install.packages("pxR")
large_dataset <- pxR::read.px(filename = "px-x-1003020000_103.px") |>
  as.data.frame()
## # A tibble: 539,448 Γ— 6
##    Indikator    Herkunftsland         Tourismusregion Monat       Jahr     value
##    <fct>        <fct>                 <fct>           <fct>       <fct>    <dbl>
##  1 AnkΓΌnfte     Herkunftsland - Total Schweiz         Jahrestotal 2005  13802796
##  2 LogiernΓ€chte Herkunftsland - Total Schweiz         Jahrestotal 2005  32943736
##  3 AnkΓΌnfte     Schweiz               Schweiz         Jahrestotal 2005   6573945
##  4 LogiernΓ€chte Schweiz               Schweiz         Jahrestotal 2005  14622420
##  5 AnkΓΌnfte     Baltische Staaten     Schweiz         Jahrestotal 2005     13115
##  6 LogiernΓ€chte Baltische Staaten     Schweiz         Jahrestotal 2005     32871
##  7 AnkΓΌnfte     Deutschland           Schweiz         Jahrestotal 2005   2007203
##  8 LogiernΓ€chte Deutschland           Schweiz         Jahrestotal 2005   5563695
##  9 AnkΓΌnfte     Frankreich            Schweiz         Jahrestotal 2005    542502
## 10 LogiernΓ€chte Frankreich            Schweiz         Jahrestotal 2005   1225619
## # β„Ή 539,438 more rows

Please note that reading a PX file using pxR::read.px() gives access only to the German version.
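
If English column names are needed anyway, one workaround sketch is to rename the German columns manually after reading the PX file. This only covers the column names, not the factor levels, and the mapping below is inferred from the German and English outputs shown earlier in this thread.

library(dplyr)

# Hypothetical manual mapping (column names only, values stay in German);
# names taken from the German/English outputs shown above.
large_dataset_en <- large_dataset |>
  rename(
    Indicator = Indikator,
    `Visitors' country of residence` = Herkunftsland,
    `Tourist region` = Tourismusregion,
    Month = Monat,
    Year = Jahr
  )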

philipp-baumann commented 1 year ago

Thanks! I'll give it a test tomorrow and let you know. Cheers

elliotbeck commented 1 year ago

I still get the Too Many Requests (RFC 6585) (HTTP 429) error. Best, Elliot

philipp-baumann commented 1 year ago

With a Swiss IP I get "px-x-1003020000_103.px" without error, both with the batched approach and with the new BFS::bfs_download_asset(). Is there maybe an API limit per time window? I am using the latest CRAN version of {pxweb}.

r$> sessioninfo::session_info()
─ Session info ───────────────────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.1.3 (2022-03-10)
 os       Ubuntu 22.04.2 LTS
 system   x86_64, linux-gnu
 ui       X11
 language
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/Zurich
 date     2023-07-27
 pandoc   2.9.2.1 @ /usr/bin/pandoc

─ Packages ───────────────────────────────────────────────────────────────────────────────
 ! package     * version    date (UTC) lib source
   anytime       0.3.9      2020-08-27 [1] CRAN (R 4.1.3)
   backports     1.4.1      2021-12-13 [1] CRAN (R 4.1.2)
   BFS           0.5.1.999  2023-07-27 [1] Github (lgnbhl/BFS@a583276)
   bit           4.0.5      2022-11-15 [1] CRAN (R 4.1.3)
   bit64         4.0.5      2020-08-30 [1] CRAN (R 4.1.2)
   blob          1.2.3      2022-04-10 [1] CRAN (R 4.1.3)
   cachem        1.0.8      2023-05-01 [1] CRAN (R 4.1.3)
   callr         3.7.3      2022-11-02 [1] CRAN (R 4.1.3)
   checkmate     2.2.0      2023-04-27 [1] CRAN (R 4.1.3)
   cli           3.6.1      2023-03-23 [1] CRAN (R 4.1.3)
   crancache     0.0.0.9001 2022-01-20 [1] Github (r-lib/crancache@7ea4e47)
   cranlike      1.0.2      2018-11-26 [1] CRAN (R 4.1.2)
   crayon        1.5.2      2022-09-29 [1] CRAN (R 4.1.3)
 V curl          5.0.0      2023-06-07 [1] CRAN (R 4.1.3) (on disk 5.0.1)
   DBI           1.1.3      2022-06-18 [1] RSPM (R 4.1.0)
   debugme       1.1.0      2017-10-22 [1] CRAN (R 4.1.2)
   desc          1.4.2      2022-09-08 [1] CRAN (R 4.1.3)
   digest        0.6.33     2023-07-07 [1] CRAN (R 4.1.3)
   dplyr         1.1.2      2023-04-20 [1] CRAN (R 4.1.3)
   fansi         1.0.4      2023-01-22 [1] CRAN (R 4.1.3)
   fastmap       1.1.1      2023-02-24 [1] CRAN (R 4.1.3)
   generics      0.1.3      2022-07-05 [1] CRAN (R 4.1.3)
   glue          1.6.2      2022-02-24 [1] CRAN (R 4.1.2)
   httr          1.4.6      2023-05-08 [1] CRAN (R 4.1.3)
   httr2         0.2.3      2023-05-08 [1] CRAN (R 4.1.3)
   janitor       2.2.0      2023-02-02 [1] CRAN (R 4.1.3)
   jsonlite      1.8.7      2023-06-29 [1] CRAN (R 4.1.3)
   lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.1.3)
   lubridate     1.9.2      2023-02-10 [1] CRAN (R 4.1.3)
   magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.1.3)
   memoise       2.0.1      2021-11-26 [1] CRAN (R 4.1.2)
   parsedate     1.2.1      2021-04-20 [1] CRAN (R 4.1.2)
   pillar        1.9.0      2023-03-22 [1] CRAN (R 4.1.3)
   pkgbuild      1.4.0      2022-11-27 [1] CRAN (R 4.1.3)
   pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.1.2)
   plyr        * 1.8.7      2022-03-24 [1] CRAN (R 4.1.3)
   prettyunits   1.1.1      2020-01-24 [1] CRAN (R 4.1.2)
   processx      3.8.2      2023-06-30 [1] CRAN (R 4.1.3)
   ps            1.7.5      2023-04-18 [1] CRAN (R 4.1.3)
   purrr         1.0.1      2023-01-10 [1] CRAN (R 4.1.3)
   pxR         * 0.42.7     2022-11-23 [1] CRAN (R 4.1.3)
   pxweb         0.16.2     2022-10-31 [1] CRAN (R 4.1.3)
   R6            2.5.1      2021-08-19 [1] CRAN (R 4.1.2)
   rappdirs      0.3.3      2021-01-31 [1] CRAN (R 4.1.2)
   Rcpp          1.0.11     2023-07-06 [1] CRAN (R 4.1.3)
   rematch2      2.1.2      2020-05-01 [1] CRAN (R 4.1.2)
   remotes     * 2.4.2      2021-11-30 [1] CRAN (R 4.1.3)
   reshape2    * 1.4.4      2020-04-09 [1] CRAN (R 4.1.3)
   RJSONIO     * 1.3-1.6    2021-09-16 [1] CRAN (R 4.1.2)
   rlang         1.1.1      2023-04-28 [1] CRAN (R 4.1.3)
   rprojroot     2.0.3      2022-04-02 [1] CRAN (R 4.1.3)
   RSQLite       2.2.14     2022-05-07 [1] CRAN (R 4.1.3)
   sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.1.2)
   snakecase     0.11.0     2019-05-25 [1] CRAN (R 4.1.3)
   stringi       1.7.12     2023-01-11 [1] CRAN (R 4.1.3)
   stringr     * 1.5.0      2022-12-02 [1] CRAN (R 4.1.3)
   tibble        3.2.1      2023-03-20 [1] CRAN (R 4.1.3)
   tidyRSS       2.0.7      2023-03-05 [1] CRAN (R 4.1.3)
   tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.1.3)
   timechange    0.2.0      2023-01-11 [1] CRAN (R 4.1.3)
   utf8          1.2.3      2023-01-31 [1] CRAN (R 4.1.3)
   vctrs         0.6.3      2023-06-14 [1] CRAN (R 4.1.3)
   withr         2.5.0      2022-03-03 [1] CRAN (R 4.1.3)
   xml2          1.3.5      2023-07-06 [1] CRAN (R 4.1.3)

 [1] /home/philipp/R/x86_64-pc-linux-gnu-library/4.1
 [2] /opt/R/4.1.3/lib/R/library

 V ── Loaded and on-disk version mismatch.

──────────────────────────────────────────────────────────────────────────────────────────
lgnbhl commented 1 year ago

@philipp-baumann yes, there is a time window limit of 10: https://www.pxweb.bfs.admin.ch/api/v1/de/?config.
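
For reference, a minimal sketch to inspect that rate-limit configuration directly from R. The exact field names returned by the config endpoint (e.g. maxCalls, timeWindow) follow the generic PXWEB API config format and are an assumption here; check the returned list to be sure.

# Read the BFS PXWEB API configuration (JSON) directly
config <- jsonlite::fromJSON("https://www.pxweb.bfs.admin.ch/api/v1/de/?config")
str(config)
# The rate-limit fields are assumed to be named config$maxCalls and
# config$timeWindow (generic PXWEB naming); inspect str(config) to confirm.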

philipp-baumann commented 1 year ago

> @philipp-baumann yes, there is a time window limit of 10: https://www.pxweb.bfs.admin.ch/api/v1/de/?config.

Thanks @lgnbhl for pointing to that config.

lgnbhl commented 1 year ago

I ran BFS::bfs_get_data(number_bfs = "px-x-1003020000_103") earlier today and got the error message again. But now the function works again. I am not sure how to explain the change; it could be the BFS API server...

The error is not caused by a new version of the {pxweb} R package (currently 0.16.2) as they have not pushed a new version since 2022-10-31.

I have updated the documentation to reflect our discussion: https://github.com/lgnbhl/BFS#too-many-requests-error-message

Best, Felix

lgnbhl commented 1 year ago

Please find below an R script showing a programmatic solution to query a large BFS dataset.

This R code creates a list of smaller queries and joins the results using purrr::pmap_dfr().

To avoid getting an error message due to the BFS API limits, I added a new argument "delay" to bfs_get_data(), which calls Sys.sleep(). The code below adds a 10-second delay before each query.

Be sure to have at least v0.5.6 of the BFS package installed.

# devtools::install_github("lgnbhl/BFS") # for BFS v0.5.6
library(BFS)
library(purrr)

# should at least use version 0.5.6
packageVersion("BFS") >= "0.5.6"

# choose a BFS number and language
number_bfs <- "px-x-1003020000_103"
language <- "en"

# get metadata
meta <- bfs_get_metadata(number_bfs = number_bfs, language = language)

# create dimension object
dims <- meta$values
names(dims) <- meta$code

# split 1st dimension "Jahr" in chunks of 1 element
# NOTE: depending on the data, another dimension may need to be used, e.g. dims[[2]]
dims1 <- dims[[1]]
dim_splited <- split(dims1, cut(seq_along(dims1), length(dims1), labels = FALSE))
names(dim_splited) <- rep(names(dims)[1], length(dim_splited))

# create query list
query_list <- vector(mode = "list", length = length(dim_splited))
for (i in seq_along(dim_splited)) {
  query_list[[i]] <- c(dim_splited[i], dims[-1])
}
names(query_list) <- rep("query", length(query_list))

# list of arguments for loop
args_list <- list(
  number_bfs = rep(number_bfs, length(query_list)),
  language = rep(language, length(query_list)),
  delay = rep(10, length(query_list)), # 10 seconds delay before query
  query = query_list
)

# loop with smaller queries using bfs_get_data()
df <- purrr::pmap_dfr(.l = args_list, .f = bfs_get_data, .progress = TRUE)
df
## # A tibble: 539,448 Γ— 6
##   Year  Month             `Tourist region` Visitors' country of resi…¹ Indicator
##   <chr> <chr>             <chr>            <chr>                       <chr>    
##  1 2005  Total of the year Switzerland      Visitors' country of resid… Arrivals 
##  2 2005  Total of the year Switzerland      Visitors' country of resid… Overnigh…
##  3 2005  Total of the year Switzerland      Switzerland                 Arrivals 
##  4 2005  Total of the year Switzerland      Switzerland                 Overnigh…
##  5 2005  Total of the year Switzerland      Baltic States               Arrivals 
##  6 2005  Total of the year Switzerland      Baltic States               Overnigh…
##  7 2005  Total of the year Switzerland      Germany                     Arrivals 
##  8 2005  Total of the year Switzerland      Germany                     Overnigh…
##  9 2005  Total of the year Switzerland      France                      Arrivals 
## 10 2005  Total of the year Switzerland      France                      Overnigh…
## # β„Ή 539,438 more rows
## # β„Ή abbreviated name: ¹​`Visitors' country of residence`
## # β„Ή 1 more variable:
## #   `Hotel sector: arrivals and overnight stays of open establishments` <dbl>

@philipp-baumann @elliotbeck feel free to let me know if this solution works for you :)