rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat
Other
234 stars 46 forks source link

get_eurostat_json fails with time / time_period parameter on quarterly data #260

Closed pitkant closed 8 months ago

pitkant commented 1 year ago

The following code produces an error:

tbl_hp <- get_eurostat("TIPSHO40", filter = list(time = c(2005:2022)))
Error in `[.data.frame`(variables, rev(names(variables))) : 
  undefined columns selected

whereas this works normally:

tbl_hp <- get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005))
Table TIPSHO40 cached at /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//RtmpjICMVw/eurostat/TIPSHO40_date_code_FF.rds

With other datasets using time parameters (e.g. filter = list(time(2005:2023)) ) works normally so this issue might be related to the fact that this dataset is quarterly. We will have to explore this.

> head(tbl_hp)
# A tibble: 6 × 5
  freq  unit  geo   time       values
  <chr> <chr> <chr> <date>      <dbl>
1 Q     INX_Q BE    2005-01-01   69.7
2 Q     INX_Q BE    2005-04-01   71.2
3 Q     INX_Q BE    2005-07-01   73.8
4 Q     INX_Q BE    2005-10-01   75.2
5 Q     INX_Q BE    2006-01-01   76.7
6 Q     INX_Q BE    2006-04-01   78.8
> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.4

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Europe/Helsinki
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] eurostat_3.8.3

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3     utf8_1.2.3         generics_0.1.3     tidyr_1.3.0        class_7.3-21      
 [6] xml2_1.3.4         KernSmooth_2.23-20 stringi_1.7.12     hms_1.1.3          magrittr_2.0.3    
[11] countrycode_1.4.0  timechange_0.2.0   ISOweek_0.6-2      cellranger_1.1.0   rprojroot_2.0.3   
[16] plyr_1.8.8         jsonlite_1.8.4     e1071_1.7-13       backports_1.4.1    httr_1.4.6        
[21] purrr_1.0.1        fansi_1.0.4        regions_0.1.8      bibtex_0.5.1       cli_3.6.1         
[26] crayon_1.5.2       rlang_1.1.1        bit64_4.0.5        withr_2.5.0        parallel_4.3.0    
[31] tools_4.3.0        tzdb_0.4.0         dplyr_1.1.2        here_1.0.1         broom_1.0.4       
[36] curl_5.0.0         assertthat_0.2.1   vctrs_0.6.2        R6_2.5.1           proxy_0.4-27      
[41] lifecycle_1.0.3    lubridate_1.9.2    classInt_0.4-9     RefManageR_1.4.0   stringr_1.5.0     
[46] bit_4.0.5          vroom_1.6.3        pkgconfig_2.0.3    pillar_1.9.0       glue_1.6.2        
[51] Rcpp_1.0.10        tibble_3.2.1       tidyselect_1.2.0   rstudioapi_0.14    readr_2.1.4       
[56] compiler_4.3.0     readxl_1.4.2      
pitkant commented 1 year ago

Some notes:

"{ "error": [{"status": 400,"id": 140,"label": "ERR_TIME_DIMENSION_DUPLICATED: Time dimension is duplicated. Time dimension can not be defined in more than one clause and can not be defined as TimedimensionValue and TIME_PERIOD in the same request."}]}"

and the following error message in R:

eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005, untilTimePeriod = 2007))
Request failed [400]. Retrying in 1 seconds...
Request failed [400]. Retrying in 1 seconds...
Error in get_eurostat_json(id, filters, type = type, stringsAsFactors = stringsAsFactors,  : 
  The requested url cannot be found within the get_eurostat_json function: https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/TIPSHO40?sinceTimePeriod=2005&untilTimePeriod=2007&format=JSON&lang=EN"

(Slightly unrelated note to this issue: It would be a good idea to take the error message label and print it to R users as well)

eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(time_period = 2005))
Error in `[.data.frame`(variables, rev(names(variables))) : 
  undefined columns selected
eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005))
eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(untilTimePeriod = 2007))
eurostat::clean_eurostat_cache()
tbl_hp <- get_eurostat("TIPSHO40", filter = list(lastTimePeriod = 3))

This issue might actually be a problem on Eurostat's side.

pitkant commented 1 year ago

I have received a reply from Eurostat User Support. Their workaround was to (in the specific case of time frame 2005-2007) retrieve each time dimension 1 by 1 with the following query:

https://ec.europa.eu/eurostat/api/dissemination/statistics/1.0/data/TIPSHO40?format=JSON&lang=en&freq=Q&unit=INX_Q&geo=BE&geo=BG&geo=CZ&geo=DK&geo=DE&geo=EE&geo=IE&geo=EL&geo=ES&geo=FR&geo=HR&geo=IT&geo=CY&geo=LV&geo=LT&geo=LU&geo=HU&geo=MT&geo=NL&geo=AT&geo=PL&geo=PT&geo=RO&geo=SI&geo=SK&geo=FI&geo=SE&time=2005-Q1&time=2005-Q2&time=2005-Q3&time=2005-Q4&time=2006-Q1&time=2006-Q2&time=2006-Q3&time=2006-Q4&time=2007-Q1&time=2007-Q2&time=2007-Q3&time=2007-Q4

It is unlikely we will implement this workaround to eurostat package as Eurostat developers are apparently working with the issue right now. We will continue monitoring this.

pitkant commented 1 year ago

I tested the queries above and things seem to work now, somewhat.

The method as described in the documentation (using both sinceTimePeriod and untilTimePeriod in the same query) work:

get_eurostat("TIPSHO40", filter = list(sinceTimePeriod = 2005, untilTimePeriod = 2007))

Table TIPSHO40 cached at /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//RtmpNkdn9b/eurostat/TIPSHO40_date_code_FF.rds
# A tibble: 972 × 5
   freq  unit  geo   time       values
   <chr> <chr> <chr> <date>      <dbl>
 1 Q     INX_Q BE    2005-01-01   69.7
 2 Q     INX_Q BE    2005-04-01   71.2
 3 Q     INX_Q BE    2005-07-01   73.8
 4 Q     INX_Q BE    2005-10-01   75.2
 5 Q     INX_Q BE    2006-01-01   76.7
 6 Q     INX_Q BE    2006-04-01   78.8
 7 Q     INX_Q BE    2006-07-01   81.1
 8 Q     INX_Q BE    2006-10-01   81.7
 9 Q     INX_Q BE    2007-01-01   83.6
10 Q     INX_Q BE    2007-04-01   85.3
# ℹ 962 more rows

However, the following example still doesn't work:

> eurostat::clean_eurostat_cache()
Deleted .rds/.RData files from /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//RtmpNkdn9b/eurostat

> get_eurostat("TIPSHO40", filter = list(time = c(2005:2008)))
Error in `[.data.frame`(variables, rev(names(variables))) : 
  undefined columns selected
pitkant commented 1 year ago

Eurostat user support answered me that the allowed filter = list(sinceTimePeriod = 2005, untilTimePeriod = 2007) works now and, as already seen before, I can confirm it.

The explicitly not-accepted filtering filters = list(time = c(2005:2008) does not work in case of TIPSHO40 dataset but it works with other datasets. Maybe this is related to TIPSHO40 being quarterly data? I asked Eurostat to clarify their documentation on this.

What should work also on quarterly data, according to Eurostat documentation, is filters = list(time = 2005). It does not work and I've asked Eurostat for clarification. As a workaround data for a singular year can be filtered with filters = list(sinceTimePeriod = 2005, untilTimePeriod = 2005) or maybe with eurostat own suggestion filters = list(time = 2005-Q1, time = 2005-Q2, time = 2005-Q3, time = 2005-Q4) (which funnily enough was supposed to be not-accepted).

If quarterly data works differently from other types of data, I will clarify this in eurostat function documentation

pitkant commented 8 months ago

Closed with the CRAN release of package version 4.0.0