Open rideofyourlife opened 8 months ago
Eurostat changed its API indeed. We will need to investigate if this comes from there directly; if yes then the main problem is not in the software and it will require some discussion on should we support conversions between old and new version (since those changes may have a reason..); or another option is that we forgot to add deprecation message (which should maintain the previous functionality). Now is the holiday break but we aim to come back to this asap. Meanwhile, further info and suggestions are welcome.
I think that this will be an issue in every package that uses eurostat, it needed a fix in iotables and regions, too. The eurostat package does not use standard SDMX/Eurostat variable names, so I do not see any reason not to keep the time variable name instad of TIME_PERIOD. The logical solution would be to keep backward compatibility with the package, and offer to use standard variable names (but not only with TIME _PERIOD)
Not that I am whipping and hurrying you with the issue, but could we expect a decision in this regard relatively soon? I need to know whether I should start changing codes in files I use for work.
Yes we can expect the decision soon. This week people are arriving from vacations and it takes some days to catch up. We will talk with @pitkant
Thank you for opening this issue @rideofyourlife . Renaming time column to TIME_PERIOD was the single most breaking change of the 4.0.0 version. Such big changes can be expected when a major release is made, and jump from 3.8.3 to 4.0.0 is a major release according to Semantic Versioning practice. However, that does not mean that the update couldn't be handled better, with some features helping with backwards compatibility. Therefore I apologise for any inconvenience this change has caused.
While it does not solve your problem, here is my rationale of why the change was made:
1) Version 4.0.0 is our reaction to the changes made to the Eurostat API. When possible, I have attempted to retain the functionality and outputs of the package to be as close to version 3.x.x as possible. First, because I assume our users, you included, like how the package has functioned in the past and appreciate if they can continue to use it in a similar manner in the future as well. Second, because this allows for more efficient coding as old code can largely be left intact. In some cases this has not been possible when some resources in the Eurostat API have been outright removed - or are at a risk of being removed/deprecated at a moment's notice - and in some cases making minor adjustments has seemed to be reasonable thing to do. The reason why time
has been changed to TIME_FORMAT
falls to the latter category and I will attempt to explain below why it was done.
2) In the SDMX Data Structure Definition TIME_FORMAT is the name of the time dimension. When downloading data without filters, from get_eurostat_raw
, not the JSON API, the data output has TIME_FORMAT
as the name of the dimension / field / column (whatever you want to call it) of the tibble as well. This is a change from versions 3.x.x which used bulk download files. On the other hand data downloaded from JSON API accepts both time
and time_format
filters to refer to the same TIME_FORMAT dimension, but the returned JSON-STAT data object has a field called time
.
3) According to a bit of advice from The Cathedral and the Bazaar "15. When writing gateway software of any kind, take pains to disturb the data stream as little as possible". Whatever is "gateway software" and "disturbing the data stream" actually mean is another thing, but I gathered the gist of it to be, in the context of the eurostat package, that it is preferable to return data objects that resemble the original data rather than making up column names ourselves. Therefore it was justified to use the new TIME_FORMAT column name, at least for data downloaded from the SDMX API. What this does not justify is changing the column name in data downloaded from JSON API, as is done in version 4.0.0.
4) The conflict lies here: I assume our users are accustomed to getting similar data outputs from both "bulk download" and "JSON API". Users are actually discouraged from using the get_eurostat_json
function directly in the function manual, and are encouraged to do all their data downloading with the get_eurostat
general purpose function, that directs the query to the correct API in the background. This made sense in the past as well as the outputs of the old bulk download and JSON API were similar, the output from the JSON API was different only in the sense that it could be filtered before downloading datasets. Now, this is not the case currently because of the divergence in column names: time
in data downloaded from API Statistics and TIME_FORMAT
in data downloaded from SDMX.
5) I made the assumption that it is easier for end users to have similar names for the time
/ TIME_FORMAT
columns in all data, instead of having different names. Marginally, it also simplifies the code in the package so that there is no need for if-else patterns for internal data cleaning.
6) I could have done the opposite decision of forcing TIME_FORMAT
column to be named time
everywhere in the package as it has been before, but this felt wrong and anachronistic as the API was anyway completely different from what is was in versions 3.x.x and before. Also, if you download data from the Eurostat website it is going to be .tsv files that have a TIME_FORMAT
column (in online table viewer I assume TIME stands for a label). Having parity with web interface and the package was something that I thought was important.
7) I could have (and probably should have) included some backwards compatibility option for forcing TIME_FORMAT
columns to be time
everywhere. It is, however, a relatively easy task in dplyr (or in base R) to do so. Also, having function arguments such as legacy_bulk_download = TRUE
introduced in version 3.7.14 to get_eurostat()
function felt hacky at best, even if a similar argument (such as use_legacy_time_column = TRUE
) was added only for the initial 4.0.0 version with appropriate warning messages and be subsequently removed in some later release. Small things like that add to complexity and I wanted to avoid such things.
8) As version 4.0.0 was under development for several months and issues were open here for a long time, I incorrectly assumed that users would read the issues and follow the development. :) The time gap between 3.8.3 and 4.0.0 was probably too large and there could've been an intermediate version in between that could have been used for informing users about upcoming changes and for gathering feedback. On the other hand releasing experimental versions in CRAN feels a bit wrong to me, while it of course is done. What I have attempted is to release a finalised product, with experimental versions residing here in GitHub branches.
Of course we can release a new version of the package that could remedy this issue. I would personally think having an additional argument in the get_eurostat()
function as described in 7) would be the most convenient approach to this. What do you think?
If this helps our users to not break their existing workflows that would sound worth doing. Over the longer term things must be deprecated as well but a transition period can be very useful.
6. I could have done the opposite decision of forcing
TIME_FORMAT
column to be namedtime
everywhere in the package as it has been before, but this felt wrong and anachronistic as the API was anyway completely different from what is was in versions 3.x.x and before. Also, if you download data from the Eurostat website it is going to be .tsv files that have aTIME_FORMAT
column (in online table viewer I assume TIME stands for a label). Having parity with web interface and the package was something that I thought was important.
I do not share this opinion. "time" vs "TIME_PERIOD":
api/dissemination/sdmx/2.1/data/[DATASET_ID]?format=TSV&compressed=true
, the returned tsv file is initially the same that users can interactively download from the Eurostat online data browser. (In theory, if the user could give manually downloaded .tsv files as an input to the eurostat package, the package could in theory handle it the same as it handles files downloaded via the API. This is something I thought about implementing but ultimately did not. It might be useful for someone in the future, since the original .tsv files would probably be better suited for long term storage and testing if the package creates similar output out of them similarly between different package versions. But that's another matter)Again, apologies for the trouble caused. I do still feel that in the long term following and mirroring the API changes is the right thing to do. If you disagree with that maybe we can open a Discussion instead of an issue for discussing how to handle similar issues in the future.
In the meantime, I think we can fix this specific with a temporary argument to get_eurostat
function that will output deprecation messages or warnings. My preferred solution would to have it be opt-in instead of opt-out, meaning that you would be required to update your scripts with this additional function argument, with the default being how eurostat 4.0.0 works now. Would that be ok?
Alright, while I was debugging something else I noticed that what I said earlier in this discussion is at conflict with how the package actually currently works:
SDMX API: TIME_PERIOD
column with years as dates
> eurostat::get_eurostat("road_go_ia_rc")
trying URL 'https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/road_go_ia_rc?format=TSV&compressed=true'
downloaded 1.1 MB
Table road_go_ia_rc cached at /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//Rtmpuwq0L9/eurostat/b546a73521fd8bb82d9ca4c6ad06c9f0.rds
# A tibble: 692,252 × 7
freq c_load c_unload unit geo TIME_PERIOD values
<chr> <chr> <chr> <chr> <chr> <date> <dbl>
1 A AD DE MIO_TKM LU 2006-01-01 NA
2 A AD DE THS_T LU 2006-01-01 NA
3 A AD ES MIO_TKM ES 1999-01-01 NA
4 A AD ES MIO_TKM ES 2000-01-01 NA
5 A AD ES MIO_TKM ES 2001-01-01 NA
6 A AD ES MIO_TKM ES 2002-01-01 NA
7 A AD ES MIO_TKM ES 2004-01-01 8
8 A AD ES MIO_TKM ES 2005-01-01 54
9 A AD ES MIO_TKM ES 2006-01-01 23
10 A AD ES MIO_TKM ES 2007-01-01 NA
# ℹ 692,242 more rows
# ℹ Use `print(n = ...)` to see more rows
API Statistics: time
column with just years (1999, 2000 etc.) as chr
> eurostat::get_eurostat_json("road_go_ia_rc")
# A tibble: 10,202,112 × 7
freq c_load c_unload unit geo time values
<chr> <chr> <chr> <chr> <chr> <chr> <int>
1 A EU27_2020 EU27_2020 THS_T EU27_2020 1999 NA
2 A EU27_2020 EU27_2020 THS_T EU27_2020 2000 NA
3 A EU27_2020 EU27_2020 THS_T EU27_2020 2001 NA
4 A EU27_2020 EU27_2020 THS_T EU27_2020 2002 NA
5 A EU27_2020 EU27_2020 THS_T EU27_2020 2003 NA
6 A EU27_2020 EU27_2020 THS_T EU27_2020 2004 NA
7 A EU27_2020 EU27_2020 THS_T EU27_2020 2005 NA
8 A EU27_2020 EU27_2020 THS_T EU27_2020 2006 NA
9 A EU27_2020 EU27_2020 THS_T EU27_2020 2007 NA
10 A EU27_2020 EU27_2020 THS_T EU27_2020 2008 788401
# ℹ 10,202,102 more rows
# ℹ Use `print(n = ...)` to see more rows
Getting years as chr from get_eurostat_json
is actually not that weird since functions like eurotime2date
are run in get_eurostat
, meaning that if get_eurostat_json
is run directly the output will not be handled by those helper functions. This is not documented properly, although users are not encouraged to use get_eurostat_json
directly either.
The reason for this is that tidy_eurostat
is run only for .tsv files (retrieved with get_eurostat_raw()
) and that takes advantage of knowing the names of certain columns in advance, namely TIME_PERIOD
. Data retrieved through the API Statistics / "JSON API" does not have need for this kind of handling. I certainly had the intention of having the output be uniform across data retrieval methods but apparently it was not the case in the shipped code. This needs to be fixed and since the 4.0.0 was already shipped with get_eurostat_json
returning data with time
columns it might not be such a big reversal to make that the default method for .tsv files as well (with all that was said above still being worth considering in the longer perspective).
The conclusion being that I would make the new TIME_PERIOD
column names opt-in rather than opt-out, which would give us some time to consider how to advance in future releases.
Already posted on #288 but I'll put this here as well:
@rideofyourlife I have uploaded some WIP code in v4.1 branch. It enables users to make queries the same way as before but adds an additional parameter legacy.data.output to get_* functions that transforms dimensions names such as TIME_PERIOD and OBS_VALUE to time and values that were used before and removes extra columns such as freq, DATAFLOW and LAST UPDATE altogether.
Related to what I wrote earlier here:
The conclusion being that I would make the new TIME_PERIOD column names opt-in rather than opt-out, which would give us some time to consider how to advance in future releases.
I would maybe back out on this as it is much simpler to use Eurostat conventions internally and then change the data output just before the object is returned to the user, after all data labelling etc. tasks are done.
If you could test this and give some feedback on what you think it would be great!
It seems that in one of the latest updates "time" parameter from get_eurostat() has been changed into "TIME_PERIOD". Obviously this is troublesome for those who have already written thousands of lines of code. The change seems to add no value for the package whatsoever. Is this because of changes that Eurostat implemented in its' API?