TIME_PERIOD instead of time

rideofyourlife commented 8 months ago

It seems that in one of the latest updates "time" parameter from get_eurostat() has been changed into "TIME_PERIOD". Obviously this is troublesome for those who have already written thousands of lines of code. The change seems to add no value for the package whatsoever. Is this because of changes that Eurostat implemented in its' API?

antagomir commented 8 months ago

Eurostat changed its API indeed. We will need to investigate if this comes from there directly; if yes then the main problem is not in the software and it will require some discussion on should we support conversions between old and new version (since those changes may have a reason..); or another option is that we forgot to add deprecation message (which should maintain the previous functionality). Now is the holiday break but we aim to come back to this asap. Meanwhile, further info and suggestions are welcome.

antaldaniel commented 8 months ago

I think that this will be an issue in every package that uses eurostat, it needed a fix in iotables and regions, too. The eurostat package does not use standard SDMX/Eurostat variable names, so I do not see any reason not to keep the time variable name instad of TIME_PERIOD. The logical solution would be to keep backward compatibility with the package, and offer to use standard variable names (but not only with TIME _PERIOD)

rideofyourlife commented 8 months ago

Not that I am whipping and hurrying you with the issue, but could we expect a decision in this regard relatively soon? I need to know whether I should start changing codes in files I use for work.

antagomir commented 8 months ago

Yes we can expect the decision soon. This week people are arriving from vacations and it takes some days to catch up. We will talk with @pitkant

pitkant commented 8 months ago

Thank you for opening this issue @rideofyourlife . Renaming time column to TIME_PERIOD was the single most breaking change of the 4.0.0 version. Such big changes can be expected when a major release is made, and jump from 3.8.3 to 4.0.0 is a major release according to Semantic Versioning practice. However, that does not mean that the update couldn't be handled better, with some features helping with backwards compatibility. Therefore I apologise for any inconvenience this change has caused.

While it does not solve your problem, here is my rationale of why the change was made: 1) Version 4.0.0 is our reaction to the changes made to the Eurostat API. When possible, I have attempted to retain the functionality and outputs of the package to be as close to version 3.x.x as possible. First, because I assume our users, you included, like how the package has functioned in the past and appreciate if they can continue to use it in a similar manner in the future as well. Second, because this allows for more efficient coding as old code can largely be left intact. In some cases this has not been possible when some resources in the Eurostat API have been outright removed - or are at a risk of being removed/deprecated at a moment's notice - and in some cases making minor adjustments has seemed to be reasonable thing to do. The reason why time has been changed to TIME_FORMAT falls to the latter category and I will attempt to explain below why it was done. 2) In the SDMX Data Structure Definition TIME_FORMAT is the name of the time dimension. When downloading data without filters, from get_eurostat_raw, not the JSON API, the data output has TIME_FORMAT as the name of the dimension / field / column (whatever you want to call it) of the tibble as well. This is a change from versions 3.x.x which used bulk download files. On the other hand data downloaded from JSON API accepts both time and time_format filters to refer to the same TIME_FORMAT dimension, but the returned JSON-STAT data object has a field called time. 3) According to a bit of advice from The Cathedral and the Bazaar "15. When writing gateway software of any kind, take pains to disturb the data stream as little as possible". Whatever is "gateway software" and "disturbing the data stream" actually mean is another thing, but I gathered the gist of it to be, in the context of the eurostat package, that it is preferable to return data objects that resemble the original data rather than making up column names ourselves. Therefore it was justified to use the new TIME_FORMAT column name, at least for data downloaded from the SDMX API. What this does not justify is changing the column name in data downloaded from JSON API, as is done in version 4.0.0. 4) The conflict lies here: I assume our users are accustomed to getting similar data outputs from both "bulk download" and "JSON API". Users are actually discouraged from using the get_eurostat_json function directly in the function manual, and are encouraged to do all their data downloading with the get_eurostat general purpose function, that directs the query to the correct API in the background. This made sense in the past as well as the outputs of the old bulk download and JSON API were similar, the output from the JSON API was different only in the sense that it could be filtered before downloading datasets. Now, this is not the case currently because of the divergence in column names: time in data downloaded from API Statistics and TIME_FORMAT in data downloaded from SDMX. 5) I made the assumption that it is easier for end users to have similar names for the time / TIME_FORMAT columns in all data, instead of having different names. Marginally, it also simplifies the code in the package so that there is no need for if-else patterns for internal data cleaning. 6) I could have done the opposite decision of forcing TIME_FORMAT column to be named time everywhere in the package as it has been before, but this felt wrong and anachronistic as the API was anyway completely different from what is was in versions 3.x.x and before. Also, if you download data from the Eurostat website it is going to be .tsv files that have a TIME_FORMAT column (in online table viewer I assume TIME stands for a label). Having parity with web interface and the package was something that I thought was important. 7) I could have (and probably should have) included some backwards compatibility option for forcing TIME_FORMAT columns to be time everywhere. It is, however, a relatively easy task in dplyr (or in base R) to do so. Also, having function arguments such as legacy_bulk_download = TRUE introduced in version 3.7.14 to get_eurostat() function felt hacky at best, even if a similar argument (such as use_legacy_time_column = TRUE) was added only for the initial 4.0.0 version with appropriate warning messages and be subsequently removed in some later release. Small things like that add to complexity and I wanted to avoid such things. 8) As version 4.0.0 was under development for several months and issues were open here for a long time, I incorrectly assumed that users would read the issues and follow the development. :) The time gap between 3.8.3 and 4.0.0 was probably too large and there could've been an intermediate version in between that could have been used for informing users about upcoming changes and for gathering feedback. On the other hand releasing experimental versions in CRAN feels a bit wrong to me, while it of course is done. What I have attempted is to release a finalised product, with experimental versions residing here in GitHub branches.

Of course we can release a new version of the package that could remedy this issue. I would personally think having an additional argument in the get_eurostat() function as described in 7) would be the most convenient approach to this. What do you think?

antagomir commented 8 months ago

If this helps our users to not break their existing workflows that would sound worth doing. Over the longer term things must be deprecated as well but a transition period can be very useful.

rideofyourlife commented 8 months ago

6. I could have done the opposite decision of forcing TIME_FORMAT column to be named time everywhere in the package as it has been before, but this felt wrong and anachronistic as the API was anyway completely different from what is was in versions 3.x.x and before. Also, if you download data from the Eurostat website it is going to be .tsv files that have a TIME_FORMAT column (in online table viewer I assume TIME stands for a label). Having parity with web interface and the package was something that I thought was important.

I do not share this opinion. "time" vs "TIME_PERIOD":

shorter,
cohesive with other variables (values, unit, geo, freq). For example we do not use "GEO_LOCATION" or "FREQ_PERIOD",
breaks continuity for the sake of doubtful usefulness of having parity with TSV files.

pitkant commented 8 months ago

That it is, although I wouldn't necessarily always equate shorter with better.
true. The difference between TIME_PERIOD and GEO_LOCATION and FREQ_PERIOD is though that TIME_PERIOD comes from the original datasets whereas GEO_LOCATION and FREQ_PERIOD do not. Additionally, FREQ_PERIOD and GEO_LOCATION are not in SDMX Metadata Registry whereas FREQ and GEO are. The point was not to arbitrarily come up with long uppercase names but to retain such name if it existed in the original dataset. However, it's important to consider whether, if for some reason they decided to change GEO to GEO_LOCATION and FREQ to FREQ_PERIOD in the original data, we should stick with the established column names that we have always used - even if they would then be only used in the eurostat package - or if we should start using the new column names from the data source. I don't believe they will but changes are always possible when it comes to APIs.
I agree with the breaking continuity part - that's an established fact - but it's not done just to achieve parity for no reason at all. The eurostat package has always downloaded and read .tsv files and returned them as tibbles in R environment. When the eurostat package downloads data from Eurostat API api/dissemination/sdmx/2.1/data/[DATASET_ID]?format=TSV&compressed=true, the returned tsv file is initially the same that users can interactively download from the Eurostat online data browser. (In theory, if the user could give manually downloaded .tsv files as an input to the eurostat package, the package could in theory handle it the same as it handles files downloaded via the API. This is something I thought about implementing but ultimately did not. It might be useful for someone in the future, since the original .tsv files would probably be better suited for long term storage and testing if the package creates similar output out of them similarly between different package versions. But that's another matter)

Again, apologies for the trouble caused. I do still feel that in the long term following and mirroring the API changes is the right thing to do. If you disagree with that maybe we can open a Discussion instead of an issue for discussing how to handle similar issues in the future.

In the meantime, I think we can fix this specific with a temporary argument to get_eurostat function that will output deprecation messages or warnings. My preferred solution would to have it be opt-in instead of opt-out, meaning that you would be required to update your scripts with this additional function argument, with the default being how eurostat 4.0.0 works now. Would that be ok?

pitkant commented 8 months ago

Alright, while I was debugging something else I noticed that what I said earlier in this discussion is at conflict with how the package actually currently works:

SDMX API: TIME_PERIOD column with years as dates

> eurostat::get_eurostat("road_go_ia_rc")
trying URL 'https://ec.europa.eu/eurostat/api/dissemination/sdmx/2.1/data/road_go_ia_rc?format=TSV&compressed=true'
downloaded 1.1 MB

Table road_go_ia_rc cached at /var/folders/f4/h_r3y60n0nn0qm6qx5hnx1s00000gn/T//Rtmpuwq0L9/eurostat/b546a73521fd8bb82d9ca4c6ad06c9f0.rds
# A tibble: 692,252 × 7
   freq  c_load c_unload unit    geo   TIME_PERIOD values
   <chr> <chr>  <chr>    <chr>   <chr> <date>       <dbl>
 1 A     AD     DE       MIO_TKM LU    2006-01-01      NA
 2 A     AD     DE       THS_T   LU    2006-01-01      NA
 3 A     AD     ES       MIO_TKM ES    1999-01-01      NA
 4 A     AD     ES       MIO_TKM ES    2000-01-01      NA
 5 A     AD     ES       MIO_TKM ES    2001-01-01      NA
 6 A     AD     ES       MIO_TKM ES    2002-01-01      NA
 7 A     AD     ES       MIO_TKM ES    2004-01-01       8
 8 A     AD     ES       MIO_TKM ES    2005-01-01      54
 9 A     AD     ES       MIO_TKM ES    2006-01-01      23
10 A     AD     ES       MIO_TKM ES    2007-01-01      NA
# ℹ 692,242 more rows
# ℹ Use `print(n = ...)` to see more rows

API Statistics: time column with just years (1999, 2000 etc.) as chr

> eurostat::get_eurostat_json("road_go_ia_rc")
# A tibble: 10,202,112 × 7
   freq  c_load    c_unload  unit  geo       time  values
   <chr> <chr>     <chr>     <chr> <chr>     <chr>  <int>
 1 A     EU27_2020 EU27_2020 THS_T EU27_2020 1999      NA
 2 A     EU27_2020 EU27_2020 THS_T EU27_2020 2000      NA
 3 A     EU27_2020 EU27_2020 THS_T EU27_2020 2001      NA
 4 A     EU27_2020 EU27_2020 THS_T EU27_2020 2002      NA
 5 A     EU27_2020 EU27_2020 THS_T EU27_2020 2003      NA
 6 A     EU27_2020 EU27_2020 THS_T EU27_2020 2004      NA
 7 A     EU27_2020 EU27_2020 THS_T EU27_2020 2005      NA
 8 A     EU27_2020 EU27_2020 THS_T EU27_2020 2006      NA
 9 A     EU27_2020 EU27_2020 THS_T EU27_2020 2007      NA
10 A     EU27_2020 EU27_2020 THS_T EU27_2020 2008  788401
# ℹ 10,202,102 more rows
# ℹ Use `print(n = ...)` to see more rows

Getting years as chr from get_eurostat_json is actually not that weird since functions like eurotime2date are run in get_eurostat, meaning that if get_eurostat_json is run directly the output will not be handled by those helper functions. This is not documented properly, although users are not encouraged to use get_eurostat_json directly either.

The reason for this is that tidy_eurostat is run only for .tsv files (retrieved with get_eurostat_raw()) and that takes advantage of knowing the names of certain columns in advance, namely TIME_PERIOD. Data retrieved through the API Statistics / "JSON API" does not have need for this kind of handling. I certainly had the intention of having the output be uniform across data retrieval methods but apparently it was not the case in the shipped code. This needs to be fixed and since the 4.0.0 was already shipped with get_eurostat_json returning data with time columns it might not be such a big reversal to make that the default method for .tsv files as well (with all that was said above still being worth considering in the longer perspective).

The conclusion being that I would make the new TIME_PERIOD column names opt-in rather than opt-out, which would give us some time to consider how to advance in future releases.

pitkant commented 4 months ago

Already posted on #288 but I'll put this here as well:

@rideofyourlife I have uploaded some WIP code in v4.1 branch. It enables users to make queries the same way as before but adds an additional parameter legacy.data.output to get_* functions that transforms dimensions names such as TIME_PERIOD and OBS_VALUE to time and values that were used before and removes extra columns such as freq, DATAFLOW and LAST UPDATE altogether.

Related to what I wrote earlier here:

The conclusion being that I would make the new TIME_PERIOD column names opt-in rather than opt-out, which would give us some time to consider how to advance in future releases.

I would maybe back out on this as it is much simpler to use Eurostat conventions internally and then change the data output just before the object is returned to the user, after all data labelling etc. tasks are done.

If you could test this and give some feedback on what you think it would be great!

rOpenGov / eurostat

TIME_PERIOD instead of time #285