rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat

Dubious value added of the 'freq' variable #288

Open rideofyourlife opened 8 months ago

rideofyourlife commented 8 months ago

Many datasets that are available in only one frequency (such as namq_10_gdp, sts_inpr_m etc.) were given a new variable, "freq". I generally understand the idea behind it, but in practice it has mostly meant adding an unnecessary %>% select(-freq) step to the majority of the code I write.
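For reference, the extra step described above can be made conditional, so that freq is dropped only when it carries no information. This is a minimal base R sketch on a toy data frame; the helper name drop_constant_freq and all sample values are invented for illustration and are not part of the eurostat package.

```r
# Toy data shaped roughly like the new output format (values invented).
gdp <- data.frame(
  freq        = c("Q", "Q", "Q"),
  geo         = c("FI", "FI", "SE"),
  TIME_PERIOD = c("2023-Q1", "2023-Q2", "2023-Q1"),
  OBS_VALUE   = c(1.0, 1.2, 0.8)
)

# Hypothetical helper: drop `freq` only if it holds a single distinct value.
drop_constant_freq <- function(df) {
  if ("freq" %in% names(df) && length(unique(df$freq)) == 1) {
    df$freq <- NULL
  }
  df
}

gdp_clean <- drop_constant_freq(gdp)

# A data frame that mixes frequencies keeps its `freq` column.
mixed <- rbind(
  gdp,
  data.frame(freq = "A", geo = "FI", TIME_PERIOD = "2023", OBS_VALUE = 4.1)
)
mixed_clean <- drop_constant_freq(mixed)
```

With this approach the column disappears for single-frequency datasets but survives wherever it actually distinguishes rows.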

Does anyone else have similar thoughts?

pitkant commented 8 months ago

This is intentional behaviour but if many people find this annoying you can report it under this issue and we can reconsider.

antaldaniel commented 8 months ago

Statistical agencies worldwide follow similar standards for treating metadata, and metadata in this case is there to avoid unforeseen logical errors when joining or linking data. I think the freq variable is present when there are similar statistical products or datasets available with the same variables but different frequencies. In that case, joining without a frequency adjustment results in a hard-to-find logical error. The freq variable serves the same purpose as the unit variable: you really want to avoid unknowingly dividing euros by thousands of euros, or multiplying annual values by quarterly values in a chain.
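The kind of silent error meant here can be illustrated with a small base R sketch. The two data frames and their numbers are invented; they only stand in for an annual and a quarterly dataset that share the same variables.

```r
# Invented annual and quarterly figures for the same country and year.
annual <- data.frame(freq = "A", geo = "FI", year = 2023, value = 400)
quarterly <- data.frame(
  freq = "Q", geo = "FI", year = 2023,
  quarter = 1:4, value = c(95, 100, 102, 103)
)

# Joining without `freq` silently pairs the annual value with every quarter.
joined_blind <- merge(
  annual[, c("geo", "year", "value")],
  quarterly[, c("geo", "year", "quarter", "value")],
  by = c("geo", "year"), suffixes = c("_annual", "_quarterly")
)

# Keeping `freq` in the join key exposes the mismatch: no rows match.
joined_checked <- merge(annual, quarterly, by = c("freq", "geo", "year"))

nrow(joined_blind)
nrow(joined_checked)
```

The first join succeeds without any warning even though each row mixes an annual and a quarterly figure, while the second returns an empty result that is hard to overlook.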

rideofyourlife commented 7 months ago

"Statistical agencies worldwide follow similar standards for treating metadata, and metadata in this case is there to avoid unforeseen logical errors when joining or linking data."

Well, we are all aware of that. At least I hope that is the case.

"In that case, joining without a frequency adjustment results in a hard-to-find logical error. The freq variable serves the same purpose as the unit variable: you really want to avoid unknowingly dividing euros by thousands of euros, or multiplying annual values by quarterly values in a chain."

This assumes that users are somewhat unaware of what they are doing. It seems to me that this technique is a triumph of form over content.

pitkant commented 4 months ago

@rideofyourlife I have uploaded some WIP code to the v4.1 branch. It lets users make queries the same way as before but adds a parameter, legacy.data.output, to the get_* functions. The parameter transforms dimension names such as TIME_PERIOD and OBS_VALUE to time and values, which were used before, and removes extra columns such as freq, DATAFLOW and LAST UPDATE altogether.

If you could test this and give some feedback, that would be great!
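To make the described transformation concrete, here is a rough base R sketch of the renaming and column removal on a toy data frame. The function to_legacy and all sample values are hypothetical; this is not the package's implementation, only an illustration of the behaviour described above.

```r
# Toy data in the new output format (all values invented).
new_format <- data.frame(
  DATAFLOW      = "ESTAT:NAMQ_10_GDP(1.0)",
  `LAST UPDATE` = "01/01/24 11:00:00",
  freq          = "Q",
  geo           = c("FI", "SE"),
  TIME_PERIOD   = c("2023-Q1", "2023-Q1"),
  OBS_VALUE     = c(1.0, 0.8),
  check.names   = FALSE  # keep the space in "LAST UPDATE"
)

# Hypothetical stand-in for the described behaviour, not the package's code.
to_legacy <- function(df) {
  # Restore the older column names.
  names(df)[names(df) == "TIME_PERIOD"] <- "time"
  names(df)[names(df) == "OBS_VALUE"] <- "values"
  # Remove the extra metadata columns.
  df[, !(names(df) %in% c("freq", "DATAFLOW", "LAST UPDATE")), drop = FALSE]
}

legacy <- to_legacy(new_format)
names(legacy)
```

In the package itself, the equivalent result is meant to come from the legacy.data.output parameter rather than from a separate helper like this one.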

rideofyourlife commented 3 months ago

I have already laboriously replaced "time" with "TIME_PERIOD" in all my code, so having "time" back is not as essential now as it was before the recent change. Still, where do I use this legacy.data.output? In which function?

pitkant commented 3 months ago

I'm sorry for the laborious process. In version 4.1, setting legacy.data.output = TRUE in get_eurostat() should return a data.frame / tibble similar to what was returned in version 3.8.3 and earlier.

rideofyourlife commented 3 months ago

Ah, yes: it works. For some reason it is just not suggested by RStudio's autocompletion while typing.