rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat

Cached datasets #257

Closed lz1nwm closed 8 months ago

lz1nwm commented 1 year ago

By default, eurostat caches a dataset the first time it is requested during a session, but it does not check whether the cached table contains all the data needed to serve subsequent requests to the same Eurostat table. I'm not sure if this is the intended behaviour. Please see the following example:

> get_eurostat('nama_10_gdp', filters = list(geo = c('EA'), 
+                                            unit = c('CP_MEUR'), 
+                                            na_item = c('B1GQ'),
+                                            time = c(2020:2022)), 
+              time_format = "date_last")
Table nama_10_gdp cached at ...
# A tibble: 3 x 6
  freq  unit    na_item geo   time          values
  <chr> <chr>   <chr>   <chr> <date>         <dbl>
1 A     CP_MEUR B1GQ    EA    2020-12-31 11456918.
2 A     CP_MEUR B1GQ    EA    2021-12-31 12318505.
3 A     CP_MEUR B1GQ    EA    2022-12-31 13338550.

> get_eurostat('nama_10_gdp', filters = list(geo = c('DE'), 
+                                            unit = c('CP_MEUR'), 
+                                            na_item = c('B1GQ'),
+                                            time = c(2020:2022)), 
+              time_format = "date_last")
Reading cache file ...
# A tibble: 3 x 6
  freq  unit    na_item geo   time          values
  <chr> <chr>   <chr>   <chr> <date>         <dbl>
1 A     CP_MEUR B1GQ    EA    2020-12-31 11456918.
2 A     CP_MEUR B1GQ    EA    2021-12-31 12318505.
3 A     CP_MEUR B1GQ    EA    2022-12-31 13338550.

antagomir commented 1 year ago

Ah, right. Probably not intended, and it should be fixed as soon as time allows.

Could you consider making a PR?

lz1nwm commented 1 year ago

Could you consider making a PR?

Unfortunately, I have no experience with PRs, but I'll see if I can do something.

antagomir commented 1 year ago

Here are some instructions: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request

pitkant commented 1 year ago

I can see the inconvenience, but I think it's debatable whether this is unintended behaviour or not. The point of caching is to make as few requests to Eurostat servers as possible, and a fix that constantly compared the cached file with the unfiltered remote file would create unnecessary web traffic between end users and Eurostat.

Caching can be easily disabled, although it is currently enabled by default. Maybe this is more of an issue related to documentation? Would adding some explicit messages when downloading and caching data make users more aware of this limitation?
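
For reference, a minimal sketch of how caching can be bypassed or refreshed per call. The wrapper names `fetch_uncached` and `fetch_refreshed` are hypothetical; `cache` and `update_cache` are arguments of `get_eurostat()`, and their exact behaviour should be checked against the documentation of the installed package version.

```r
# Thin wrappers around eurostat::get_eurostat(); the wrapper names are
# hypothetical, the `cache` and `update_cache` arguments are the package's.

# Bypass the cache: download the data instead of reading a cache file.
fetch_uncached <- function(id, ...) {
  eurostat::get_eurostat(id, ..., cache = FALSE)
}

# Keep caching enabled, but ask for the cached copy to be refreshed.
fetch_refreshed <- function(id, ...) {
  eurostat::get_eurostat(id, ..., update_cache = TRUE)
}

# Example call (needs network access and the eurostat package installed):
# fetch_uncached("nama_10_gdp",
#                filters = list(geo = "DE", unit = "CP_MEUR", na_item = "B1GQ"))
```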

lz1nwm commented 1 year ago

Just to clarify, my point was that I would expect the second query in my example to return an empty table and/or send the query to Eurostat. Basically, the cached table after the first query is only a small part of the dataset and obviously it could not be used for broader queries.

pitkant commented 1 year ago

Thank you for clarifying. The reason it works like that (whether it is a good reason or not, you decide) is that the query parameters are passed on to the request made to the Eurostat database. For some query parameters no filtering is done locally, whereas in other cases there is at least some processing done locally (if not filtering). An example of the latter is handling Eurostat date strings and turning them into date objects.

Yes, it would be possible to add some local checks before printing the output, to see whether the geo column contains the requested areas or whether the time frame is as requested; if not, we could print a message to the user or attempt to refresh the cached dataset. Or maybe the query could be saved with the cached dataset, and the cached data would only be used if the queries are identical.
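
A rough sketch of the last idea, using the cache only when the query is identical. This is not how the package is implemented: `query_cache`, `cache_key`, `get_cached` and `fake_fetch` are hypothetical names, the cache is a plain in-memory environment rather than files on disk, and `fetch` stands in for the actual download.

```r
# Hypothetical helpers, not part of the eurostat package.
query_cache <- new.env()

# Build a cache key from the dataset id and the filters, so that
# different filters never map to the same cached table.
cache_key <- function(id, filters = list()) {
  # Sort filter names and values so that the same query always
  # produces the same key regardless of argument order.
  filters <- filters[order(names(filters))]
  parts <- vapply(
    names(filters),
    function(nm) {
      paste0(nm, "=", paste(sort(as.character(filters[[nm]])), collapse = ","))
    },
    character(1)
  )
  paste(c(id, parts), collapse = "|")
}

# Return the cached table only when the same id and filters were used
# before; otherwise call `fetch` (a stand-in for the download) and store it.
get_cached <- function(id, filters, fetch) {
  key <- cache_key(id, filters)
  if (!exists(key, envir = query_cache, inherits = FALSE)) {
    assign(key, fetch(id, filters), envir = query_cache)
  }
  get(key, envir = query_cache, inherits = FALSE)
}

# Toy stand-in for the download step, so the sketch runs offline.
fake_fetch <- function(id, filters) {
  data.frame(geo = filters$geo, stringsAsFactors = FALSE)
}

ea <- get_cached("nama_10_gdp", list(geo = "EA"), fake_fetch)
de <- get_cached("nama_10_gdp", list(geo = "DE"), fake_fetch)
```

With keys built this way, the second call does not reuse the table cached for `geo = "EA"`, which is the behaviour the original example was missing.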

pitkant commented 1 year ago

As referenced in issue #258, it might make more sense to cache datasets that were downloaded without filtering rather than filtered ones. Then, if the complete dataset was cached locally, it could also be filtered locally, solving both issues in a single stroke.
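
A minimal sketch of what local filtering of a complete cached table could look like. `filter_local` is a hypothetical helper, not a function of the package, and the toy data frame only stands in for a full dataset with made-up values. It compares values as character strings, so filtering on a `time` column that has already been converted to dates would need extra handling.

```r
# Hypothetical helper: filter an already-cached, complete table locally.
# `filters` is a named list in the same shape as the `filters` argument
# in the example at the top, e.g. list(geo = "DE", unit = "CP_MEUR").
filter_local <- function(data, filters) {
  keep <- rep(TRUE, nrow(data))
  for (nm in names(filters)) {
    if (!nm %in% names(data)) {
      stop("No column named '", nm, "' in the cached table")
    }
    # Keep rows whose value in column `nm` is among the requested values.
    keep <- keep & as.character(data[[nm]]) %in% as.character(filters[[nm]])
  }
  data[keep, , drop = FALSE]
}

# Toy stand-in for a complete cached dataset (values are made up).
full <- data.frame(
  geo = c("EA", "EA", "DE", "DE"),
  time = c(2020, 2021, 2020, 2021),
  values = c(1, 2, 3, 4),
  stringsAsFactors = FALSE
)

de <- filter_local(full, list(geo = "DE", time = 2020:2021))
ea <- filter_local(full, list(geo = "EA"))
```

Because every request is answered from the same complete table, a query for `DE` after a query for `EA` needs no second download and cannot return the wrong rows.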

pitkant commented 8 months ago

Closed with the CRAN release of package version 4.0.0