rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat
Other
234 stars 46 forks source link

Package grooming and performance improvements #230

Open dieghernan opened 2 years ago

dieghernan commented 2 years ago

Update


Hi @antagomir @pitkant:

Now that the package moved to a new version, let me share with you some potential improvements I would like to suggest:

Documentation

reference:
- title: Political
  desc: These functions return `sf` objects with political boundaries
  contents: has_concept("political")
- title: Infrastructures
  desc: These functions return `sf` objects with man-made features
  contents: has_concept("infrastructure")
...

Dependencies and tests

This is a sensitive issue. Currently the package has 19 Imports and 13 Suggests. Also, the package has more than 100 tests for get_eurostat_geospatial() and still the coverage is below 55%.

These two issues cause that the package is heavy when installing on a fresh system and also really slow to tests.

I think it is possible to reduce dependencies to 17+5 (get rid of 10 packages) with almost no costs. On the test side, more work is needed, but probably there are room for improvement.

As a suggestion, {giscoR} could help with the spatial part but {sf} would be still needed.

I see the Documentation part quite easy (I can even set an action for performing these tasks). Also I can work on reducing dependencies.

Let mw know your thoughs, regards

pitkant commented 2 years ago

Regarding documentation:

Regarding package dependencies and tests:

I think we could start a project board aiming for eurostat 4.0.0 release. The specifics on what that release would contain could be discussed under the project page.

dieghernan commented 2 years ago

I like the idea of the project board πŸ˜„

It can start with the Documentation part, and regarding the packages I have testes:

Imports:
    broom,
    classInt,
    countrycode,
    curl,
    dplyr,
    httr,
    magrittr,
    jsonlite,
    lubridate,
    [del] RColorBrewer,
    rappdirs,
    readr,
    RefManageR,
    stringi,
    stringr,
    tibble,
    tidyr,
    regions,
    rlang
Suggests:
    [del] covr,
    [del] Cairo,
    [del] ggplot2,
    knitr,
    [del] markdown,
    rmarkdown,
    [del] roxygen2,
    [del] rvest,
    testthat (>= 3.0.0),
    [del]  tmap,
    [del] usethis,
    sf,
    [del] here
    [new] sp

Most of this packages are not even used on the vignettes, but on the articles. We can install that in the GitHub Action, as the articles are not part of the package.

Regarding Cairo, this is a bit tricky. I think it is used for plots. Now the default (at least on pkgdown) is ragg, that provides high-quality plots. This one makes me doubt a bit, but It could be removed amending also the vignette.

antagomir commented 2 years ago

I agree with all, I very much like to idea of moving to md docs.

The suggested ways to remove dependencies seem good as well as work towards version 4.0.0 (what does R Extension say about version numbers, should this be 4.1.1 instead?)

One way to reduce dependencies is to move all spatial visualization stuff elsewhere, this has been under discussion for a longer time.

It also seems that there is now an emerging ecosystem of packages around eurostat, including also those from @antaldaniel - we could think if it would make sense to create an more comprehensive online resource on using and enrichning eurostat (and related?) data sources based on this set of interoperable packages. But this latter one would require a bit more planning and time.

jhuovari commented 2 years ago

Good ideas.

I think there are also performance issues in the get_eurostat() and particulary in the tidy_eurostat(). I tried to speed the code in the branch speed and managed to improve the performance significantly. Unfortunately, I never got time to finish the job, and now the branch is way behind the master. But there is potential... Unless someone else have already improved code meanwhile.

dieghernan commented 2 years ago

Hi @antagomir Regarding versioninf, on a quick search I see on https://cran.r-project.org/doc/manuals/r-release/R-exts.html:

The mandatory β€˜Version’ field gives the version of the package. This is a sequence of at least two (and usually three) non-negative integers separated by single β€˜.’ or β€˜-’ characters. The canonical form is as shown in the example, and a version such as β€˜0.01’ or β€˜0.01.0’ will be handled as if it were β€˜0.1-0’. It is not a decimal number, so for example 0.9 < 0.75 since 9 < 75.

On https://r-pkgs.org/description.html#version SemVer and X.Org are referred. On SemVer:

Given a version number MAJOR.MINOR.PATCH, increment the:

MAJOR version when you make incompatible API changes, MINOR version when you add functionality in a backwards compatible manner, and PATCH version when you make backwards compatible bug fixes.

Common practice when these changes are applied are just to increase major with trailing 0.

pitkant commented 1 year ago

@dieghernan I think this issue is again timely and closely related to #263 since several geospatial packages are going to be removed from CRAN or changed. I think now the time would be ripe to "move" get_eurostat_geospatial functionalities out of eurostat package and rely on giscoR instead.

dieghernan commented 1 year ago

I can prepare a PR switching to giscoR and including it on the Suggest field.

I think it should be pretty straightforward since the parameters of get_eurostat_geospatial are almost the same than those of gisco_get_nuts. Is that ok?

And if so, over which branch? I see some development coming recently on v4-dev

pitkant commented 1 year ago

I think v4-dev (or, if you like, another branch that would be ultimately merged into v4) would be good. These changes fall easily under the MAJOR changes label so it's justifiable to put it there instead of in a minor release.

I'm not certain what the schedule for the release of v4.0 should be, or how long we have to go through the open issues and solve them. Before October 2023 anyway, because of this geospatial package schedule and #243, which also happens to have its deadline in October.