rOpenGov / eurostat

R tools for Eurostat data
http://ropengov.github.io/eurostat
Other
234 stars 46 forks source link

Leverage on giscoR #264

Closed dieghernan closed 1 year ago

dieghernan commented 1 year ago

Hi @antagomir @pitkant

This PR is related with #230 and #263

Basically now get_eurostat_geospatial is a wrapper of giscoR::gisco_get_nuts. I skimmed the function as much as I can so issues on that function would be addressed now in giscoR, reducing the burden of maintenance by eurostat package.

I also reviewed the dependencies removing a bunch of them, updated the vignettes and the actions and deleted unused files in docs/vignettes/revdep.

I checked the package and revdeps and all it’s fine. Summary of changes:

pitkant commented 1 year ago

Excellent work, thank you very much! I tested this version and everything seems to be working smoothly.

I was left thinking about issue #240 since now there's a difference between datasets read from the cache and datasets downloaded from GISCO. Like this:

Downloaded from GISCO by using resolution "20":

> sf <- get_eurostat_geospatial(
+ output_class = "sf",
+  resolution = "20",
+  nuts_level = "all"
+  )
Extracting data using giscoR package, please report issues on https://github.com/rOpenGov/giscoR/issues
Cache management as per giscoR. see 'giscoR::gisco_get_nuts()'
> sf
Simple feature collection with 2016 features and 10 fields
Geometry type: GEOMETRY
Dimension:     XY
Bounding box:  xmin: -63.08825 ymin: -21.39077 xmax: 55.83808 ymax: 71.11814
Geodetic CRS:  WGS 84
First 10 features:
                    NAME_LATN MOUNT_TYPE CNTR_CODE URBN_TYPE COAST_TYPE FID                  NUTS_NAME NUTS_ID LEVL_CODE                       geometry geo
1                     ESPAÑA           0        ES         0          0  ES                    ESPAÑA       ES         0 MULTIPOLYGON (((-7.03184 43...  ES
2             SUOMI / FINLAND          0        FI         0          0  FI            SUOMI / FINLAND      FI         0 MULTIPOLYGON (((28.92968 69...  FI
3                      ÍSLAND          0        IS         0          0  IS                     ÍSLAND      IS         0 POLYGON ((-22.0074 63.83599...  IS
4  REGIÃO AUTÓNOMA DOS AÇORES          0        PT         0          0 PT2 REGIÃO AUTÓNOMA DOS AÇORES     PT2         1 MULTIPOLYGON (((-25.6919 37... PT2
5                      FRANCE          0        FR         0          0  FR                     FRANCE      FR         0 MULTIPOLYGON (((2.60704 50....  FR
6                    HRVATSKA          0        HR         0          0  HR                   HRVATSKA      HR         0 MULTIPOLYGON (((16.59681 46...  HR
7                MAGYARORSZÁG          0        HU         0          0  HU               MAGYARORSZÁG      HU         0 POLYGON ((22.12108 48.37831...  HU
[...]

Read from cache by using resolution "60":

> sf <- get_eurostat_geospatial(
+ output_class = "sf",
+  resolution = "60",
+  nuts_level = "all"
+  )
Extracting data from eurostat::eurostat_geodata_60_2016
> sf
Simple feature collection with 2016 features and 10 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -61.841 ymin: -21.376 xmax: 55.85 ymax: 71.178
Geodetic CRS:  WGS 84
First 10 features:
   LEVL_CODE NUTS_ID URBN_TYPE CNTR_CODE               NAME_LATN               NUTS_NAME MOUNT_TYPE COAST_TYPE FID geo                       geometry
1          0      AL         0        AL               SHQIPËRIA               SHQIPËRIA          0          0  AL  AL MULTIPOLYGON (((19.831 42.4...
2          0      AT         0        AT              ÖSTERREICH              ÖSTERREICH          0          0  AT  AT MULTIPOLYGON (((15.754 48.8...
3          0      BE         0        BE         BELGIQUE-BELGIË         BELGIQUE-BELGIË          0          0  BE  BE MULTIPOLYGON (((5.238 51.26...
4          0      BG         0        BG                BULGARIA                БЪЛГАРИЯ          0          0  BG  BG MULTIPOLYGON (((22.675 44.2...
5          0      CH         0        CH SCHWEIZ/SUISSE/SVIZZERA SCHWEIZ/SUISSE/SVIZZERA          0          3  CH  CH MULTIPOLYGON (((8.67 47.685...
6          0      CY         0        CY                  KYPROS                  ΚΥΠΡΟΣ          0          0  CY  CY MULTIPOLYGON (((34.633 35.8...
[...]

So the difference being mainly in the position of the "geo" column. I think having "geometry" column as the last column of the "sf" "data.frame" object would be desirable. Another question is whether it's actually needed, but then again someone needed it in the first place since it was added there. Does @muuankarski have any input on this?

dieghernan commented 1 year ago

I wasn’t aware of #240, let me have a look

dieghernan commented 1 year ago

Hi @pitkant

After https://github.com/rOpenGov/eurostat/pull/264/commits/3a6e48fdeb8ee8540143191e40b987e09ad8ff15 the output of the spatial objects are consistent in the number, name and position of the columns. If I check for all years:

years <- c('2003','2006','2010','2013','2016','2021')
allyears <- lapply(years, function(x){

  tb <- suppressMessages(eurostat::get_eurostat_geospatial(country = "LU",
                                          nuts_level = 0, resolution = 20,
                                          update_cache = TRUE))

  # Prepare summary
  df <- data.frame(column = names(tb),
                   ncol = seq_len(ncol(tb)))
  df <- tidyr::pivot_wider(df, names_from = column, values_from = ncol)
  df$year <- x

  df
})

dplyr::bind_rows(allyears) |>
  dplyr::relocate(year, .before = 1) |>
  knitr::kable(caption = "Number indicates the position of the column in the object")
year id LEVL_CODE NUTS_ID CNTR_CODE NAME_LATN NUTS_NAME MOUNT_TYPE URBN_TYPE COAST_TYPE FID geo geometry
2003 1 2 3 4 5 6 7 8 9 10 11 12
2006 1 2 3 4 5 6 7 8 9 10 11 12
2010 1 2 3 4 5 6 7 8 9 10 11 12
2013 1 2 3 4 5 6 7 8 9 10 11 12
2016 1 2 3 4 5 6 7 8 9 10 11 12
2021 1 2 3 4 5 6 7 8 9 10 11 12

Number indicates the position of the column in the object

Created on 2023-07-31 with reprex v2.0.2

All the objects have the same columns in the same position. Note that, as mentioned in #240, for some years (i.e., from 2003 to 2013) there are columns that are not provided (i.e., MOUNT_TYPE and friends). In those cases, our output would return that columns but with NA , so it would be more consistent for users:

eurostat::get_eurostat_geospatial(year = 2010) |> 
  dplyr::glimpse()
#> Extracting data using giscoR package, please report issues on https://github.com/rOpenGov/giscoR/issues
#> Rows: 1,920
#> Columns: 12
#> $ id         <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ LEVL_CODE  <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,…
#> $ NUTS_ID    <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ CNTR_CODE  <chr> "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT", "AT",…
#> $ NAME_LATN  <chr> "Mittelburgenland", "Nordburgenland", "Südburgenland", "Mos…
#> $ NUTS_NAME  <chr> "Mittelburgenland", "Nordburgenland", "Südburgenland", "Mos…
#> $ MOUNT_TYPE <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ URBN_TYPE  <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ COAST_TYPE <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
#> $ FID        <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ geo        <chr> "AT111", "AT112", "AT113", "AT121", "AT122", "AT123", "AT12…
#> $ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((16.648 47.4..., MULTIPOLYGON (…

Created on 2023-07-31 with reprex v2.0.2

pitkant commented 1 year ago

Thank you for the fixes! The output looks nice and consistent now. I'm still not sure about adding duplicated columns in the dataset for supposedly easier joins, especially if geo and FID are both duplicates, but I guess it doesn't add too much overhead either. I added a note about geo column having the "Questioning" status

dieghernan commented 1 year ago

Thanks @pitkant , I consider this PR ready now.