mapme-initiative / mapme.biodiversity

Efficient analysis of spatial biodiversity datasets for global portfolios
https://mapme-initiative.github.io/mapme.biodiversity/dev
GNU General Public License v3.0
32 stars 7 forks source link

Add functionality to transform output to wide #70

Closed Jo-Schie closed 1 year ago

Jo-Schie commented 2 years ago

I noticed, that currently it is rather hard for users to get results back in an intuitive wide format (i.e. AOIs as units of observations and variables as columns, and geoinformation attached). I understand, that the outputs are rather heterogenous so we could discuss whether this feature really makes sense and whether it is possible to implement it easily or not. I think we aleady discussed it as well in the past but I do not really recall the pro's and con's, so maybe we could list them here @goergen95 ?

I think I also suggested at some point as an alternative to implement code in a script that shows how users could do that. Do we have this already in the documentation. If not I can provide some sample code here as well.

goergen95 commented 2 years ago

Currently, we store the results of indicator functions as a nested list column in an sf object. This allows us to support arbitraty shaped outputs of indicator routines without the need to copy the geometry column (which would be the case if we worked directly in a long format). For a wide format we would have to agree on a standardized template that each indicator (also future indicators) has to adhere to, which in my point of view is not really possible without substantial restrictions to which indicators we support. It also actually increases the work users have to put into the analysis. For example, if we have a temporal dimension with an indicator and use a wide format, users will actually have to split up the column names into the parameter and the date for some types of analysis.

As to your question to show users how to unnest the data, if you look into the examples of the online documentation, each indicator is unnested there. Sometimes this results in a long format (e.g. in the case for CHIRPS precipitation, for indicators without a temporal dimension we actually end up with a wide format (e.g. for elevation).

Jo-Schie commented 2 years ago

I understand the idea behind the nested columns and I think it makes sense if we think about the different shaped output structures. That is totally okay with me as the default format.

Nevertheless, if not supported in the package then I would argue that we need in the documentation some easy to replicate example on how to pivot data to wide and create a "normal" dataset again in the standard sfway with features as columns and the geometry attached to it. That is just the way that most users of geospatial data will expect their data to be and work with.

As soon as I find some time I'll make a suggestion and put it together with changes in the documentation suggested in #72.

goergen95 commented 1 year ago

Add a documentation section for nested columns to wide

goergen95 commented 1 year ago

Here is an interesting package for such use cases that we might consider https://cran.r-project.org/web/packages/cubble/index.html

Jo-Schie commented 1 year ago

The cobble is nice indeed. Let's keep an eye on this package and see how it develops over time.

goergen95 commented 1 year ago

So here is a small example of how I would approach re-shaping a portfolio to a wide format. The issue with this is that we cannot really predict which column names are exactly there and how the reshaping should be done exactly because that may vary with the use case. Additionally, it is not really considered good-practice to hide variables in the column names (e.g. the date for the precipitation indicator). Also, the pivot_wider method implemented for sf objects currently does not seem to work properly, thus you would have to store the geometries in another variable, do the re-shaping on a tibble/df and re-join with the geometries afterwards.

I think this is not really a good approach because:

Any suggestions would be most welcome!

library(sf)
#> Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(tidyr)
library(mapme.biodiversity)

temp_loc <- file.path(tempdir(), "mapme.biodiversity")
if(!file.exists(temp_loc)){
  dir.create(temp_loc)
  resource_dir <- system.file("res", package = "mapme.biodiversity")
  file.copy(resource_dir, temp_loc, recursive = TRUE)
}
#> [1] TRUE

(aoi <- system.file("extdata", "sierra_de_neiba_478140_2.gpkg",
                    package = "mapme.biodiversity") %>%
    read_sf() %>%
    init_portfolio(
      years = 2010,
      outdir = file.path(temp_loc, "res"),
      tmpdir = tempdir(),
      add_resources = FALSE,
      verbose = FALSE
    ) %>%
    get_resources(
      c("nelson_et_al", "chirps"),
      range_traveltime = c("5k_10k", "100k_200k")
    ) %>%
    calc_indicators(
      "traveltime", 
      stats_accessibility = c("min", "max"), 
      engine = "extract") %>%
    calc_indicators(
      "precipitation_chirps", 
      scales_spi = 3, 
      spi_prev_years = 8, 
      engine = "extract"))
#> Simple feature collection with 1 feature and 7 fields
#> Geometry type: POLYGON
#> Dimension:     XY
#> Bounding box:  xmin: -71.80933 ymin: 18.57668 xmax: -71.33201 ymax: 18.69931
#> Geodetic CRS:  WGS 84
#> # A tibble: 1 × 8
#>   WDPAID NAME            DESIG_ENG ISO3  assetid traveltime precipitation_chirps
#>    <dbl> <chr>           <chr>     <chr>   <int> <list>     <list>              
#> 1 478140 Sierra de Neiba National… DOM         1 <tibble>   <tibble [12 × 4]>   
#> # ℹ 1 more variable: geom <POLYGON [°]>

geometries <- aoi$geom
data <- st_drop_geometry(aoi)

# time-invariant indicator with two variables
data %>%
  select(-precipitation_chirps) %>%
  unnest(traveltime) %>%
  pivot_wider(
    names_from = distance,
    values_from = minutes_min:minutes_max
  )
#> # A tibble: 1 × 9
#>   WDPAID NAME   DESIG_ENG ISO3  assetid minutes_min_5k_10k minutes_min_100k_200k
#>    <dbl> <chr>  <chr>     <chr>   <int>              <dbl>                 <dbl>
#> 1 478140 Sierr… National… DOM         1                 35                   162
#> # ℹ 2 more variables: minutes_max_5k_10k <dbl>, minutes_max_100k_200k <dbl>

# time-variant indicator with 3 variables
data %>%
  select(-traveltime) %>%
  unnest(precipitation_chirps) %>%
  pivot_wider(
    names_from = dates,
    values_from = absolute:spi_3
  )
#> # A tibble: 1 × 41
#>   WDPAID NAME            DESIG_ENG     ISO3  assetid `absolute_2010-01-01`
#>    <dbl> <chr>           <chr>         <chr>   <int>                 <dbl>
#> 1 478140 Sierra de Neiba National Park DOM         1                    17
#> # ℹ 35 more variables: `absolute_2010-02-01` <dbl>,
#> #   `absolute_2010-03-01` <dbl>, `absolute_2010-04-01` <dbl>,
#> #   `absolute_2010-05-01` <dbl>, `absolute_2010-06-01` <dbl>,
#> #   `absolute_2010-07-01` <dbl>, `absolute_2010-08-01` <dbl>,
#> #   `absolute_2010-09-01` <dbl>, `absolute_2010-10-01` <dbl>,
#> #   `absolute_2010-11-01` <dbl>, `absolute_2010-12-01` <dbl>,
#> #   `anomaly_2010-01-01` <dbl>, `anomaly_2010-02-01` <dbl>, …

# sequential unnesting of parameters
data %>%
  unnest(traveltime) %>%
  pivot_wider(
    names_from = distance,
    values_from = minutes_min:minutes_max
  ) %>%
  unnest(precipitation_chirps) %>%
  pivot_wider(
    names_from = dates,
    values_from = absolute:spi_3
  )
#> # A tibble: 1 × 45
#>   WDPAID NAME   DESIG_ENG ISO3  assetid minutes_min_5k_10k minutes_min_100k_200k
#>    <dbl> <chr>  <chr>     <chr>   <int>              <dbl>                 <dbl>
#> 1 478140 Sierr… National… DOM         1                 35                   162
#> # ℹ 38 more variables: minutes_max_5k_10k <dbl>, minutes_max_100k_200k <dbl>,
#> #   `absolute_2010-01-01` <dbl>, `absolute_2010-02-01` <dbl>,
#> #   `absolute_2010-03-01` <dbl>, `absolute_2010-04-01` <dbl>,
#> #   `absolute_2010-05-01` <dbl>, `absolute_2010-06-01` <dbl>,
#> #   `absolute_2010-07-01` <dbl>, `absolute_2010-08-01` <dbl>,
#> #   `absolute_2010-09-01` <dbl>, `absolute_2010-10-01` <dbl>,
#> #   `absolute_2010-11-01` <dbl>, `absolute_2010-12-01` <dbl>, …

Created on 2023-05-15 with reprex v2.0.2

Jo-Schie commented 1 year ago

Hi @goergen95 . So from my understanding we agreed that there would not be a function inside the package on how to do this but rather a tutorial guide with one easy example. From my perspective the tutorial could start mentioning the reasons you added in your introduction to this comment about why we cannot add a function to do this, meaning basically that all Output datasets come in very distinct shapes at the moment which makes it difficult to implement. Maybe name a concrete example of two resources where the difference is very apparent.

I would then add a simplified example of your code above really focusing on the very beginner user. In that sense I think it is enough to show them one easy unnesting example. The sequential one already makes it overly complex for some and those who are interested in automating everything don't really need a tutorial after all. It is good, however to include a portfolio object with two resources as you did to avoid that someone just unnests everything. I would however suggest to use the select parameter instead of - because it is even more intuitive to understand.

What I'm missing in your example is the join again with the geometries and the export as eg geojson.

Also you might add some explanation text between all the steps like:

Then describe each step in detail. Really imagine you are describing this to someone who barely knows R. Advanced users don't need docu anyway or need it much less and are also happy with a very detailed tutorial because you can just read the relevant parts quickly.

I would also advice to store the results temporarily as objects and not use brackets to display the output of your code immediately. This just confuses basic users. Rather then this use "print()", "head()" or "str()" command to display the results.

And better separate the steps more instead of piping too much. Of course pipes are okay but from my opinion only for one logical step at a time. A very inexperienced user might be overwhelmed if several steps are done at once because you need to go through the code and reinterpret everything that happens. Just store temporary results as objects with <- and reuse them. No shame in doing that.

Hope this helps. Happy to also review and contribute to an example as soon as you have a draft version. IMHO this issue is ready to be implemented and solved.

goergen95 commented 1 year ago

Thanks a lot for your detailed comment! I gave it a try at #152. Would you mind reviewing the new vignette?

Jo-Schie commented 1 year ago

thanks @goergen95 . small adjustments and improvements suggested from our side:

goergen95 commented 1 year ago

Thanks for the feedback. I included a graphical representation of wide-vs-long with explicit emphasis on the issue arising if the geometry columns is copied. I did not so far include a QGIS screenshot (I don't feel that it would add much value) but however explicitly mentioned QGIS as a target software for exporting the data (which I feel adds important context for the whole exercise of the tutorial). I would appreciate if you can review the changes?

goergen95 commented 1 year ago

Pinging @Jo-Schie, this is ready for merge.