Best practice for storing one-to-many data

jhollist commented 6 years ago

Moving discussion started at https://twitter.com/jhollist/status/977254736163491842

My question was what is best way to store spatial data with a single feature but multiple rows of associated data. My use case is for a sample location that has profile data. A soil core or water quality profile are both good examples of this.

A list column was proposed as a possible solution.

Some code to grab a sample dataset of soil profiles from the EPA National Coastal Wetland Assessment.

uid is the common key

library(dplyr)
library(sf)

sites_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_siteinfo.csv"
soil_chem_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_soilchem.csv"

nwca_sites_sf <- read.csv(sites_url) %>%
  st_as_sf(coords = c("ANALYSIS_LON", "ANALYSIS_LAT"),
           crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE)

nwca_carbon <- read.csv(soil_chem_url) %>%
  filter(VISIT_NO == 1) %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE, depth = DEPTH, bulk_density = BULK_DEN_DBF,
         tot_carbon = TOT_CARBON)

mdsumner commented 6 years ago

Could be a nice stars example, embedding a 1D array per point.

edzer commented 6 years ago

The depth values are identical for each point, so they don't form an array.

I'd try a nested df:

library(dplyr)
# Attaching package: ‘dplyr’
# The following objects are masked from ‘package:stats’:
#     filter, lag
# The following objects are masked from ‘package:base’:
#     intersect, setdiff, setequal, union
library(sf)
# Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3
sites_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_siteinfo.csv"
soil_chem_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_soilchem.csv"

nwca_sites_sf <- read.csv(sites_url) %>%
  st_as_sf(coords = c("ANALYSIS_LON", "ANALYSIS_LAT"),
           crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE)

nwca_carbon <- read.csv(soil_chem_url) %>%
  filter(VISIT_NO == 1) %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE, depth = DEPTH, bulk_density = BULK_DEN_DBF,
         tot_carbon = TOT_CARBON)

library(tidyr)
nwca_carbon %>% group_by(uid) %>% nest() %>% left_join(nwca_sites_sf)
# Joining, by = "uid"
# # A tibble: 6 x 4
#     uid data             state             geometry
#   <int> <list>           <fct>          <POINT [°]>
# 1  2610 <tibble [3 × 4]> FL     (-85.04428 30.1509)
# 2  2611 <tibble [2 × 4]> SC    (-80.45277 32.53445)
# 3  2668 <tibble [4 × 4]> FL    (-80.74191 28.69823)
# 4  2669 <tibble [3 × 4]> FL     (-81.7658 29.66114)
# 5  2672 <tibble [3 × 4]> FL    (-80.72645 28.27577)
# 6  2750 <tibble [1 × 4]> SC    (-80.65901 32.39959)

Question is: what to do next?

jhollist commented 6 years ago

The depth values shouldn't be identical in this case... Am I missing something?

That being said, the nested df looks like it will work really well and I think solves my original question (and I learned tidyr::nest!)

This is a common enough thing (I think) that some documentation of this would be useful. Not sure it rises to level of addition to sf or another package. If I get time, I can write up a post for my neglected blog.

edzer commented 6 years ago

There is btw nothing principled against treating these data as features with 3D geometries. It just depends on what you want to do with them, next.

library(dplyr)

# Attaching package: ‘dplyr’

# The following objects are masked from ‘package:stats’:

#     filter, lag

# The following objects are masked from ‘package:base’:

#     intersect, setdiff, setequal, union

library(sf)
# Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3

sites_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_siteinfo.csv"
soil_chem_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_soilchem.csv"

nwca_sites <- read.csv(sites_url) %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE, lon = ANALYSIS_LON, lat = ANALYSIS_LAT)

nwca_carbon <- read.csv(soil_chem_url) %>%
  filter(VISIT_NO == 1) %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE, depth = DEPTH, bulk_density = BULK_DEN_DBF,
         tot_carbon = TOT_CARBON)

library(tidyr)
j <- left_join(nwca_sites, nwca_carbon)
# Joining, by = c("uid", "state")

pts3D <- j %>% select(lon, lat, depth) %>% t %>% as.data.frame %>% lapply(st_point) %>% st_sfc(crs = 4326)

(sf <- j %>% select(-lon, -lat, -depth) %>% st_sf(pts3D))
# Simple feature collection with 16 features and 4 fields
# geometry type:  POINT
# dimension:      XYZ
# bbox:           xmin: -85.04428 ymin: 28.27577 xmax: -80.45277 ymax: 32.53445
# epsg (SRID):    4326
# proj4string:    +proj=longlat +datum=WGS84 +no_defs
# First 10 features:
#     uid state bulk_density tot_carbon                          pts3D
# 1  2610    FL         0.87       2.11 POINT Z (-85.04428 30.1509 15)
# 2  2610    FL         1.08       1.14 POINT Z (-85.04428 30.1509 47)
# 3  2610    FL         1.33       0.63 POINT Z (-85.04428 30.1509 85)
# 4  2611    SC         0.60       3.45 POINT Z (-80.45277 32.53445...
# 5  2611    SC         1.27       0.52 POINT Z (-80.45277 32.53445...
# 6  2668    FL         1.11       0.85 POINT Z (-80.74191 28.69823 6)
# 7  2668    FL         1.10       0.21 POINT Z (-80.74191 28.69823...
# 8  2668    FL         1.15       0.72 POINT Z (-80.74191 28.69823...
# 9  2668    FL         1.50       0.18 POINT Z (-80.74191 28.69823...
# 10 2669    FL         0.30      18.91  POINT Z (-81.7658 29.66114 9)

mdsumner commented 6 years ago

I was thinking of it the other way, one row for each XY and an embedded 1D stars array (each in its own local depth-1D-geometry at the XY site). The stars array is multi-attributed, with uid, state, depth, bulk_density, tot_carbon, and depth is the special 1D raster axis. That way there's a total separation of site from the profile topology, and no duplication of the XY sites - but, exactly as you say, it depends what's next.

I do think this is a good example, and worth discussing the various approaches!

boshek commented 6 years ago

@jhollist Though not entirely on-topic here, I use tidyr::nest() and sf all the time together. It really facilitates having your spatial data in exactly the same place as your parameters which you have then the ability do all sorts of wonderful things with purrr. That's what I originally meant on twitter. A trivial example here:

library(tidyr)
library(purrr)
library(broom)
nested_nwca_carbon <- nwca_carbon %>% 
  group_by(uid) %>% 
  nest() %>% 
  left_join(nwca_sites_sf) %>%
  mutate(mean_carbon = map2_dbl(data, state, ~mean(.x$tot_carbon))) %>%
  mutate(model = map2(data, state, ~lm(tot_carbon ~ depth, data = .x))) %>%
  mutate(r.squared = map_dbl(model, ~glance(.x) %>% pull(r.squared)))

Which then can be immediately mapped to a plot:

library(ggplot2) ##dev
ggplot() +
  geom_sf(data = nested_nwca_carbon, aes(colour = r.squared, size = mean_carbon))

jhollist commented 6 years ago

Thanks for the examples, @boshek, especially the purrr. I've been very slowly adding that to my bag of tricks.

Given the discussion here and on twitter there are a lot of possibilities:

nested df
depth on z
mudata2 package (suggested on twitter by @paleolimbot, but I haven't had a chance to look: https://github.com/paleolimbot/mudata)

This has been an incredibly useful discussion for me. So thanks all!

paleolimbot commented 6 years ago

Here's your above example using mudata2, in the off chance it's useful...it's basically trying to keep things together as best it can, including when reading/writing to a directory or JSON file. Because there are some irregularities with tidyverse methods and sf objects, the locations table is just a tibble with an sfc column rather than an sf object...but eventually it will be an sf (if there is exactly one geometry column).

library(dplyr)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3

sites_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_siteinfo.csv"
soil_chem_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_soilchem.csv"

nwca_sites_sf <- read.csv(sites_url) %>%
  st_as_sf(coords = c("ANALYSIS_LON", "ANALYSIS_LAT"),
           crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE)

nwca_carbon <- read.csv(soil_chem_url) %>%
  filter(VISIT_NO == 1) %>%
  filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
  select(uid = UID, state = STATE, depth = DEPTH, bulk_density = BULK_DEN_DBF,
         tot_carbon = TOT_CARBON)

library(mudata2)
md <- mudata(
  # data table is a parameter-long table with one row per measurement
  # and a character "location" identifier
  data = nwca_carbon %>%
    mutate(location = paste(state, uid, sep = "_")) %>%
    select(-uid, -state)  %>%
    tidyr::gather(-location, -depth, key = "param", value = "value"),

  # locations table can be any tbl, including an sf object
  # needs the same character "location" identifier
  locations = nwca_sites_sf %>%
    mutate(location = paste(state, uid, sep = "_"))
)
#> Guessing x columns: depth

md
#> A mudata object aligned along "depth"
#>   distinct_datasets():  "default"
#>   distinct_locations(): "FL_2610", "FL_2668" ... and 4 more
#>   distinct_params():    "bulk_density", "tot_carbon"
#>   src_tbls():           "data", "locations" ... and 3 more
#> 
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#>   dataset location param        depth value
#>   <chr>   <chr>    <chr>        <int> <dbl>
#> 1 default FL_2610  bulk_density    15 0.870
#> 2 default FL_2610  bulk_density    47 1.08 
#> 3 default FL_2610  bulk_density    85 1.33 
#> 4 default SC_2611  bulk_density    28 0.600
#> 5 default SC_2611  bulk_density   125 1.27 
#> 6 default FL_2668  bulk_density     6 1.11
md %>% select_locations(FL_2610)
#> A mudata object aligned along "depth"
#>   distinct_datasets():  "default"
#>   distinct_locations(): "FL_2610"
#>   distinct_params():    "bulk_density", "tot_carbon"
#>   src_tbls():           "data", "locations" ... and 3 more
#> 
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#>   dataset location param        depth value
#>   <chr>   <chr>    <chr>        <int> <dbl>
#> 1 default FL_2610  bulk_density    15 0.870
#> 2 default FL_2610  bulk_density    47 1.08 
#> 3 default FL_2610  bulk_density    85 1.33 
#> 4 default FL_2610  tot_carbon      15 2.11 
#> 5 default FL_2610  tot_carbon      47 1.14 
#> 6 default FL_2610  tot_carbon      85 0.630
md %>% filter_locations(uid == 2610)
#> A mudata object aligned along "depth"
#>   distinct_datasets():  "default"
#>   distinct_locations(): "FL_2610"
#>   distinct_params():    "bulk_density", "tot_carbon"
#>   src_tbls():           "data", "locations" ... and 3 more
#> 
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#>   dataset location param        depth value
#>   <chr>   <chr>    <chr>        <int> <dbl>
#> 1 default FL_2610  bulk_density    15 0.870
#> 2 default FL_2610  bulk_density    47 1.08 
#> 3 default FL_2610  bulk_density    85 1.33 
#> 4 default FL_2610  tot_carbon      15 2.11 
#> 5 default FL_2610  tot_carbon      47 1.14 
#> 6 default FL_2610  tot_carbon      85 0.630
md %>% filter_locations(state == "FL")
#> A mudata object aligned along "depth"
#>   distinct_datasets():  "default"
#>   distinct_locations(): "FL_2610", "FL_2668" ... and 2 more
#>   distinct_params():    "bulk_density", "tot_carbon"
#>   src_tbls():           "data", "locations" ... and 3 more
#> 
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#>   dataset location param        depth value
#>   <chr>   <chr>    <chr>        <int> <dbl>
#> 1 default FL_2610  bulk_density    15 0.870
#> 2 default FL_2610  bulk_density    47 1.08 
#> 3 default FL_2610  bulk_density    85 1.33 
#> 4 default FL_2668  bulk_density     6 1.11 
#> 5 default FL_2668  bulk_density    39 1.10 
#> 6 default FL_2668  bulk_density    80 1.15
md %>% filter_data(depth < 10)
#> A mudata object aligned along "depth"
#>   distinct_datasets():  "default"
#>   distinct_locations(): "FL_2668", "FL_2669"
#>   distinct_params():    "bulk_density", "tot_carbon"
#>   src_tbls():           "data", "locations" ... and 3 more
#> 
#> tbl_data() %>% head():
#> # A tibble: 4 x 5
#>   dataset location param        depth  value
#>   <chr>   <chr>    <chr>        <int>  <dbl>
#> 1 default FL_2668  bulk_density     6  1.11 
#> 2 default FL_2669  bulk_density     9  0.300
#> 3 default FL_2668  tot_carbon       6  0.850
#> 4 default FL_2669  tot_carbon       9 18.9
md %>% tbl_data_wide()
#> # A tibble: 16 x 5
#>    dataset location depth bulk_density tot_carbon
#>  * <chr>   <chr>    <int>        <dbl>      <dbl>
#>  1 default FL_2610     15       0.870       2.11 
#>  2 default FL_2610     47       1.08        1.14 
#>  3 default FL_2610     85       1.33        0.630
#>  4 default FL_2668      6       1.11        0.850
#>  5 default FL_2668     39       1.10        0.210
#>  6 default FL_2668     80       1.15        0.720
#>  7 default FL_2668    100       1.50        0.180
#>  8 default FL_2669      9       0.300      18.9  
#>  9 default FL_2669     21       0.710       4.13 
#> 10 default FL_2669    125       1.12        0.850
#> 11 default FL_2672     13       0.0700     40.1  
#> 12 default FL_2672     37       0.220      27.0  
#> 13 default FL_2672    100       1.14        6.10 
#> 14 default SC_2611     28       0.600       3.45 
#> 15 default SC_2611    125       1.27        0.520
#> 16 default SC_2750     50       0.680       1.84
md %>% tbl_locations()
#> # A tibble: 6 x 5
#>   dataset location   uid state geometry                    
#>   <chr>   <chr>    <int> <fct> <simple_feature>            
#> 1 default FL_2610   2610 FL    c(-85.04427656, 30.15089533)
#> 2 default SC_2611   2611 SC    c(-80.45276742, 32.5344469) 
#> 3 default FL_2668   2668 FL    c(-80.741909, 28.6982345)   
#> 4 default FL_2669   2669 FL    c(-81.76579873, 29.66114007)
#> 5 default FL_2672   2672 FL    c(-80.7264533, 28.27577)    
#> 6 default SC_2750   2750 SC    c(-80.65901056, 32.39959155)

tempdir <- tempfile()
write_mudata_dir(md, tempdir)
read_mudata_dir(tempdir)
#> A mudata object aligned along "depth"
#>   distinct_datasets():  "default"
#>   distinct_locations(): "FL_2610", "FL_2668" ... and 4 more
#>   distinct_params():    "bulk_density", "tot_carbon"
#>   src_tbls():           "data", "locations" ... and 3 more
#> 
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#>   dataset location param        depth value
#>   <chr>   <chr>    <chr>        <int> <dbl>
#> 1 default FL_2610  bulk_density    15 0.870
#> 2 default FL_2610  bulk_density    47 1.08 
#> 3 default FL_2610  bulk_density    85 1.33 
#> 4 default SC_2611  bulk_density    28 0.600
#> 5 default SC_2611  bulk_density   125 1.27 
#> 6 default FL_2668  bulk_density     6 1.11

tim-salabim commented 5 years ago

Can this be closed?

jhollist commented 5 years ago

I think so...

r-spatial / discuss

Best practice for storing one-to-many data #20