Closed jhollist closed 5 years ago
Could be a nice stars example, embedding a 1D array per point.
The depth values are identical for each point, so they don't form an array.
I'd try a nested df:
library(dplyr)
# Attaching package: ‘dplyr’
# The following objects are masked from ‘package:stats’:
# filter, lag
# The following objects are masked from ‘package:base’:
# intersect, setdiff, setequal, union
library(sf)
# Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3
sites_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_siteinfo.csv"
soil_chem_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_soilchem.csv"
nwca_sites_sf <- read.csv(sites_url) %>%
st_as_sf(coords = c("ANALYSIS_LON", "ANALYSIS_LAT"),
crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") %>%
filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
select(uid = UID, state = STATE)
nwca_carbon <- read.csv(soil_chem_url) %>%
filter(VISIT_NO == 1) %>%
filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
select(uid = UID, state = STATE, depth = DEPTH, bulk_density = BULK_DEN_DBF,
tot_carbon = TOT_CARBON)
library(tidyr)
nwca_carbon %>% group_by(uid) %>% nest() %>% left_join(nwca_sites_sf)
# Joining, by = "uid"
# # A tibble: 6 x 4
# uid data state geometry
# <int> <list> <fct> <POINT [°]>
# 1 2610 <tibble [3 × 4]> FL (-85.04428 30.1509)
# 2 2611 <tibble [2 × 4]> SC (-80.45277 32.53445)
# 3 2668 <tibble [4 × 4]> FL (-80.74191 28.69823)
# 4 2669 <tibble [3 × 4]> FL (-81.7658 29.66114)
# 5 2672 <tibble [3 × 4]> FL (-80.72645 28.27577)
# 6 2750 <tibble [1 × 4]> SC (-80.65901 32.39959)
Question is: what to do next?
The depth values shouldn't be identical in this case... Am I missing something?
That being said, the nested df looks like it will work really well and I think solves my original question (and I learned tidyr::nest!)
This is a common enough thing (I think) that some documentation of this would be useful. Not sure it rises to level of addition to sf
or another package. If I get time, I can write up a post for my neglected blog.
There is btw nothing principled against treating these data as features with 3D geometries. It just depends on what you want to do with them, next.
library(dplyr)
# Attaching package: ‘dplyr’
# The following objects are masked from ‘package:stats’:
# filter, lag
# The following objects are masked from ‘package:base’:
# intersect, setdiff, setequal, union
library(sf)
# Linking to GEOS 3.5.1, GDAL 2.2.1, proj.4 4.9.3
sites_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_siteinfo.csv"
soil_chem_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_soilchem.csv"
nwca_sites <- read.csv(sites_url) %>%
filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
select(uid = UID, state = STATE, lon = ANALYSIS_LON, lat = ANALYSIS_LAT)
nwca_carbon <- read.csv(soil_chem_url) %>%
filter(VISIT_NO == 1) %>%
filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
select(uid = UID, state = STATE, depth = DEPTH, bulk_density = BULK_DEN_DBF,
tot_carbon = TOT_CARBON)
library(tidyr)
j <- left_join(nwca_sites, nwca_carbon)
# Joining, by = c("uid", "state")
pts3D <- j %>% select(lon, lat, depth) %>% t %>% as.data.frame %>% lapply(st_point) %>% st_sfc(crs = 4326)
(sf <- j %>% select(-lon, -lat, -depth) %>% st_sf(pts3D))
# Simple feature collection with 16 features and 4 fields
# geometry type: POINT
# dimension: XYZ
# bbox: xmin: -85.04428 ymin: 28.27577 xmax: -80.45277 ymax: 32.53445
# epsg (SRID): 4326
# proj4string: +proj=longlat +datum=WGS84 +no_defs
# First 10 features:
# uid state bulk_density tot_carbon pts3D
# 1 2610 FL 0.87 2.11 POINT Z (-85.04428 30.1509 15)
# 2 2610 FL 1.08 1.14 POINT Z (-85.04428 30.1509 47)
# 3 2610 FL 1.33 0.63 POINT Z (-85.04428 30.1509 85)
# 4 2611 SC 0.60 3.45 POINT Z (-80.45277 32.53445...
# 5 2611 SC 1.27 0.52 POINT Z (-80.45277 32.53445...
# 6 2668 FL 1.11 0.85 POINT Z (-80.74191 28.69823 6)
# 7 2668 FL 1.10 0.21 POINT Z (-80.74191 28.69823...
# 8 2668 FL 1.15 0.72 POINT Z (-80.74191 28.69823...
# 9 2668 FL 1.50 0.18 POINT Z (-80.74191 28.69823...
# 10 2669 FL 0.30 18.91 POINT Z (-81.7658 29.66114 9)
I was thinking of it the other way, one row for each XY and an embedded 1D stars array (each in its own local depth-1D-geometry at the XY site). The stars array is multi-attributed, with uid, state, depth, bulk_density, tot_carbon
, and depth is the special 1D raster axis. That way there's a total separation of site from the profile topology, and no duplication of the XY sites - but, exactly as you say, it depends what's next.
I do think this is a good example, and worth discussing the various approaches!
@jhollist Though not entirely on-topic here, I use tidyr::nest()
and sf all the time together. It really facilitates having your spatial data in exactly the same place as your parameters which you have then the ability do all sorts of wonderful things with purrr
. That's what I originally meant on twitter. A trivial example here:
library(tidyr)
library(purrr)
library(broom)
nested_nwca_carbon <- nwca_carbon %>%
group_by(uid) %>%
nest() %>%
left_join(nwca_sites_sf) %>%
mutate(mean_carbon = map2_dbl(data, state, ~mean(.x$tot_carbon))) %>%
mutate(model = map2(data, state, ~lm(tot_carbon ~ depth, data = .x))) %>%
mutate(r.squared = map_dbl(model, ~glance(.x) %>% pull(r.squared)))
Which then can be immediately mapped to a plot:
library(ggplot2) ##dev
ggplot() +
geom_sf(data = nested_nwca_carbon, aes(colour = r.squared, size = mean_carbon))
Thanks for the examples, @boshek, especially the purrr. I've been very slowly adding that to my bag of tricks.
Given the discussion here and on twitter there are a lot of possibilities:
mudata2
package (suggested on twitter by @paleolimbot, but I haven't had a chance to look: https://github.com/paleolimbot/mudata)This has been an incredibly useful discussion for me. So thanks all!
Here's your above example using mudata2, in the off chance it's useful...it's basically trying to keep things together as best it can, including when reading/writing to a directory or JSON file. Because there are some irregularities with tidyverse methods and sf
objects, the locations table is just a tibble with an sfc
column rather than an sf
object...but eventually it will be an sf
(if there is exactly one geometry column).
library(dplyr)
library(sf)
#> Linking to GEOS 3.6.1, GDAL 2.1.3, proj.4 4.9.3
sites_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_siteinfo.csv"
soil_chem_url <- "https://www.epa.gov/sites/production/files/2016-10/nwca2011_soilchem.csv"
nwca_sites_sf <- read.csv(sites_url) %>%
st_as_sf(coords = c("ANALYSIS_LON", "ANALYSIS_LAT"),
crs = "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs") %>%
filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
select(uid = UID, state = STATE)
nwca_carbon <- read.csv(soil_chem_url) %>%
filter(VISIT_NO == 1) %>%
filter(UID %in% c(2610,2611,2668,2669,2672,2750)) %>%
select(uid = UID, state = STATE, depth = DEPTH, bulk_density = BULK_DEN_DBF,
tot_carbon = TOT_CARBON)
library(mudata2)
md <- mudata(
# data table is a parameter-long table with one row per measurement
# and a character "location" identifier
data = nwca_carbon %>%
mutate(location = paste(state, uid, sep = "_")) %>%
select(-uid, -state) %>%
tidyr::gather(-location, -depth, key = "param", value = "value"),
# locations table can be any tbl, including an sf object
# needs the same character "location" identifier
locations = nwca_sites_sf %>%
mutate(location = paste(state, uid, sep = "_"))
)
#> Guessing x columns: depth
md
#> A mudata object aligned along "depth"
#> distinct_datasets(): "default"
#> distinct_locations(): "FL_2610", "FL_2668" ... and 4 more
#> distinct_params(): "bulk_density", "tot_carbon"
#> src_tbls(): "data", "locations" ... and 3 more
#>
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#> dataset location param depth value
#> <chr> <chr> <chr> <int> <dbl>
#> 1 default FL_2610 bulk_density 15 0.870
#> 2 default FL_2610 bulk_density 47 1.08
#> 3 default FL_2610 bulk_density 85 1.33
#> 4 default SC_2611 bulk_density 28 0.600
#> 5 default SC_2611 bulk_density 125 1.27
#> 6 default FL_2668 bulk_density 6 1.11
md %>% select_locations(FL_2610)
#> A mudata object aligned along "depth"
#> distinct_datasets(): "default"
#> distinct_locations(): "FL_2610"
#> distinct_params(): "bulk_density", "tot_carbon"
#> src_tbls(): "data", "locations" ... and 3 more
#>
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#> dataset location param depth value
#> <chr> <chr> <chr> <int> <dbl>
#> 1 default FL_2610 bulk_density 15 0.870
#> 2 default FL_2610 bulk_density 47 1.08
#> 3 default FL_2610 bulk_density 85 1.33
#> 4 default FL_2610 tot_carbon 15 2.11
#> 5 default FL_2610 tot_carbon 47 1.14
#> 6 default FL_2610 tot_carbon 85 0.630
md %>% filter_locations(uid == 2610)
#> A mudata object aligned along "depth"
#> distinct_datasets(): "default"
#> distinct_locations(): "FL_2610"
#> distinct_params(): "bulk_density", "tot_carbon"
#> src_tbls(): "data", "locations" ... and 3 more
#>
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#> dataset location param depth value
#> <chr> <chr> <chr> <int> <dbl>
#> 1 default FL_2610 bulk_density 15 0.870
#> 2 default FL_2610 bulk_density 47 1.08
#> 3 default FL_2610 bulk_density 85 1.33
#> 4 default FL_2610 tot_carbon 15 2.11
#> 5 default FL_2610 tot_carbon 47 1.14
#> 6 default FL_2610 tot_carbon 85 0.630
md %>% filter_locations(state == "FL")
#> A mudata object aligned along "depth"
#> distinct_datasets(): "default"
#> distinct_locations(): "FL_2610", "FL_2668" ... and 2 more
#> distinct_params(): "bulk_density", "tot_carbon"
#> src_tbls(): "data", "locations" ... and 3 more
#>
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#> dataset location param depth value
#> <chr> <chr> <chr> <int> <dbl>
#> 1 default FL_2610 bulk_density 15 0.870
#> 2 default FL_2610 bulk_density 47 1.08
#> 3 default FL_2610 bulk_density 85 1.33
#> 4 default FL_2668 bulk_density 6 1.11
#> 5 default FL_2668 bulk_density 39 1.10
#> 6 default FL_2668 bulk_density 80 1.15
md %>% filter_data(depth < 10)
#> A mudata object aligned along "depth"
#> distinct_datasets(): "default"
#> distinct_locations(): "FL_2668", "FL_2669"
#> distinct_params(): "bulk_density", "tot_carbon"
#> src_tbls(): "data", "locations" ... and 3 more
#>
#> tbl_data() %>% head():
#> # A tibble: 4 x 5
#> dataset location param depth value
#> <chr> <chr> <chr> <int> <dbl>
#> 1 default FL_2668 bulk_density 6 1.11
#> 2 default FL_2669 bulk_density 9 0.300
#> 3 default FL_2668 tot_carbon 6 0.850
#> 4 default FL_2669 tot_carbon 9 18.9
md %>% tbl_data_wide()
#> # A tibble: 16 x 5
#> dataset location depth bulk_density tot_carbon
#> * <chr> <chr> <int> <dbl> <dbl>
#> 1 default FL_2610 15 0.870 2.11
#> 2 default FL_2610 47 1.08 1.14
#> 3 default FL_2610 85 1.33 0.630
#> 4 default FL_2668 6 1.11 0.850
#> 5 default FL_2668 39 1.10 0.210
#> 6 default FL_2668 80 1.15 0.720
#> 7 default FL_2668 100 1.50 0.180
#> 8 default FL_2669 9 0.300 18.9
#> 9 default FL_2669 21 0.710 4.13
#> 10 default FL_2669 125 1.12 0.850
#> 11 default FL_2672 13 0.0700 40.1
#> 12 default FL_2672 37 0.220 27.0
#> 13 default FL_2672 100 1.14 6.10
#> 14 default SC_2611 28 0.600 3.45
#> 15 default SC_2611 125 1.27 0.520
#> 16 default SC_2750 50 0.680 1.84
md %>% tbl_locations()
#> # A tibble: 6 x 5
#> dataset location uid state geometry
#> <chr> <chr> <int> <fct> <simple_feature>
#> 1 default FL_2610 2610 FL c(-85.04427656, 30.15089533)
#> 2 default SC_2611 2611 SC c(-80.45276742, 32.5344469)
#> 3 default FL_2668 2668 FL c(-80.741909, 28.6982345)
#> 4 default FL_2669 2669 FL c(-81.76579873, 29.66114007)
#> 5 default FL_2672 2672 FL c(-80.7264533, 28.27577)
#> 6 default SC_2750 2750 SC c(-80.65901056, 32.39959155)
tempdir <- tempfile()
write_mudata_dir(md, tempdir)
read_mudata_dir(tempdir)
#> A mudata object aligned along "depth"
#> distinct_datasets(): "default"
#> distinct_locations(): "FL_2610", "FL_2668" ... and 4 more
#> distinct_params(): "bulk_density", "tot_carbon"
#> src_tbls(): "data", "locations" ... and 3 more
#>
#> tbl_data() %>% head():
#> # A tibble: 6 x 5
#> dataset location param depth value
#> <chr> <chr> <chr> <int> <dbl>
#> 1 default FL_2610 bulk_density 15 0.870
#> 2 default FL_2610 bulk_density 47 1.08
#> 3 default FL_2610 bulk_density 85 1.33
#> 4 default SC_2611 bulk_density 28 0.600
#> 5 default SC_2611 bulk_density 125 1.27
#> 6 default FL_2668 bulk_density 6 1.11
Can this be closed?
I think so...
Moving discussion started at https://twitter.com/jhollist/status/977254736163491842
My question was what is best way to store spatial data with a single feature but multiple rows of associated data. My use case is for a sample location that has profile data. A soil core or water quality profile are both good examples of this.
A list column was proposed as a possible solution.
Some code to grab a sample dataset of soil profiles from the EPA National Coastal Wetland Assessment.
uid
is the common key