maurolepore closed this issue 6 years ago
I think the simple solution is OK. These filters reminded me that columns related to wood specific gravity (anything "wsg") don't need to show in the master table, as we are retrieving wsg (wood density values) from a citable source (the global wood density database, via the BIOMASS package). We can discuss this later.
I changed my mind about how to approach this problem and I now think that the safest and cleanest way is to do as little as possible. That is, to give users all the information by storing the data in data/ as text. That way the different kinds of missing values will appear as entered (even in columns that are, for example, meant to be numeric). To use the tables we can then provide a helper that converts each column to the corresponding type. I'll soon demonstrate this with code.
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(allodb)
# All columns as text
allodb::wsg %>%
  filter(sample_size == "NRA")
#> # A tibble: 16 x 8
#> wsg_id family species wsg wsg_specificity sample_size site ref_id
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 <NA> Rhamna~ Ceanothu~ 0.67 <NA> NRA UCSC <NA>
#> 2 <NA> Grossu~ Ribes di~ 0.73 <NA> NRA UCSC <NA>
#> 3 <NA> Ericac~ Vacciniu~ 0.47 <NA> NRA UCSC <NA>
#> 4 <NA> Rosace~ Holodisc~ 0.71 <NA> NRA Wind~ <NA>
#> 5 <NA> Rosace~ Rubus le~ NRA <NA> NRA Wind~ <NA>
#> 6 <NA> Rosace~ Rubus sp~ NRA <NA> NRA Wind~ <NA>
#> 7 <NA> Ericac~ Vacciniu~ 0.47 <NA> NRA Wind~ <NA>
#> 8 <NA> Ericac~ Vacciniu~ 0.47 <NA> NRA Wind~ <NA>
#> 9 <NA> Ericac~ Vacciniu~ 0.47 <NA> NRA Wind~ <NA>
#> 10 <NA> Ericac~ Arctosta~ 0.72 <NA> NRA Yose~ <NA>
#> 11 <NA> Rhamna~ Ceanothu~ 0.67 <NA> NRA Yose~ <NA>
#> 12 <NA> Rhamna~ Ceanothu~ 0.67 <NA> NRA Yose~ <NA>
#> 13 <NA> Rhamna~ Ceanothu~ 0.67 <NA> NRA Yose~ <NA>
#> 14 <NA> Rosace~ Holodisc~ 0.71 <NA> NRA Yose~ <NA>
#> 15 <NA> Grossu~ Ribes ne~ NRA <NA> NRA Yose~ <NA>
#> 16 <NA> Grossu~ Ribes ro~ NRA <NA> NRA Yose~ <NA>
# Preserves different representations of missing values, e.g. "NRA".
# All columns of the type that is most suitable for computation**
# E.g.: Notice that `sample_size` is integer.
as_allodb(allodb::wsg)
#> # A tibble: 419 x 8
#> wsg_id family species wsg wsg_specificity sample_size site ref_id
#> * <chr> <chr> <chr> <chr> <chr> <int> <chr> <chr>
#> 1 <NA> Sapind~ Acer rub~ 0.49 <NA> NA Lill~ <NA>
#> 2 <NA> Sapind~ Acer sac~ 0.56 <NA> NA Lill~ <NA>
#> 3 <NA> Rosace~ Amelanch~ 0.66 <NA> NA Lill~ <NA>
#> 4 <NA> Rosace~ Amelanch~ 0.66 <NA> NA Lill~ <NA>
#> 5 <NA> Rosace~ Amelanch~ 0.66 <NA> NA Lill~ <NA>
#> 6 <NA> Annona~ Asimina ~ 0.47 <NA> NA Lill~ <NA>
#> 7 <NA> Betula~ Carpinus~ 0.58 <NA> NA Lill~ <NA>
#> 8 <NA> Juglan~ Carya al~ 0.62 <NA> 10 Lill~ <NA>
#> 9 <NA> Juglan~ Carya co~ 0.6 <NA> 10 Lill~ <NA>
#> 10 <NA> Juglan~ Carya gl~ 0.66 <NA> 10 Lill~ <NA>
#> # ... with 409 more rows
# Weird representations of missing values are lost (e.g. no more "NRA")
as_allodb(allodb::wsg) %>%
  filter(sample_size == "NRA")
#> # A tibble: 0 x 8
#> # ... with 8 variables: wsg_id <chr>, family <chr>, species <chr>,
#> # wsg <chr>, wsg_specificity <chr>, sample_size <int>, site <chr>,
#> # ref_id <chr>
# ** Notice a possible bug: `wsg` should be double -- not character.
Created on 2018-09-25 by the reprex package (v0.2.1)
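As a rough illustration of the kind of helper described above, here is a minimal sketch in base R. It is not the implementation of `as_allodb()`: the function name `convert_columns()`, the toy data, and the list of missing-value strings are assumptions made only for this example.

```r
# Hypothetical helper (not allodb's as_allodb()): convert text columns to the
# most suitable type, treating the given strings as missing values.
convert_columns <- function(data, na = c("", "NA", "NRA")) {
  data[] <- lapply(
    data,
    function(column) utils::type.convert(column, na.strings = na, as.is = TRUE)
  )
  data
}

# Toy data stored as text, loosely modelled on the `wsg` table above.
# The species names are placeholders, not real entries.
wsg_text <- data.frame(
  species = c("species_a", "species_b", "species_c"),
  wsg = c("NRA", "0.62", "0.49"),
  sample_size = c("NRA", "10", NA),
  stringsAsFactors = FALSE
)

wsg_typed <- convert_columns(wsg_text)
str(wsg_typed)
```

In this sketch `wsg` comes out as double and `sample_size` as integer, while `species` stays character. The "NRA" entries become plain `NA`, which is the loss of information discussed in this thread.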
The master data contains different representations of missing values, which are described here:
Now there is a problem. If we specify all possible representations of missing values (e.g. via the argument `na` to `read_csv()`), then we lose the information about what kind of missing value each one is.

A simple solution, I think, is to represent the kind of missing value in a new column. The original representations will all be coerced to NA, but we can still identify what kind of NA each one is using the new column.
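The idea could be sketched roughly as follows in base R. Everything here is an assumption for illustration and not part of allodb: the helper name `add_na_kind()`, the `_na_kind` suffix, the toy values, and the code "NI" (only "NRA" comes from the data discussed in this thread).

```r
# Hypothetical helper: record the kind of missing value in a companion column,
# then coerce every missing-value code in the original column to NA.
add_na_kind <- function(data, column, na_codes) {
  values <- data[[column]]
  is_code <- values %in% na_codes
  # New column: the original code where the value was missing, NA otherwise.
  data[[paste0(column, "_na_kind")]] <- ifelse(is_code, values, NA_character_)
  # Original column: codes become NA, then the rest is converted to its type.
  values[is_code] <- NA
  data[[column]] <- utils::type.convert(values, as.is = TRUE)
  data
}

toy <- data.frame(
  sample_size = c("10", "NRA", "NI", "25"),  # "NI" is a made-up code
  stringsAsFactors = FALSE
)

result <- add_na_kind(toy, "sample_size", na_codes = c("NRA", "NI"))
result
```

With this layout `sample_size` can be used directly in computations, and `sample_size_na_kind` still tells the two kinds of missing value apart.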
Here are the columns that have some representation of missing values, and the corresponding kind: