Store different representations of missing values in a new column?

maurolepore commented 6 years ago

The master data contains different representations of missing values, which are described here:

Now there is a problem. If we declare every possible representation of missing values (e.g. via the argument na to read_csv()), then we lose the information about which kind of missing value each one is.
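To make the problem concrete, here is a minimal sketch with made-up data (not the real master table). It uses base `read.csv()` and its `na.strings` argument as a stand-in for `read_csv()` and `na`; the column values are hypothetical.

```r
# Hypothetical data; "NRA" and "NI" stand for two kinds of missing value.
csv <- "equation_id,sample_size
a,10
b,NRA
c,NI
d,NA"

# Declaring every representation as missing collapses them all into NA.
collapsed <- read.csv(
  text = csv,
  na.strings = c("NA", "NRA", "NI"),
  stringsAsFactors = FALSE
)

# The column is now usable as a number, but "NRA", "NI" and a plain NA
# can no longer be told apart.
collapsed$sample_size
#> [1] 10 NA NA NA
```

After reading the data this way, nothing in the column records which rows were originally "NRA" and which were "NI".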

A simple solution, I think, is to represent the kind of missing value as a new column. The original representations will all be coerced to NA but we could identify what kind of NA it is using the new column.
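A rough sketch of the new-column idea, assuming made-up data and a hypothetical column name `sample_size_na_kind` (neither is in the master table), could look like this:

```r
# Hypothetical values as they might be entered in the master data.
raw <- data.frame(
  equation_id = c("a", "b", "c", "d"),
  sample_size = c("10", "NRA", "NI", NA),
  stringsAsFactors = FALSE
)

# Representations that mean "missing" (an assumption for this sketch).
na_codes <- c("NRA", "NI")

# Record the kind of missing value before coercing: the code as entered,
# "NA" for a plain NA, and NA where a real value is present.
kind <- ifelse(raw$sample_size %in% na_codes, raw$sample_size, NA_character_)
kind[is.na(raw$sample_size)] <- "NA"
raw$sample_size_na_kind <- kind

# Coerce the original column; every missing representation becomes NA.
raw$sample_size[raw$sample_size %in% na_codes] <- NA
raw$sample_size <- as.integer(raw$sample_size)

raw
```

The cost of this approach is one extra column for every column that can hold more than one kind of missing value.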

Here are the columns that have some representation of missing values, and the corresponding kind:

$wsg
[1] "NRA"

$wsg_id
[1] NA

$wsg_specificity
[1] NA

$c
[1] NA

$d
[1] NA

$dbh_min_cm
[1] "NI"

$dbh_max_cm
[1] NA   "NI"

$sample_size
[1] NA    "NRA"

$equation_id
[1] NA

$regression_model
[1] NA

$other_equations_tested
[1] NA    "NRA"

$log_biomass
[1] NA

$bias_corrected
[1] NA

$bias_correction_factor
[1] NA    "NRA"

$notes_fitting_model
[1] NA

$development_species
[1] NA

$ref_id
[1] NA

$wsg_source
[1] NA

$ref_wsg_id
[1] NA

$original_data_availability
[1] NA

$notes_to_consider
[1] NA

$warning
[1] NA

gonzalezeb commented 6 years ago

I think the simple solution is ok. This filter reminded me that columns related to wood specific gravity (anything "wsg") don't need to show in the master table, as we are retrieving wsg (wood density values) from a citable source (global wood density database - BIOMASS package). We can discuss later.

maurolepore commented 6 years ago

I changed my mind about how to approach this problem and I now think that the safest and cleanest way is to do as little as possible. That is, to give users all the information by storing the data in data/ as text. That way the different kinds of missing values will appear as entered (even in columns that are, for example, meant to be numeric). To use the tables we can then provide a helper that converts each column to the corresponding type. I'll soon demonstrate this with code.
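As a rough sketch of that idea (not the allodb implementation): the table below is made up, and `convert_types()` and its `na_codes` argument are hypothetical names. It relies only on `utils::type.convert()` from base R.

```r
# Hypothetical table stored as text, so missing values appear as entered.
wsg_text <- data.frame(
  species = c("sp1", "sp2", "sp3"),
  wsg = c("0.67", "NRA", "0.47"),
  sample_size = c("10", "NRA", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert each text column to the type that suits
# computation, treating the given codes as NA.
convert_types <- function(data, na_codes = c("NA", "NRA", "NI")) {
  data[] <- lapply(
    data,
    function(x) utils::type.convert(x, na.strings = na_codes, as.is = TRUE)
  )
  data
}

converted <- convert_types(wsg_text)

# Column types after conversion; the text version is left untouched.
vapply(converted, class, character(1))
```

Users who care about the kind of missing value can work with the text version, while users who want to compute can call the helper and accept that all kinds collapse to NA.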

maurolepore commented 6 years ago

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(allodb)

# All columns as text
allodb::wsg %>% 
  filter(sample_size == "NRA")
#> # A tibble: 16 x 8
#>    wsg_id family  species   wsg   wsg_specificity sample_size site  ref_id
#>    <chr>  <chr>   <chr>     <chr> <chr>           <chr>       <chr> <chr> 
#>  1 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         UCSC  <NA>  
#>  2 <NA>   Grossu~ Ribes di~ 0.73  <NA>            NRA         UCSC  <NA>  
#>  3 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         UCSC  <NA>  
#>  4 <NA>   Rosace~ Holodisc~ 0.71  <NA>            NRA         Wind~ <NA>  
#>  5 <NA>   Rosace~ Rubus le~ NRA   <NA>            NRA         Wind~ <NA>  
#>  6 <NA>   Rosace~ Rubus sp~ NRA   <NA>            NRA         Wind~ <NA>  
#>  7 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         Wind~ <NA>  
#>  8 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         Wind~ <NA>  
#>  9 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         Wind~ <NA>  
#> 10 <NA>   Ericac~ Arctosta~ 0.72  <NA>            NRA         Yose~ <NA>  
#> 11 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         Yose~ <NA>  
#> 12 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         Yose~ <NA>  
#> 13 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         Yose~ <NA>  
#> 14 <NA>   Rosace~ Holodisc~ 0.71  <NA>            NRA         Yose~ <NA>  
#> 15 <NA>   Grossu~ Ribes ne~ NRA   <NA>            NRA         Yose~ <NA>  
#> 16 <NA>   Grossu~ Ribes ro~ NRA   <NA>            NRA         Yose~ <NA>

# Preserves different representations of missing values, e.g. "NRA".
allodb::wsg %>% 
  filter(sample_size == "NRA")
#> # A tibble: 16 x 8
#>    wsg_id family  species   wsg   wsg_specificity sample_size site  ref_id
#>    <chr>  <chr>   <chr>     <chr> <chr>           <chr>       <chr> <chr> 
#>  1 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         UCSC  <NA>  
#>  2 <NA>   Grossu~ Ribes di~ 0.73  <NA>            NRA         UCSC  <NA>  
#>  3 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         UCSC  <NA>  
#>  4 <NA>   Rosace~ Holodisc~ 0.71  <NA>            NRA         Wind~ <NA>  
#>  5 <NA>   Rosace~ Rubus le~ NRA   <NA>            NRA         Wind~ <NA>  
#>  6 <NA>   Rosace~ Rubus sp~ NRA   <NA>            NRA         Wind~ <NA>  
#>  7 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         Wind~ <NA>  
#>  8 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         Wind~ <NA>  
#>  9 <NA>   Ericac~ Vacciniu~ 0.47  <NA>            NRA         Wind~ <NA>  
#> 10 <NA>   Ericac~ Arctosta~ 0.72  <NA>            NRA         Yose~ <NA>  
#> 11 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         Yose~ <NA>  
#> 12 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         Yose~ <NA>  
#> 13 <NA>   Rhamna~ Ceanothu~ 0.67  <NA>            NRA         Yose~ <NA>  
#> 14 <NA>   Rosace~ Holodisc~ 0.71  <NA>            NRA         Yose~ <NA>  
#> 15 <NA>   Grossu~ Ribes ne~ NRA   <NA>            NRA         Yose~ <NA>  
#> 16 <NA>   Grossu~ Ribes ro~ NRA   <NA>            NRA         Yose~ <NA>

# All columns of the type that is most suitable for computation**
# E.g.: Notice that `sample_size` is integer.
as_allodb(allodb::wsg)
#> # A tibble: 419 x 8
#>    wsg_id family  species   wsg   wsg_specificity sample_size site  ref_id
#>  * <chr>  <chr>   <chr>     <chr> <chr>                 <int> <chr> <chr> 
#>  1 <NA>   Sapind~ Acer rub~ 0.49  <NA>                     NA Lill~ <NA>  
#>  2 <NA>   Sapind~ Acer sac~ 0.56  <NA>                     NA Lill~ <NA>  
#>  3 <NA>   Rosace~ Amelanch~ 0.66  <NA>                     NA Lill~ <NA>  
#>  4 <NA>   Rosace~ Amelanch~ 0.66  <NA>                     NA Lill~ <NA>  
#>  5 <NA>   Rosace~ Amelanch~ 0.66  <NA>                     NA Lill~ <NA>  
#>  6 <NA>   Annona~ Asimina ~ 0.47  <NA>                     NA Lill~ <NA>  
#>  7 <NA>   Betula~ Carpinus~ 0.58  <NA>                     NA Lill~ <NA>  
#>  8 <NA>   Juglan~ Carya al~ 0.62  <NA>                     10 Lill~ <NA>  
#>  9 <NA>   Juglan~ Carya co~ 0.6   <NA>                     10 Lill~ <NA>  
#> 10 <NA>   Juglan~ Carya gl~ 0.66  <NA>                     10 Lill~ <NA>  
#> # ... with 409 more rows

# Weird representations of missing values are lost (e.g. no more "NRA")
as_allodb(allodb::wsg) %>% 
  filter(sample_size == "NRA")
#> # A tibble: 0 x 8
#> # ... with 8 variables: wsg_id <chr>, family <chr>, species <chr>,
#> #   wsg <chr>, wsg_specificity <chr>, sample_size <int>, site <chr>,
#> #   ref_id <chr>

# ** Notice a possible bug: `wsg` should be double -- not character.

Created on 2018-09-25 by the reprex package (v0.2.1)

ropensci / allodb

Store different representations of missing values in a new column? #45