globalgov / messydates

R package for Extended Date/Time Format (EDTF)
https://globalgov.github.io/messydates

Create skimr template for messydates class #41

Closed BBieri closed 2 years ago

BBieri commented 2 years ago

See this resource. This will avoid unnecessary warnings when rendering the documentation for "many" packages and will allow us to get more information into the data documentation.

BBieri commented 2 years ago

A prerequisite to creating {skimr} templates for {messydates} objects is to define a set of summary functions that work on vectors of messydates (e.g. a date column in a dataframe of treaty or state observations). Here is an example:

Imagine looking at a dataframe from {manystates} and wondering what maximal time range the dataset covers (i.e. whether it contains states from 1990 to the present or from 1900 onwards). Since {messydates} accounts for uncertainty, the individual messydates in the date vector have to be resolved (e.g. by applying max, min or mean) before the range of the vector can be computed. We would then compute the range of the resolved vector and return it to the user through a {skimr} template for messydates.
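The resolve-then-summarise idea can be sketched in base R without relying on package internals. The helper `resolve_bounds()` below is hypothetical (it is not part of {messydates}) and only handles "YYYY" and "YYYY-MM" strings with an optional trailing "?". It merely illustrates how each imprecise entry maps to an earliest and a latest day before the vector-level range is taken.

```r
# Hypothetical helper, not part of {messydates}: map "YYYY" or "YYYY-MM"
# strings (optionally ending in "?") to their earliest and latest day.
resolve_bounds <- function(x) {
  x <- sub("\\?$", "", x)               # drop a trailing uncertainty marker
  year <- as.integer(substr(x, 1, 4))
  has_month <- nchar(x) >= 7
  lo_month <- as.integer(ifelse(has_month, substr(x, 6, 7), "01"))
  hi_month <- as.integer(ifelse(has_month, substr(x, 6, 7), "12"))
  earliest <- as.Date(sprintf("%04d-%02d-01", year, lo_month))
  # Last day of hi_month = first day of the following month minus one day.
  next_year <- year + (hi_month == 12L)
  next_month <- ifelse(hi_month == 12L, 1L, hi_month + 1L)
  latest <- as.Date(sprintf("%04d-%02d-01", next_year, next_month)) - 1
  data.frame(earliest = earliest, latest = latest)
}

# Resolve every entry first, then summarise across the whole vector.
bounds <- resolve_bounds(c("2001-01?", "2002-01?", "2003-01?"))
covered <- c(min(bounds$earliest), max(bounds$latest))
print(covered)
```

The point of the sketch is the order of operations: resolution happens per entry, and only then is a single range computed for the column.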

Here is a list of the metrics I thought relevant to implement:

Below is a reprex which outlines some of the possibilities for these functions. Please let me know what you think of this @henriquesposito, @jaeltan, @jhollway.

BBieri commented 2 years ago

#### Adding a skimr report function for messydt classes ####

# Various skimr helper functions. The first step is to resolve them before
# taking the mean, max, min, etc of the resolved vector.
#
# There might be an easier way of doing this in a single function with a
# bunch of conditional statements.
#
# Note: the date argument is a vector and the output should always be a scalar.

library(messydates)

maxmax <- function(date) {
  max(as.Date(max(date))) # Note: Date class for now could be anything
                          # (chr, messydt, etc.)
}

minmin <- function(date) {
  min(as.Date(min(date)))
}

meanmean <- function(date) {
  mean(as.Date(mean(date)))
}

medianmedian <- function(date) {
  median(as.Date(median(date)))
}

meanmode <- function(date) {
  mean(as.Date(modal(date)))
}
# Computing a simple uncertainty measure for vectors. Expressed in number of days.
uncertainty <- function(date) {
  sum(as.integer(messyvar(date))) / length(date)
}
# Messyvariance computes the range of the possible uncertain dates.
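messyvar <- function(date) {
  # Resolve each date to its latest and earliest possible value
  resolved <- data.frame(as.Date(max(date)), as.Date(min(date)))
  # Compute uncertainty as the number of days between the two
  vec <- numeric(nrow(resolved))
  for (i in seq_len(nrow(resolved))) { # seq_len() is safe for empty input
    vec[i] <- resolved[i, 1] - resolved[i, 2]
  }
  vec
}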

# Add skimmer for messydt class

get_skimmers.messydt <- function(column) {
  skimr::sfl(
    skim_type = "messydt",
    max = maxmax,
    min = minmin,
    mean = meanmean,
    median = medianmedian,
    mode = meanmode,
    uncertainty = uncertainty
  )
}

# Example

my_data <- data.frame(
  event = c("Event1", "Event2", "Event3"),
  messydates = as.character(2001:2003)
)
my_uncertain_data <- data.frame(
  event = c("Event1", "Event2", "Event3"),
  messydates = c("2001-01?", "2002-01?", "2003-01?")
)

my_data$messydates <- as_messydate(my_data$messydates)
my_uncertain_data$messydates <- as_messydate(my_uncertain_data$messydates)

# Test the skimr
skimr::skim(my_data)
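
|                        |         |
|:-----------------------|:--------|
| Name                   | my_data |
| Number of rows         | 3       |
| Number of columns      | 2       |
| Column type frequency: |         |
| character              | 1       |
| messydt                | 1       |
| Group variables        | None    |

Data summary

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:--------------|----------:|--------------:|----:|----:|------:|---------:|-----------:|
| event         |         0 |             1 |   6 |   6 |     0 |        3 |          0 |

Variable type: messydt

| skim_variable | n_missing | complete_rate | max        | min        | mean       | median     | mode       | uncertainty |
|:--------------|----------:|--------------:|:-----------|:-----------|:-----------|:-----------|:-----------|------------:|
| messydates    |         0 |             1 | 2003-12-31 | 2001-01-01 | 2002-07-02 | 2002-07-02 | 2002-01-01 |         364 |
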
# Test the helper functions
maxmax(my_data$messydates)
#> [1] "2003-12-31"
minmin(my_data$messydates)
#> [1] "2001-01-01"
meanmean(my_data$messydates)
#> [1] "2002-07-02"
medianmedian(my_data$messydates)
#> [1] "2002-07-02"
meanmode(my_data$messydates)
#> [1] "2002-01-01"
messyvar(my_data$messydates)
#> [1] 364 364 364
uncertainty(my_data$messydates)
#> [1] 364
# With uncertainty
maxmax(my_uncertain_data$messydates)
#> [1] "2003-01-31"
minmin(my_uncertain_data$messydates)
#> [1] "2001-01-01"
meanmean(my_uncertain_data$messydates)
#> [1] "2002-01-16"
medianmedian(my_uncertain_data$messydates)
#> [1] "2002-01-16"
meanmode(my_uncertain_data$messydates)
#> [1] "2002-01-01"
messyvar(my_uncertain_data$messydates)
#> [1] 30 30 30
uncertainty(my_uncertain_data$messydates)
#> [1] 30

Created on 2022-02-21 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Switzerland.1252
#>  ctype    English_Switzerland.1252
#>  tz       Europe/Berlin
#>  date     2022-02-21
#>  pandoc   2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>  base64enc     0.1-3   2015-07-28 [1] CRAN (R 4.1.0)
#>  cli           3.1.1   2022-01-20 [1] CRAN (R 4.1.2)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr         1.0.7   2021-06-18 [1] CRAN (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  glue          1.6.0   2021-12-17 [1] CRAN (R 4.1.2)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.1)
#>  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.1.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  messydates  * 0.2.0   2022-02-01 [1] local
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.2)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.2)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
#>  repr          1.1.4   2022-01-04 [1] CRAN (R 4.1.2)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
#>  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  skimr         2.1.3   2021-03-07 [1] CRAN (R 4.1.2)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.2)
#>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidyr         1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.1)
#>  xfun          0.29    2021-12-14 [1] CRAN (R 4.1.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#>
#>  [1] C:/Users/bjorn/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2/library
#>
#> ------------------------------------------------------------------------------
```

henriquesposito commented 2 years ago

Thank you @BBieri , good job!

With this approach, however, we might not be able to "resolve" all types of messydates (i.e. ranges, sets, or negative dates). Nor might we be able to expand, contract, or operate on all of these types.

One way out of this is to build the skimr template from the messydates methods/functions themselves, so that we can resolve, expand, contract, and operate on the various types of dates that messydates is made for. For example:

library(messydates)
get_skimmers.messydt <- function(x) {
  skimr::sfl(
    skim_type = "messydt",
    max = as.Date(x, max),
    min = as.Date(x, min),
    mean = as.Date(x, mean),
    median = as.Date(x, median),
    modal = as.Date(x, modal),
    random = as.Date(x, random)
  )
}

# Example (Is this an issue for tibbles as well?)
my_data <- tibble::tibble(event = c("Event1", "Event2", "Event3", "Event 4"),
                          dates = as_messydate(c("2001", "2001-01-01..2003-12-31", "{2001, 2002, 2003}", "33 BC")))

min(my_data$dates)
max(my_data$dates)
median(my_data$dates)
modal(my_data$dates)
mean(my_data$dates)
random(my_data$dates)
# skimr::skim(my_data$dates) does not work but I am sure we can find a way

What do you think?

BBieri commented 2 years ago

Hi Henrique!

Thanks for taking the time to create this example and for the feedback :)

Thanks for pointing out that max(my_data$dates) does not work for negative messydate vectors. I had missed that in my first attempt!

I have played around with the code a little more, and {skimr} works with tibbles and dataframes alike, so no issue there. Also, as.Date(x, max) yields "2001-12-31" "2003-12-31" "2003-12-31" "-033-12-31", i.e. the maximum of each messydate in the vector. My concern is that users may not find this very useful when exploring data in the various "many" packages. As a user, I would expect to be shown the maximum of the whole vector, e.g. "2003-12-31", so that I can see the temporal range covered by the dataset.
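The distinction between an element-wise maximum and a vector-level maximum can be illustrated with plain Date vectors. This is only a sketch: the `candidates` list and its values are made up for illustration and stand in for whatever dates each messydate could resolve to.

```r
# Each list element holds the candidate dates one imprecise entry could take
# (illustrative values only, not output from {messydates}).
candidates <- list(
  as.Date(c("2001-01-01", "2001-12-31")),
  as.Date(c("2001-01-01", "2003-12-31")),
  as.Date(c("2002-01-01", "2002-12-31"))
)

# Element-wise: one maximum per entry, so the result is as long as the input.
per_element_max <- do.call(c, lapply(candidates, max))

# Vector-level: a single scalar, which is what a one-row summary can display.
overall_max <- max(per_element_max)

print(per_element_max)
print(overall_max)
```

A summary report shows one row per column, so it needs the scalar from the second step rather than the vector from the first.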

Here is an updated reprex (the median function yields an unexpected NA; I'll have a look at that tomorrow):

#### Adding a skimr report function for messydt classes ####

# Various skimr helper functions. The first step is to resolve them before
# taking the mean, max, min, etc of the resolved vector.
#
# There might be an easier way of doing this in a single function with a
# bunch of conditional statements.
#
# Note: the date argument is a vector and the output should always be a scalar.

library(messydates)

maxmax <- function(date) {
  max(as.Date(date, max)) # Note: Date class for now could be anything
                          # (chr, messydt, etc.)
}

minmin <- function(date) {
  min(as.Date(date, min))
}

meanmean <- function(date) {
  mean(as.Date(date, mean))
}

medianmedian <- function(date) {
  median(as.Date(date, median))
}

meanmode <- function(date) {
  mean(as.Date(date, modal))
}
# Computing a simple uncertainty measure for vectors. Expressed in number of days.
uncertainty <- function(date) {
  sum(as.integer(messyvar(date))) / length(date)
}
# Messyvariance computes the range of the possible uncertain dates.
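messyvar <- function(date) {
  # Resolve each date to its latest and earliest possible value
  resolved <- data.frame(as.Date(date, max), as.Date(date, min))
  # Compute uncertainty as the number of days between the two
  vec <- numeric(nrow(resolved))
  for (i in seq_len(nrow(resolved))) { # seq_len() is safe for empty input
    vec[i] <- resolved[i, 1] - resolved[i, 2]
  }
  vec
}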

# Add skimmer method for messydt class

get_skimmers.messydt <- function(column) {
  skimr::sfl(
    skim_type = "messydt",
    max = maxmax,
    min = minmin,
    mean = meanmean,
    median = medianmedian,
    mode = meanmode,
    uncertainty = uncertainty
  )
}

# Example data
henriques_example <- tibble::tibble(event = c("Event1", "Event2",
                                             "Event3", "Event 4"),
                                   messydates = as_messydate(c("2001",
                                                  "2001-01-01..2003-12-31",
                                                  "{2001, 2002, 2003}",
                                                  "33 BC")))

# Test skimr
skimr::skim(henriques_example)
#> Warning: 1 failed to parse.

#> Warning: 1 failed to parse.
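
|                        |                   |
|:-----------------------|:------------------|
| Name                   | henriques_example |
| Number of rows         | 4                 |
| Number of columns      | 2                 |
| Column type frequency: |                   |
| character              | 1                 |
| messydt                | 1                 |
| Group variables        | None              |

Data summary

Variable type: character

| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|:--------------|----------:|--------------:|----:|----:|------:|---------:|-----------:|
| event         |         0 |             1 |   6 |   7 |     0 |        4 |          0 |

Variable type: messydt

| skim_variable | n_missing | complete_rate | max        | min        | mean       | median | mode       | uncertainty |
|:--------------|----------:|--------------:|:-----------|:-----------|:-----------|:-------|:-----------|------------:|
| messydates    |         0 |             1 | 2003-12-31 | -033-01-01 | 1493-07-01 | NA     | 1492-07-02 |         729 |
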
# With Henrique's example
maxmax(henriques_example$messydates)
#> [1] "2003-12-31"
minmin(henriques_example$messydates)
#> [1] "-033-01-01"
meanmean(henriques_example$messydates)
#> [1] "1493-07-01"
medianmedian(henriques_example$messydates)
#> Warning: 1 failed to parse.

#> Warning: 1 failed to parse.
#> [1] NA
meanmode(henriques_example$messydates)
#> [1] "1492-07-02"
messyvar(henriques_example$messydates)
#> [1]  364 1094 1094  364
uncertainty(henriques_example$messydates)
#> [1] 729

Created on 2022-02-21 by the reprex package (v2.0.1)

Session info

``` r
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Windows 10 x64 (build 22000)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Switzerland.1252
#>  ctype    English_Switzerland.1252
#>  tz       Europe/Berlin
#>  date     2022-02-21
#>  pandoc   2.11.4 @ C:/Program Files/RStudio/bin/pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#>  package     * version date (UTC) lib source
#>  assertthat    0.2.1   2019-03-21 [1] CRAN (R 4.1.0)
#>  backports     1.4.1   2021-12-13 [1] CRAN (R 4.1.2)
#>  base64enc     0.1-3   2015-07-28 [1] CRAN (R 4.1.0)
#>  cli           3.1.1   2022-01-20 [1] CRAN (R 4.1.2)
#>  crayon        1.4.2   2021-10-29 [1] CRAN (R 4.1.1)
#>  DBI           1.1.2   2021-12-20 [1] CRAN (R 4.1.2)
#>  digest        0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr         1.0.7   2021-06-18 [1] CRAN (R 4.1.2)
#>  ellipsis      0.3.2   2021-04-29 [1] CRAN (R 4.1.0)
#>  evaluate      0.14    2019-05-28 [1] CRAN (R 4.1.0)
#>  fansi         0.5.0   2021-05-25 [1] CRAN (R 4.1.0)
#>  fastmap       1.1.0   2021-01-25 [1] CRAN (R 4.1.0)
#>  fs            1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics      0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  glue          1.6.0   2021-12-17 [1] CRAN (R 4.1.2)
#>  highr         0.9     2021-04-16 [1] CRAN (R 4.1.0)
#>  htmltools     0.5.2   2021-08-25 [1] CRAN (R 4.1.1)
#>  jsonlite      1.7.2   2020-12-09 [1] CRAN (R 4.1.1)
#>  knitr         1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lifecycle     1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  lubridate     1.8.0   2021-10-07 [1] CRAN (R 4.1.2)
#>  magrittr      2.0.1   2020-11-17 [1] CRAN (R 4.1.0)
#>  messydates  * 0.2.1   2022-02-21 [1] local
#>  pillar        1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.1.0)
#>  purrr         0.3.4   2020-04-17 [1] CRAN (R 4.1.0)
#>  R.cache       0.15.0  2021-04-30 [1] CRAN (R 4.1.2)
#>  R.methodsS3   1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo          1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils       2.11.0  2021-09-26 [1] CRAN (R 4.1.2)
#>  R6            2.5.1   2021-08-19 [1] CRAN (R 4.1.1)
#>  repr          1.1.4   2022-01-04 [1] CRAN (R 4.1.2)
#>  reprex        2.0.1   2021-08-05 [1] CRAN (R 4.1.1)
#>  rlang         0.4.12  2021-10-18 [1] CRAN (R 4.1.1)
#>  rmarkdown     2.11    2021-09-14 [1] CRAN (R 4.1.1)
#>  rstudioapi    0.13    2020-11-12 [1] CRAN (R 4.1.0)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  skimr         2.1.3   2021-03-07 [1] CRAN (R 4.1.2)
#>  stringi       1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr       1.4.0   2019-02-10 [1] CRAN (R 4.1.0)
#>  styler        1.6.2   2021-09-23 [1] CRAN (R 4.1.2)
#>  tibble        3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidyr         1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  tidyselect    1.1.1   2021-04-30 [1] CRAN (R 4.1.0)
#>  utf8          1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs         0.3.8   2021-04-29 [1] CRAN (R 4.1.0)
#>  withr         2.4.3   2021-11-30 [1] CRAN (R 4.1.1)
#>  xfun          0.29    2021-12-14 [1] CRAN (R 4.1.2)
#>  yaml          2.2.1   2020-02-01 [1] CRAN (R 4.1.0)
#>
#>  [1] C:/Users/bjorn/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2/library
#>
#> ------------------------------------------------------------------------------
```

henriquesposito commented 2 years ago

@BBieri great job, thank you!

Just two very minor points:

  1. I am not sure we want one measure of "uncertainty" as a range of days for all messydates contained in a variable. Each messydate may have certain levels of uncertainty, or not, and that cannot be reflected in one row of a skimr report. I would just suggest we remove the uncertainty variable from the report.

  2. Is there a way we can make the report table and data summary a bit prettier? We could perhaps use {kableExtra} for this or simply modify the settings within skimr.

In any case, thank you for the great job already!

BBieri commented 2 years ago

Thanks again for the helpful feedback @henriquesposito ! Regarding your two points:

  1. I agree with you that the uncertainty measure might need some tweaking and should not be interpreted as a range of days but as some sort of index allowing for ordinal comparisons between datasets. I still think that such a metric is useful for users when performing exploratory data analysis. Any input @jaeltan and @jhollway ?

  2. Should be possible technically although it may be beyond the scope of this issue/package and might go against the "standardized reporting" approach {skimr} takes. Let me know if you had anything specific in mind :)

jaeltan commented 2 years ago

Hi @BBieri , thanks for all the work you've put into this!

I'm not sure what the uncertainty metric here is supposed to achieve. Does it return the total number of dates possible for the variable? If that's the case, I would agree with @henriquesposito that it's not very helpful... Maybe it could instead return the number of date entries in the variable that have uncertainty?
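A count of uncertain entries, as suggested here, could look roughly like the following base R sketch. It assumes each entry has already been resolved to an earliest and a latest possible day; the two vectors and their values are illustrative and do not come from {messydates}.

```r
# Illustrative bounds: two imprecise entries and one exact date.
earliest <- as.Date(c("2001-01-01", "2002-01-01", "2003-05-17"))
latest   <- as.Date(c("2001-12-31", "2002-01-31", "2003-05-17"))

# An entry counts as uncertain when its bounds differ.
is_uncertain <- latest > earliest
n_uncertain <- sum(is_uncertain)       # number of uncertain entries
share_uncertain <- mean(is_uncertain)  # same information as a proportion

print(n_uncertain)
print(share_uncertain)
```

Unlike an average number of days, this count does not depend on how wide each individual interval is, which may make it easier to interpret in a one-row summary.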

jhollway commented 2 years ago

A summary measure of uncertainty across a vector (or, rather, list of vectors) could be conceived of in different ways. In the meantime, why not something like a measure of entropy?
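One possible reading of an entropy-style measure, offered only as a sketch: treat every entry as a uniform distribution over the days it could refer to, so that its Shannon entropy in bits is the base-2 logarithm of the number of candidate days. The uniform assumption, the variable names, and the example bounds are all assumptions made for illustration.

```r
# Illustrative bounds: a whole year, a single month, and an exact date.
earliest <- as.Date(c("2001-01-01", "2002-01-01", "2003-05-17"))
latest   <- as.Date(c("2001-12-31", "2002-01-31", "2003-05-17"))

# Number of candidate days per entry (both end points included).
n_days <- as.integer(latest - earliest) + 1L

# Entropy of a uniform distribution over n outcomes is log2(n) bits,
# so an exact date contributes 0 bits.
entropy_bits <- log2(n_days)
mean_entropy <- mean(entropy_bits)

print(entropy_bits)
print(mean_entropy)
```

Because the logarithm compresses large intervals, a single very wide range dominates this average less than it would dominate an average number of days.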

BBieri commented 2 years ago

I finally took some time (sorry for the delay on this) to implement a basic version in the package. Currently, the implemented functions are min (the non-NA date furthest back in time in the vector), max (the non-NA date furthest ahead in time in the vector), and mean (the mean of the non-NA dates in the vector). I still need to think about how we want to deal with observations that are open-ended date intervals.
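The NA handling described above can be sketched with base R Date vectors. The function name `summarise_dates()` is hypothetical and the input is assumed to be already resolved; this is not the code that was added to the package.

```r
# Hypothetical helper: NA-aware summaries of an already resolved Date vector.
summarise_dates <- function(x) {
  list(
    min  = min(x, na.rm = TRUE),   # earliest non-NA date
    max  = max(x, na.rm = TRUE),   # latest non-NA date
    mean = mean(x, na.rm = TRUE)   # mean of the non-NA dates
  )
}

resolved <- as.Date(c("2001-01-01", NA, "2003-12-31"))
stats <- summarise_dates(resolved)
print(stats)
```

Open-ended intervals are deliberately left out of this sketch, since how to treat them is still an open question in the thread.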

BBieri commented 2 years ago

Following issues with rendering the documentation of {manystates}, I discovered that the newly defined methods used to compute statistics for the {skimr} report are very slow for larger datasets (e.g. more than a thousand dates). This is mainly due to bottlenecks in how dates are resolved before a vector-level statistic is computed on them.

The solution I have now implemented yields a fast but less informative report on messydt vectors, and it gets rid of the warnings we used to get when generating documentation for dataframes with messydt columns. Moving forward, I'll see whether it is possible to speed up either the {skimr} methods or the resolve functions.
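If resolution is indeed the expensive step, one generic way to reduce the cost is to resolve the vector once and derive all statistics from that single result, rather than resolving again for every statistic. The sketch below illustrates this with a toy resolver; `toy_resolver()` and `summarise_once()` are hypothetical names and do not reflect how {messydates} or {skimr} are implemented.

```r
# Toy resolver standing in for an expensive resolution step (hypothetical).
# It counts how often it is called so the saving is visible.
n_calls <- 0
toy_resolver <- function(x) {
  n_calls <<- n_calls + 1
  data.frame(
    earliest = as.Date(paste0(x, "-01-01")),
    latest   = as.Date(paste0(x, "-12-31"))
  )
}

# Resolve once, then compute every statistic from the cached bounds.
summarise_once <- function(x, resolver) {
  bounds <- resolver(x)
  list(
    min = min(bounds$earliest),
    max = max(bounds$latest),
    n_uncertain = sum(bounds$latest > bounds$earliest)
  )
}

report <- summarise_once(c("2001", "2002", "2003"), toy_resolver)
print(report)
print(n_calls)
```

Whether this pattern fits the way {skimr} calls its summary functions (one call per statistic) is not something the thread settles, so it should be read as an idea to test rather than a fix.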

Thanks for the help @henriquesposito tracking down this documentation issue! :pray:

BBieri commented 2 years ago

I am putting this issue in the "awaiting deployment" pipeline since the {skimr} template was created. Moving forward, we need to accelerate the resolve methods to be able to display the documentation faster. See issue #49.