Closed BBieri closed 2 years ago
A prerequisite to creating {skimr} templates for {messydate} objects is to define a set of summary functions that work on vectors of messydates (e.g. in a dataframe of treaty/state observations). Here is an example:
Imagine looking at a dataframe from {manystates} and wondering what the maximal time range of states covered by the dataset is (i.e. whether the dataset contains states from 1990 to the present or from 1900 onwards). Since {messydates} account for uncertainty, the individual messydates in the date vector have to be resolved before computing the range of the vector (e.g. by applying max, min or mean). Then, we would compute the range of said vector and return the information to the user in the form of a {skimr} template for messydates.
Here is a list of the metrics I thought relevant to implement:
Below is a reprex which outlines some of the possibilities for these functions. Please let me know what you think of this @henriquesposito, @jaeltan, @jhollway.
#### Adding a skimr report function for messydt classes ####
# Various skimr helper functions. The first step is to resolve them before
# taking the mean, max, min, etc of the resolved vector.
#
# There might be an easier way of doing this in a single function with a
# bunch of conditional statements.
#
# Note: the date argument is a vector and the output should always be a scalar.
library(messydates)
maxmax <- function(date) {
  max(as.Date(max(date))) # Note: Date class for now could be anything
  # (chr, messydt, etc.)
}
minmin <- function(date) {
  min(as.Date(min(date)))
}
meanmean <- function(date) {
  mean(as.Date(mean(date)))
}
medianmedian <- function(date) {
  median(as.Date(median(date)))
}
meanmode <- function(date) {
  mean(as.Date(modal(date)))
}
# Computing a simple uncertainty measure for vectors. Expressed in number of days.
uncertainty <- function(date) {
  sum(as.integer(messyvar(date))) / length(date)
}
# Messyvariance computes the range of the possible uncertain dates.
messyvar <- function(date) {
  # Resolve each messydate to its latest and earliest possible date
  resolved <- data.frame(as.Date(max(date)), as.Date(min(date)))
  # Compute uncertainty as the number of days between the two
  vec <- NULL
  # seq_len() avoids iterating over c(1, 0) when the input is empty
  for (i in seq_len(nrow(resolved))) {
    vec[i] <- resolved[i, 1] - resolved[i, 2]
  }
  vec
}
# Add skimmer for messydt class
get_skimmers.messydt <- function(column) {
skimr::sfl(
skim_type = "messydt",
max = maxmax,
min = minmin,
mean = meanmean,
median = medianmedian,
mode = meanmode,
uncertainty = uncertainty
)
}
# Example
my_data <- data.frame(
event = c("Event1", "Event2", "Event3"),
messydates = as.character(2001:2003)
)
my_uncertain_data <- data.frame(
event = c("Event1", "Event2", "Event3"),
messydates = c("2001-01?", "2002-01?", "2003-01?")
)
my_data$messydates <- as_messydate(my_data$messydates)
my_uncertain_data$messydates <- as_messydate(my_uncertain_data$messydates)
# Test the skimr
skimr::skim(my_data)
Name | my_data |
Number of rows | 3 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
character | 1 |
messydt | 1 |
________________________ | |
Group variables | None |
Data summary
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
event | 0 | 1 | 6 | 6 | 0 | 3 | 0 |
Variable type: messydt
skim_variable | n_missing | complete_rate | max | min | mean | median | mode | uncertainty |
---|---|---|---|---|---|---|---|---|
messydates | 0 | 1 | 2003-12-31 | 2001-01-01 | 2002-07-02 | 2002-07-02 | 2002-01-01 | 364 |
# Test the helper functions
maxmax(my_data$messydates)
#> [1] "2003-12-31"
minmin(my_data$messydates)
#> [1] "2001-01-01"
meanmean(my_data$messydates)
#> [1] "2002-07-02"
medianmedian(my_data$messydates)
#> [1] "2002-07-02"
meanmode(my_data$messydates)
#> [1] "2002-01-01"
messyvar(my_data$messydates)
#> [1] 364 364 364
uncertainty(my_data$messydates)
#> [1] 364
# With uncertainty
maxmax(my_uncertain_data$messydates)
#> [1] "2003-01-31"
minmin(my_uncertain_data$messydates)
#> [1] "2001-01-01"
meanmean(my_uncertain_data$messydates)
#> [1] "2002-01-16"
medianmedian(my_uncertain_data$messydates)
#> [1] "2002-01-16"
meanmode(my_uncertain_data$messydates)
#> [1] "2002-01-01"
messyvar(my_uncertain_data$messydates)
#> [1] 30 30 30
uncertainty(my_uncertain_data$messydates)
#> [1] 30
Created on 2022-02-21 by the reprex package (v2.0.1)
Thank you @BBieri , good job!
With this approach, however, we might not be able to "resolve" all types of messydates (i.e. ranges, sets, or negative dates). Nor might we be able to expand, contract, or operate on all of these types.
One way out of this is to build the skimr template from the messydates methods/functions themselves, so that we can resolve, expand, contract, and operate on the various types of dates that messydates is made for. For example:
library(messydates)
get_skimmers.messydt <- function(x) {
skimr::sfl(
skim_type = "messydt",
max = as.Date(x, max),
min = as.Date(x, min),
mean = as.Date(x, mean),
median = as.Date(x, median),
modal = as.Date(x, modal),
random = as.Date(x, random)
)
}
# Example (Is this an issue for tibbles as well?)
my_data <- tibble::tibble(event = c("Event1", "Event2", "Event3", "Event 4"),
dates = as_messydate(c("2001", "2001-01-01..2003-12-31", "{2001, 2002, 2003}", "33 BC")))
min(my_data$dates)
max(my_data$dates)
median(my_data$dates)
modal(my_data$dates)
mean(my_data$dates)
random(my_data$dates)
# skimr::skim(my_data$dates) does not work but I am sure we can find a way
What do you think?
Hi Henrique!
Thanks for taking the time to create this example and for the feedback :)
Thanks for pointing out that max(my_data$dates) does not work for negative messydate vectors. I had missed that in my first attempt!
I have played around with the code a little more and {skimr} works with tibbles and dataframes alike, so no issue there. Finally, as.Date(x, max) yields "2001-12-31" "2003-12-31" "2003-12-31" "-033-12-31", i.e. the maximum of each messydate in a vector. My concern here is that users may not find this very useful when they explore data in the various many-packages. As a user, I would expect to be shown the maximum of the whole vector, e.g. "2003-12-31", to be able to see the temporal range covered by the dataset.
Here is an updated reprex (the median function yields an unexpected NA; I'll have a look at that tomorrow):
#### Adding a skimr report function for messydt classes ####
# Various skimr helper functions. The first step is to resolve them before
# taking the mean, max, min, etc of the resolved vector.
#
# There might be an easier way of doing this in a single function with a
# bunch of conditional statements.
#
# Note: the date argument is a vector and the output should always be a scalar.
library(messydates)
maxmax <- function(date) {
  max(as.Date(date, max)) # Note: Date class for now could be anything
  # (chr, messydt, etc.)
}
minmin <- function(date) {
  min(as.Date(date, min))
}
meanmean <- function(date) {
  mean(as.Date(date, mean))
}
medianmedian <- function(date) {
  median(as.Date(date, median))
}
meanmode <- function(date) {
  mean(as.Date(date, modal))
}
# Computing a simple uncertainty measure for vectors. Expressed in number of days.
uncertainty <- function(date) {
  sum(as.integer(messyvar(date))) / length(date)
}
# Messyvariance computes the range of the possible uncertain dates.
messyvar <- function(date) {
  # Resolve each messydate to its latest and earliest possible date
  resolved <- data.frame(as.Date(date, max), as.Date(date, min))
  # Compute uncertainty as the number of days between the two
  vec <- NULL
  # seq_len() avoids iterating over c(1, 0) when the input is empty
  for (i in seq_len(nrow(resolved))) {
    vec[i] <- resolved[i, 1] - resolved[i, 2]
  }
  vec
}
# Add skimmer method for messydt class
get_skimmers.messydt <- function(column) {
skimr::sfl(
skim_type = "messydt",
max = maxmax,
min = minmin,
mean = meanmean,
median = medianmedian,
mode = meanmode,
uncertainty = uncertainty
)
}
# Example data
henriques_example <- tibble::tibble(event = c("Event1", "Event2",
"Event3", "Event 4"),
messydates = as_messydate(c("2001",
"2001-01-01..2003-12-31",
"{2001, 2002, 2003}",
"33 BC")))
# Test skimr
skimr::skim(henriques_example)
#> Warning: 1 failed to parse.
#> Warning: 1 failed to parse.
Name | henriques_example |
Number of rows | 4 |
Number of columns | 2 |
_______________________ | |
Column type frequency: | |
character | 1 |
messydt | 1 |
________________________ | |
Group variables | None |
Data summary
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
event | 0 | 1 | 6 | 7 | 0 | 4 | 0 |
Variable type: messydt
skim_variable | n_missing | complete_rate | max | min | mean | median | mode | uncertainty |
---|---|---|---|---|---|---|---|---|
messydates | 0 | 1 | 2003-12-31 | -033-01-01 | 1493-07-01 | NA | 1492-07-02 | 729 |
# With Henrique's example
maxmax(henriques_example$messydates)
#> [1] "2003-12-31"
minmin(henriques_example$messydates)
#> [1] "-033-01-01"
meanmean(henriques_example$messydates)
#> [1] "1493-07-01"
medianmedian(henriques_example$messydates)
#> Warning: 1 failed to parse.
#> Warning: 1 failed to parse.
#> [1] NA
meanmode(henriques_example$messydates)
#> [1] "1492-07-02"
messyvar(henriques_example$messydates)
#> [1] 364 1094 1094 364
uncertainty(henriques_example$messydates)
#> [1] 729
Created on 2022-02-21 by the reprex package (v2.0.1)
@BBieri great job, thank you!
Just two very minor points:
I am not sure we want one measure of "uncertainty" as a range of days for all messydates contained in a variable. Each messydate may have certain levels of uncertainty, or not, and that cannot be reflected in one row of a skimr report. I would just suggest we remove the uncertainty variable from the report.
Is there a way we can make the report table and data summary a bit prettier? We could perhaps use {kableExtra} for this, or simply modify the settings within skimr.
In any case, thank you for the great job already!
Thanks again for the helpful feedback @henriquesposito ! Regarding your two points:
I agree with you that the uncertainty measure might need some tweaking and should not be interpreted as a range of days but as some sort of index allowing for ordinal comparisons between datasets. I still think that such a metric is useful for users when performing exploratory data analysis. Any input @jaeltan and @jhollway ?
Should be possible technically, although it may be beyond the scope of this issue/package and might go against the "standardized reporting" approach {skimr} takes. Let me know if you had anything specific in mind :)
Hi @BBieri , thanks for all the work you've put into this!
I'm not sure what the uncertainty metric is here supposed to achieve - does it return the total number of dates possible for the variable? If that's the case I would agree with @henriquesposito that it's not very helpful... Maybe it could return the number of date entries in the variable that have uncertainty instead?
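To make the "number of uncertain entries" idea concrete, here is a minimal base R sketch. The helper `n_uncertain()` is hypothetical and not part of {messydates}. It assumes the dates are available as annotated character strings, and that "?", "~" and "%" flag uncertain or approximate dates, ".." marks a range, and "{...}" marks a set. Entries that are merely imprecise (e.g. a bare year) are not counted here.

```r
# Hypothetical helper (not part of {messydates}): count the entries whose
# character representation carries an uncertainty marker.
# Assumed markers: "?", "~", "%" (uncertain/approximate), ".." (range),
# and "{" (start of a set).
n_uncertain <- function(date) {
  x <- as.character(date)
  sum(grepl("[?~%{]|\\.\\.", x))
}

example_dates <- c("2001-01-01", "2002-01?",
                   "2001-01-01..2003-12-31", "{2001, 2002, 2003}")
n_uncertain(example_dates)
```

A count like this stays a scalar, so it would fit in one row of a skimr report, and it avoids aggregating day ranges across entries.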
A summary measure of uncertainty across a vector (or, rather, list of vectors) could be conceived of in different ways. In the meantime, why not something like a measure of entropy?
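One possible reading of the entropy suggestion, as a rough base R sketch: treat each messy date as a uniform distribution over the days it could refer to, so that its Shannon entropy is log2 of the number of candidate days, and report the mean across the vector. The function `date_entropy()` and the uniform assumption are illustrative only. The inputs are the earliest and latest resolved dates per entry, however those are obtained.

```r
# Hypothetical sketch: mean Shannon entropy (in bits) per entry, assuming
# every day between the earliest and latest resolved date is equally likely.
date_entropy <- function(earliest, latest) {
  n_days <- as.integer(latest - earliest) + 1L  # candidate days per entry
  mean(log2(n_days))
}

earliest <- as.Date(c("2001-01-01", "2002-01-01", "2003-06-15"))
latest   <- as.Date(c("2001-12-31", "2002-01-31", "2003-06-15"))
date_entropy(earliest, latest)
```

A fully precise date contributes 0 bits, a month about 5 bits, and a year about 8.5 bits, so the index allows ordinal comparisons between datasets without being read as a number of days. Sets of non-contiguous dates would need a count of their actual members rather than the span used here.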
I finally took some time (sorry for the delay on this) to implement a basic version in the package. Currently, the implemented functions are min (the earliest non-NA date in the vector), max (the latest non-NA date in the vector), and mean (the mean of the non-NA dates in the vector). I still need to think about how we want to deal with cases where we have observations of open-ended intervals of dates.
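For illustration only, the NA handling described above looks like this on a vector of already resolved base R dates. This is a sketch and not the code that went into the package; the variable names are made up.

```r
# Sketch: vector-level statistics on resolved dates, ignoring NA entries.
resolved <- as.Date(c("2001-01-01", NA, "2003-12-31"))

vec_min  <- min(resolved, na.rm = TRUE)   # earliest non-NA date
vec_max  <- max(resolved, na.rm = TRUE)   # latest non-NA date
vec_mean <- mean(resolved, na.rm = TRUE)  # mean of the non-NA dates

format(c(vec_min, vec_max, vec_mean))
```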
Following issues with rendering the documentation of {manystates}, I have discovered that the newly defined methods used to compute statistics for the {skimr} report are very slow for larger datasets (e.g. more than a thousand dates). This is mainly due to bottlenecks in the way dates are resolved before a vector-level statistic is computed on the resolved dates.
The solution I have implemented now yields a fast but relatively less informative report on messydt vectors and gets rid of the warnings we have been getting in the past when generating documentation of dataframes with messydt columns. Moving forward, I'll see whether it is possible to accelerate either the {skimr} methods or the resolve functions.
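One direction that might help, sketched in base R under the assumption that the resolve step dominates the run time: resolve each vector to its earliest and latest dates once, then derive every statistic from those two vectors with vectorised operations instead of resolving again per statistic or looping over rows. `summarise_resolved()` is a hypothetical helper and has not been benchmarked against the package.

```r
# Hypothetical sketch: compute several summaries from dates that were
# resolved once, using vectorised Date arithmetic instead of a row loop.
summarise_resolved <- function(earliest, latest) {
  stopifnot(length(earliest) == length(latest))
  list(
    min = min(earliest, na.rm = TRUE),
    max = max(latest, na.rm = TRUE),
    spread_days = as.integer(latest - earliest)  # per-entry range in days
  )
}

earliest <- as.Date(c("2001-01-01", "2001-01-01", "2003-01-01"))
latest   <- as.Date(c("2001-12-31", "2003-12-31", "2003-01-31"))
res <- summarise_resolved(earliest, latest)
res$spread_days
```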
Thanks for the help @henriquesposito tracking down this documentation issue! :pray:
I am putting this issue in the "awaiting deployment" pipeline since the {skimr} template was created. Moving forward, we need to accelerate the resolve methods to be able to display the documentation faster. See issue #49.
See this resource. This will avoid unnecessary warnings when rendering the documentation for "many" packages and will allow us to get more information into the data documentation.