Existing packages for working with dates in R expect them to be tidy.
That is, they should be in or coercible to the standard yyyy-mm-dd
format.
But dates are often messy. Sometimes we only know the year when something happened, leaving other components of the date, such as the month or day, unspecified. This is often the case with historical dates, for instance. Sometimes we can only say approximately when an event occurred, that it occurred before or after a certain date, or we recognise that our best estimate comes from a dubious source. Other times there exists a set or range of possible dates for an event.
Although researchers generally recognise this messiness, many feel
expected to force artificial precision or unfortunate imprecision on
temporal data to proceed with analysis. For example, if we only know
something happened in 2021, then we might revert to a panel data
design even if greater precision is available, or opt to replace this
date with the start of that year (2021-01-01), assuming that erring on
the earlier (or later) side is more justifiable than a random date
within that month or year.
However, this can create inferential issues when timing or sequence is
important. {messydates} assists with this problem by retaining and
working with various kinds of date imprecision.
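To make the inferential issue concrete, here is a minimal base R sketch. The two events and their dates are hypothetical, chosen only for illustration: event A is known only to have occurred in 2021, while event B is known to have occurred on 2021-03-15.

```r
# Hypothetical events: A is known only to the year, B to the day.
event_b <- as.Date("2021-03-15")

# Every day in 2021 is compatible with what we know about A.
a_candidates <- seq(as.Date("2021-01-01"), as.Date("2021-12-31"), by = "day")

# Forcing A to the start of the year asserts that A preceded B ...
a_imputed <- as.Date("2021-01-01")
a_imputed < event_b

# ... yet only a fraction of the compatible dates actually precede B.
share_before <- mean(a_candidates < event_b)
share_before
```

The imputed date implies that A came first, whereas that ordering holds for only one fifth of the days consistent with the record (73 of 365).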
{messydates} implements for R the Extended Date/Time Format (EDTF)
annotations set by the International Organization for Standardization
(ISO) outlined in ISO 8601-2:2019(E). {messydates} introduces a new
mdate class that embeds these annotations, and offers a set of methods
for constructing and coercing into and from the mdate class, as well as
tools for working with such ‘messy’ dates.
library(dplyr) # attaches the %>% pipe used below
pkg_comparison <- tibble::tribble(
  ~Example, ~OriginalDate,
  "Normal date", "2012-01-01",
  "Future date", "2599-12-31",
  "Historical date", "476",
  "Era date", "33 BC",
  "Written date", "First of February, two thousand and twelve",
  "MDY date", "10-31-2012",
  "DMY date", "31-10-2012",
  "Wrongly specified date", "2012-31-10",
  "Approximate date", "2012-01-12~",
  "Uncertain date", "2012-01-01?",
  "Unspecified date", "2012-01",
  "Censored date", "..2012-01-12",
  "Range of dates", "2012-11-01:2012-12-01",
  "Set of dates", "2012-5-26, 2012-11-19, 2012-12-4"
) %>%
  # Parse the same strings with base R, {lubridate}, and {messydates}
  dplyr::mutate(
    base = as.Date(OriginalDate),
    lubridate = suppressWarnings(lubridate::as_date(OriginalDate)),
    messydates = messydates::as_messydate(OriginalDate)
  )
Example | OriginalDate | base | lubridate | messydates |
---|---|---|---|---|
Normal date | 2012-01-01 | 2012-01-01 | 2012-01-01 | 2012-01-01 |
Future date | 2599-12-31 | 2599-12-31 | 2599-12-31 | 2599-12-31 |
Historical date | 476 | NA | NA | 0476 |
Era date | 33 BC | NA | NA | -0033 |
Written date | First of February, two thousand and twelve | NA | NA | 2012-02-01 |
MDY date | 10-31-2012 | NA | NA | 2012-10-31 |
DMY date | 31-10-2012 | 0031-10-20 | NA | 2012-10-31 |
Wrongly specified date | 2012-31-10 | NA | NA | 2012-10-31 |
Approximate date | 2012-01-12~ | 2012-01-12 | 2012-01-12 | 2012-01-12~ |
Uncertain date | 2012-01-01? | 2012-01-01 | 2012-01-01 | 2012-01-01? |
Unspecified date | 2012-01 | NA | 2020-12-01 | 2012-01 |
Censored date | ..2012-01-12 | NA | 2012-01-12 | ..2012-01-12 |
Range of dates | 2012-11-01:2012-12-01 | 2012-11-01 | 2012-11-01 | 2012-11-01..2012-12-01 |
Set of dates | 2012-5-26, 2012-11-19, 2012-12-4 | 2012-05-26 | NA | {2012-05-26,2012-11-19,2012-12-04} |
As can be seen in the table above, other date/time packages in R do not handle ‘messy’ dates well. Normal “yyyy-mm-dd” structures or other date formats that can easily be coerced into this structure are usually not a problem.
However, some syntaxes are entirely ignored, such as historical dates and dates from other eras (e.g. BCE), as well as written dates, frequently used in historical texts or treaties.
Other times, existing packages return a date, but strip away any annotations that express uncertainty or approximateness, introducing artificial precision.
And sometimes returning only a single date means ignoring other
information included. We see this here in how only the end of the
censored date, only the start of the date range, or the first in the set
of dates is returned. Sometimes date components even seem guessed, such
as how 2012-01 (January 2012) is assumed to be 1 December 2020 by
{lubridate}.
So only {messydates} enables researchers to retain all this
information. But most analysis does still expect some precision in dates
to work.
The first way that {messydates} assists researchers who use dates in
the mdate class is by providing methods for converting back into common
date classes such as Date, POSIXct, and POSIXlt. It is thus fully
compatible with packages such as {lubridate} and {anydate}.
As messy date annotations can indicate multiple possible dates,
{messydates} allows e.g. ranges or sets of dates to be unpacked or
expanded into all compatible dates.
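The idea of expansion can be sketched in base R. This is not the {messydates} implementation: the helper name `expand_sketch` is hypothetical, and it only handles closed ranges written with `..` and sets written in braces, as they appear in the messydates column above. Open-ended (censored) dates and other annotations are deliberately left out.

```r
# Hypothetical helper: NOT the {messydates} implementation, only an illustration.
# Handles two annotation forms: "start..end" ranges and "{d1,d2,...}" sets.
expand_sketch <- function(x) {
  if (grepl("..", x, fixed = TRUE)) {
    # Range: enumerate every day between the two endpoints, inclusive.
    ends <- as.Date(strsplit(x, "..", fixed = TRUE)[[1]])
    seq(ends[1], ends[2], by = "day")
  } else if (startsWith(x, "{")) {
    # Set: strip the braces and split on commas.
    members <- strsplit(gsub("[{}]", "", x), ",", fixed = TRUE)[[1]]
    as.Date(trimws(members))
  } else {
    # Already a single precise date.
    as.Date(x)
  }
}

range_days <- expand_sketch("2012-11-01..2012-12-01")
set_days <- expand_sketch("{2012-05-26,2012-11-19,2012-12-04}")

length(range_days) # number of days compatible with the range
set_days           # the three dates named in the set
```

Each annotated value becomes an ordinary vector of Date objects, which any other package can then consume.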
Since most methods of analysis or modelling expect single date
observations, we offer ways to resolve this multiplicity when coercing
mdate-class objects into other date formats. For example, researchers
might explicitly choose to favour the min(), max(), mean(), median(),
or even a random() date. This greatly facilitates research
transparency by demanding a conscious choice from researchers, as well
as supporting robustness checks by enabling description or inference
across dates compatible with the messy annotated date.
resolve_mdate <- pkg_comparison %>%
  dplyr::select(messydates) %>%
  dplyr::mutate(
    min = as.Date(messydates, min),
    median = as.Date(messydates, median),
    max = as.Date(messydates, max)
  )
messydates | min | median | max |
---|---|---|---|
2012-01-01 | 2012-01-01 | 2012-01-01 | 2012-01-01 |
2599-12-31 | 2599-12-31 | 2599-12-31 | 2599-12-31 |
0476 | 0476-01-01 | 0476-07-02 | 0476-12-31 |
-0033 | -033-01-01 | -033-07-02 | -033-12-31 |
2012-02-01 | 2012-02-01 | 2012-02-01 | 2012-02-01 |
2012-10-31 | 2012-10-31 | 2012-10-31 | 2012-10-31 |
2012-10-31 | 2012-10-31 | 2012-10-31 | 2012-10-31 |
2012-10-31 | 2012-10-31 | 2012-10-31 | 2012-10-31 |
2012-01-12~ | 2012-01-12 | 2012-01-12 | 2012-01-12 |
2012-01-01? | 2012-01-01 | 2012-01-01 | 2012-01-01 |
2012-01 | 2012-01-01 | 2012-01-16 | 2012-01-31 |
..2012-01-12 | 2012-01-12 | 2012-01-12 | 2012-01-12 |
2012-11-01..2012-12-01 | 2012-11-01 | 2012-11-16 | 2012-12-01 |
{2012-05-26,2012-11-19,2012-12-04} | 2012-05-26 | 2012-11-19 | 2012-12-04 |
As can be seen in the table above, all ‘precise’ dates are respected as
such, and returned no matter what ‘resolution’ function is given. But
for messy dates, the choice of function can make a difference. Where
only a year is given, e.g. 0476 or -0033, we draw from all the days
in the year. The minimum is the first of January and the maximum the
31st of December. Dates are also drawn from a set or range of dates when
given.
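The resolution step can be illustrated with a small base R sketch, using the unspecified date 2012-01 from the table. The helper `resolve_sketch` is hypothetical and only mirrors the idea described above, not the package's internals. Rounding a fractional result down to a whole day is an assumption made for this sketch.

```r
# Hypothetical helper: collapse a vector of compatible dates to a single date.
# Illustrative only; rounding down to a whole day is this sketch's assumption.
resolve_sketch <- function(candidates, fun) {
  # Work on day counts since 1970-01-01 so the result can be floored.
  as.Date(floor(fun(as.numeric(candidates))), origin = "1970-01-01")
}

# "2012-01" leaves the day unspecified, so all of January 2012 is compatible.
january <- seq(as.Date("2012-01-01"), as.Date("2012-01-31"), by = "day")

resolve_sketch(january, min)
resolve_sketch(january, median)
resolve_sketch(january, max)

# A precise date is returned unchanged whichever function is chosen.
precise <- as.Date("2012-01-01")
resolve_sketch(precise, median)
```

For January 2012 the three functions give the first, sixteenth, and thirty-first of the month, matching the 2012-01 row in the table above.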
When only an approximate or censored date is known, a range of dates is imputed based on some window (by default 3 years, months, or days), depending on whether the whole date or just a component of the date is annotated, and a precise date is then resolved from that range.
This translation via an expanded list of compatible dates is fast, robust, and extensible, allowing researchers to use messy dates in an analytic strategy that uses any other package.
Please see the cheat sheet and the messydates website for more
information about how to use {messydates}.
The easiest way to install {messydates} is directly from CRAN:
install.packages("messydates")
However, you may also install the development version from GitHub.
# install.packages("remotes")
remotes::install_github("globalgov/messydates")
The package was developed as part of the PANARCHIC project, which studies the effects of network and power on how quickly states join, reform, or create international institutions by examining the historical dynamics of institutional networks from different domains.
The PANARCHIC project is funded by the Swiss National Science Foundation (SNSF). For more information on current projects of the Geneva Global Governance Observatory, please see our GitHub website.