The goal of butterfly is to aid in the verification of continually updating timeseries data, where we expect new values over time, but want to ensure previous data remains unchanged, and timesteps remain continuous.
Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible.
Unnoticed changes in previous data could have unintended consequences, such as invalidating a published dataset’s Digital Object Identfier (DOI), or altering future predictions if used as input in forecasting models.
This package provides functionality that can be used as part of a data pipeline, to check and flag changes to previous data to prevent changes going unnoticed.
You can install the development version of butterfly from GitHub with:
# install.packages("devtools")
devtools::install_github("thomaszwagerman/butterfly")
The butterfly package contains the following functions:
butterfly::loupe()
- examines in detail whether previous values have
changed, and returns TRUE/FALSE for no change/change.butterfly::catch()
- returns rows which contain previously changed
values in a dataframe.butterfly::release()
- drops rows which contain previously changed
values, and returns a dataframe containing new and unchanged rows.butterfly::create_object_list()
- returns a list of objects required
by all of loupe()
, catch()
and release()
. Contains underlying
functionality.butterfly::timeline()
- check if a timeseries is continuous between
timesteps.butterfly::timeline_group()
- group distinct, but continuous
sequences of a timeseries.There are also dummy datasets, which a fictional and purely to demonstrate butterfly functionality:
butterflycount
- a list of monthly dataframes, which contain
fictional butterfly counts for a given date.forestprecipitation
- a list of monthly dataframes, which contain
fictional daily precipitation measurements for a given date.This is a basic example which shows you how to use butterfly:
library(butterfly)
# Imagine a continually updated dataset that starts in January and is updated once a month
butterflycount$january
#> time count
#> 1 2024-01-01 22
#> 2 2023-12-01 55
#> 3 2023-11-01 11
# In February an additional row appears, all previous data remains the same
butterflycount$february
#> time count
#> 1 2024-02-01 17
#> 2 2024-01-01 22
#> 3 2023-12-01 55
#> 4 2023-11-01 11
# In March an additional row appears again
# ...but a previous value has unexpectedly changed
butterflycount$march
#> time count
#> 1 2024-03-01 23
#> 2 2024-02-01 17
#> 3 2024-01-01 22
#> 4 2023-12-01 55
#> 5 2023-11-01 18
We can use butterfly::loupe()
to examine in detail whether previous
values have changed.
butterfly::loupe(
butterflycount$february,
butterflycount$january,
datetime_variable = "time"
)
#> The following rows are new in 'df_current':
#> time count
#> 1 2024-02-01 17
#> ✔ And there are no differences with previous data.
#> [1] TRUE
butterfly::loupe(
butterflycount$march,
butterflycount$february,
datetime_variable = "time"
)
#> The following rows are new in 'df_current':
#> time count
#> 1 2024-03-01 23
#>
#> ℹ The following values have changes from the previous data.
#> old vs new
#> count
#> old[1, ] 17
#> old[2, ] 22
#> old[3, ] 55
#> - old[4, ] 18
#> + new[4, ] 11
#>
#> `old$count`: 17.0 22.0 55.0 18.0
#> `new$count`: 17.0 22.0 55.0 11.0
#> [1] FALSE
butterfly::loupe()
uses dplyr::semi_join()
to match the new and old
objects using a common unique identifier, which in a timeseries will be
the timestep. waldo::compare()
is then used to compare these and
provide a detailed report of the differences.
butterfly
follows the waldo
philosophy of erring on the side of
providing too much information, rather than too little. It will give a
detailed feedback message on the status between two objects.
You might want to return changed rows as a dataframe, or drop them
altogether. For this butterfly::catch()
and butterfly::release()
are
provided.
Here, butterfly::catch()
only returns rows which have changed from
the previous version. It will not return new rows.
df_caught <- butterfly::catch(
butterflycount$march,
butterflycount$february,
datetime_variable = "time"
)
#> The following rows are new in 'df_current':
#> time count
#> 1 2024-03-01 23
#>
#> ℹ The following values have changes from the previous data.
#> old vs new
#> count
#> old[1, ] 17
#> old[2, ] 22
#> old[3, ] 55
#> - old[4, ] 18
#> + new[4, ] 11
#>
#> `old$count`: 17.0 22.0 55.0 18.0
#> `new$count`: 17.0 22.0 55.0 11.0
#>
#> ℹ Only these rows are returned.
df_caught
#> time count
#> 1 2023-11-01 18
Conversely, butterfly::release()
drops all rows which had changed from
the previous version. Note it retains new rows, as these were expected.
df_released <- butterfly::release(
butterflycount$march,
butterflycount$february,
datetime_variable = "time"
)
#> The following rows are new in 'df_current':
#> time count
#> 1 2024-03-01 23
#>
#> ℹ The following values have changes from the previous data.
#> old vs new
#> count
#> old[1, ] 17
#> old[2, ] 22
#> old[3, ] 55
#> - old[4, ] 18
#> + new[4, ] 11
#>
#> `old$count`: 17.0 22.0 55.0 18.0
#> `new$count`: 17.0 22.0 55.0 11.0
#>
#> ℹ These will be dropped, but new rows are included.
df_released
#> time count
#> 1 2024-03-01 23
#> 2 2024-02-01 17
#> 3 2024-01-01 22
#> 4 2023-12-01 55
The butterfly package was created for a specific use case of handling continuously updating/overwritten timeseries data, where previous values may change without notice.
There are other R packages and functions which handle object comparison,
which may suit your specific needs better. Below we describe their
overlap and differences to butterfly
:
butterfly
uses
waldo::compare()
in every function to provide a report on
difference. There is therefore significant overlap, however
butterfly
builds on waldo
by providing the functionality of
comparing objects where we expect some changes, with previous versions
but not others. butterfly
also provides extra user feedback to
provide clarity on what it is and isn’t comparing, due to the nature
of comparing only “matched” rows.waldo
, but
specifically for data frames, diffdf
provides the ability to compare
data frames directly. We could have used diffdf::diffdf()
in our
case, but we prefer waldo
’s more explicit and clear user feedback.
That said, there is significant overlap in functionality:
butterfly::loupe()
and diffdf::diffdf_has_issues()
both provide a
TRUE/FALSE difference check, while diffdf::diffdf_issue_rows()
and
butterfly::catch()
both return the rows where changes have occurred.
However, it lacks the flexibility of butterfly
to compare object
where we expect some changes, but not others.assertr
provides assertion functionality that can be used as part of a
pipeline, and test assertions on a particular dataset, but it does not
offer tools for comparison. We do highly recommend using assertr
for
checks, prior to using butterfly
, as any data quality issues will be
caught first.daquiri
provides
tools to check data quality and visually inspect timeseries data. It
is also quality assurance package for timeseries, but has a very
different purpose to butterfly
.Other functions include all.equal()
(base R) or
dplyr’s setdiff()
.
butterfly
in productionRead more about how butterfly
is used in an operational data
pipeline
to verify a continually updated and published dataset.