thomaszwagerman / butterfly

Verification of continually updating timeseries data where we expect new values, but want to ensure previous data remains unchanged.
https://thomaszwagerman.github.io/butterfly/

butterfly

Lifecycle: stable. Project Status: Active – The project has reached a stable, usable state and is being actively developed.

The goal of butterfly is to aid in the verification of continually updating timeseries data, where we expect new values over time, but want to ensure previous data remains unchanged, and timesteps remain continuous.

An illustration of continually updating timeseries data where a previous value unexpectedly changes.

Data previously recorded could change for a number of reasons, such as discovery of an error in model code, a change in methodology or instrument recalibration. Monitoring data sources for these changes is not always possible.

Unnoticed changes in previous data could have unintended consequences, such as invalidating a published dataset’s Digital Object Identifier (DOI), or altering future predictions if used as input in forecasting models.

This package provides functionality that can be used as part of a data pipeline to check for and flag changes to previous data, so that they do not go unnoticed.

Installation

You can install the development version of butterfly from GitHub with:

# install.packages("devtools")
devtools::install_github("thomaszwagerman/butterfly")

Overview

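The functions demonstrated in this README are butterfly::loupe(), which reports new rows and changes to previous values; butterfly::catch(), which returns only the changed rows; and butterfly::release(), which drops the changed rows and keeps the rest.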

There are also dummy datasets, which are fictional and exist purely to demonstrate butterfly functionality, such as butterflycount, used in the examples below.

Examples

This is a basic example which shows you how to use butterfly:

library(butterfly)

# Imagine a continually updated dataset that starts in January and is updated once a month
butterflycount$january
#>         time count
#> 1 2024-01-01    22
#> 2 2023-12-01    55
#> 3 2023-11-01    11

# In February an additional row appears, all previous data remains the same
butterflycount$february
#>         time count
#> 1 2024-02-01    17
#> 2 2024-01-01    22
#> 3 2023-12-01    55
#> 4 2023-11-01    11

# In March an additional row appears again
# ...but a previous value has unexpectedly changed
butterflycount$march
#>         time count
#> 1 2024-03-01    23
#> 2 2024-02-01    17
#> 3 2024-01-01    22
#> 4 2023-12-01    55
#> 5 2023-11-01    18

We can use butterfly::loupe() to examine in detail whether previous values have changed.

butterfly::loupe(
  butterflycount$february,
  butterflycount$january,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-02-01    17
#> ✔ And there are no differences with previous data.
#> [1] TRUE

butterfly::loupe(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-03-01    23
#> 
#> ℹ The following values have changes from the previous data.
#> old vs new
#>            count
#>   old[1, ]    17
#>   old[2, ]    22
#>   old[3, ]    55
#> - old[4, ]    18
#> + new[4, ]    11
#> 
#> `old$count`: 17.0 22.0 55.0 18.0
#> `new$count`: 17.0 22.0 55.0 11.0
#> [1] FALSE
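
Because loupe() returns TRUE or FALSE, as the final `#> [1]` lines of the output above show, its result can gate a later pipeline step. The following is a minimal sketch, assuming butterfly is installed; the warning text is illustrative and not part of the package.

```r
library(butterfly)

# loupe() prints its report and returns a single logical value
unchanged <- butterfly::loupe(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)

# Illustrative gate: flag the run if previous values were altered
if (!isTRUE(unchanged)) {
  warning("Previous values have changed; review before publishing.")
}
```

Whether to warn, stop, or log at this point depends on how strict the pipeline needs to be.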

butterfly::loupe() uses dplyr::semi_join() to match the new and old objects using a common unique identifier, which in a timeseries will be the timestep. waldo::compare() is then used to compare these and provide a detailed report of the differences.
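
The two steps described above can be sketched directly with dplyr and waldo. This illustrates the idea and is not butterfly's actual source code. The data frames are rebuilt by hand from the example output, and the column types (Date and double) are assumptions.

```r
# Requires the dplyr and waldo packages to be installed.
# Data rebuilt from the example output; column types are assumed.
february <- data.frame(
  time  = as.Date(c("2024-02-01", "2024-01-01", "2023-12-01", "2023-11-01")),
  count = c(17, 22, 55, 11)
)
march <- data.frame(
  time  = as.Date(c("2024-03-01", "2024-02-01", "2024-01-01",
                    "2023-12-01", "2023-11-01")),
  count = c(23, 17, 22, 55, 18)
)

# Step 1: keep only the rows of the current data whose timestep
# already exists in the previous data
matched <- dplyr::semi_join(march, february, by = "time")

# Rows whose timestep is absent from the previous data are new
new_rows <- dplyr::anti_join(march, february, by = "time")

# Step 2: report any differences between the matched rows and the previous data
differences <- waldo::compare(matched, february)
print(differences)

# An empty comparison would mean that no previous value has changed
no_changes <- length(differences) == 0
```

Separating new rows from matched rows first is what lets the comparison ignore expected additions and focus on changes to existing timesteps.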

butterfly follows the waldo philosophy of erring on the side of providing too much information, rather than too little. It gives a detailed feedback message describing how the two objects differ.

Using butterfly for data wrangling

You might want to return changed rows as a dataframe, or drop them altogether. butterfly::catch() and butterfly::release() are provided for this.

Here, butterfly::catch() only returns rows which have changed from the previous version. It will not return new rows.

df_caught <- butterfly::catch(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-03-01    23
#> 
#> ℹ The following values have changes from the previous data.
#> old vs new
#>            count
#>   old[1, ]    17
#>   old[2, ]    22
#>   old[3, ]    55
#> - old[4, ]    18
#> + new[4, ]    11
#> 
#> `old$count`: 17.0 22.0 55.0 18.0
#> `new$count`: 17.0 22.0 55.0 11.0
#> 
#> ℹ Only these rows are returned.

df_caught
#>         time count
#> 1 2023-11-01    18

Conversely, butterfly::release() drops all rows which have changed from the previous version. Note that it retains new rows, as these are expected.

df_released <- butterfly::release(
  butterflycount$march,
  butterflycount$february,
  datetime_variable = "time"
)
#> The following rows are new in 'df_current': 
#>         time count
#> 1 2024-03-01    23
#> 
#> ℹ The following values have changes from the previous data.
#> old vs new
#>            count
#>   old[1, ]    17
#>   old[2, ]    22
#>   old[3, ]    55
#> - old[4, ]    18
#> + new[4, ]    11
#> 
#> `old$count`: 17.0 22.0 55.0 18.0
#> `new$count`: 17.0 22.0 55.0 11.0
#> 
#> ℹ These will be dropped, but new rows are included.

df_released
#>         time count
#> 1 2024-03-01    23
#> 2 2024-02-01    17
#> 3 2024-01-01    22
#> 4 2023-12-01    55

Relevant packages and functions

The butterfly package was created for the specific use case of handling continually updated or overwritten timeseries data, where previous values may change without notice.

There are other R packages and functions which handle object comparison, which may suit your specific needs better. Below we describe their overlap and differences to butterfly:

Other functions include all.equal() (base R) or dplyr’s setdiff().
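
As a rough illustration of how these two functions behave on the same kind of data, consider the sketch below. The data frames are rebuilt by hand from the example output earlier in this README, with assumed column types, and dplyr is assumed to be installed.

```r
# Data rebuilt from the example output; column types are assumed.
february <- data.frame(
  time  = as.Date(c("2024-02-01", "2024-01-01", "2023-12-01", "2023-11-01")),
  count = c(17, 22, 55, 11)
)
march <- data.frame(
  time  = as.Date(c("2024-03-01", "2024-02-01", "2024-01-01",
                    "2023-12-01", "2023-11-01")),
  count = c(23, 17, 22, 55, 18)
)

# Base R: restrict the current data to shared timesteps, then compare
overlap <- march[march$time %in% february$time, ]
rownames(overlap) <- NULL  # reset row names so only the values are compared
same <- isTRUE(all.equal(overlap, february))

# dplyr: rows of march that have no identical row in february
differing <- dplyr::setdiff(march, february)
print(differing)
```

Here all.equal() needs the new row to be filtered out by hand before comparing, and setdiff() returns the new row and the changed row together without telling them apart.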

butterfly in production

Read more about how butterfly is used in an operational data pipeline to verify a continually updated and published dataset.