When teaching examples using R, instructors often use clean datasets, but these aren't very realistic and aren't what students will later encounter in the real world. Real datasets have typos, missing values encoded in strange ways, and stray whitespace. The {messy} R package takes a clean dataset and randomly adds these problems, giving students the opportunity to practice their data cleaning and wrangling skills without you having to change all of your examples.
Install from GitHub using:
remotes::install_github("nrennie/messy")
messy()
set.seed(1234)
messy(ToothGrowth[1:10,])
len supp dose
1 4.2 VC 0.5
2 11.5 <NA> <NA>
3 7.3 VC 0.5
4 5.8 (VC 0.5
5 6.4 VC <NA>
6 10 VC 0.5
7 11.2 <NA> 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7 VC 0.5
Increase how messy the data is:
set.seed(1234)
messy(ToothGrowth[1:10,], messiness = 0.7)
len supp dose
1 <NA> <NA> <NA>
2 11.5 <NA> <NA>
3 <NA> <NA> <NA>
4 5.8 <NA> <NA>
5 <NA> .v*c <NA>
6 <NA> <NA> <NA>
7 <NA> <NA> <NA>
8 <NA> <NA> 0.5
9 <NA> v@c <NA>
10 <NA> <NA> <NA>
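Before cleaning, it can help students to measure how messy a dataset actually is. The following is a minimal base R sketch, not part of {messy}: the `messy_df` data frame is typed by hand to resemble the output above, so the exact values are an assumption.

```r
# Hand-typed stand-in resembling messy() output (not generated by the package)
messy_df <- data.frame(
  len  = c(NA, "11.5", NA, "5.8"),
  supp = c(NA, NA, ".v*c", "v@c"),
  dose = c(NA, NA, NA, "0.5"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(messy_df))

# Proportion of missing values in each column
na_share <- colMeans(is.na(messy_df))

print(na_counts)
print(na_share)
```

Comparing these proportions for different values of `messiness` gives a rough feel for how the argument scales.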
add_whitespace()
Randomly adds whitespace to the ends of some values, meaning that numeric columns may be converted to character:
set.seed(1234)
add_whitespace(ToothGrowth[1:10,])
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7 VC 0.5
Apply to only some columns:
set.seed(1234)
add_whitespace(ToothGrowth[1:10,], cols = "supp")
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
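On the cleaning side, students could undo this with base R. This is a sketch under assumptions: the `df` data frame below is typed by hand to mimic values with stray trailing spaces, and is not output from add_whitespace().

```r
# Hand-typed stand-in: trailing spaces have turned `len` into character
df <- data.frame(
  len  = c("4.2 ", "11.5", "7.3 "),
  supp = c("VC", "VC ", "VC"),
  stringsAsFactors = FALSE
)

# trimws() strips leading and trailing whitespace from each value
df$supp <- trimws(df$supp)

# Trim, then convert the column back to numeric
df$len <- as.numeric(trimws(df$len))

str(df)
```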
change_case()
Randomly switches the case of values in character or factor columns, choosing between upper case, lower case, and no change:
set.seed(1234)
change_case(ToothGrowth[1:10,], messiness = 0.5)
len supp dose
1 4.2 vc 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 vc 0.5
8 11.2 vc 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
By default, the case of the entire string is changed. Alternatively, you can specify to change the case of each individual letter:
set.seed(1234)
change_case(ToothGrowth[1:10,], messiness = 0.5, case_type = "letter")
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 vC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 Vc 0.5
8 11.2 Vc 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
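A cleaning exercise for this kind of mess might normalise everything to one case. The sketch below uses only base R, and the `supp` vector is typed by hand to cover the case variants shown above.

```r
# Hand-typed stand-in covering the case variants shown above
supp <- c("vc", "VC", "vC", "Vc")

# Normalise every value to upper case
supp_clean <- toupper(supp)

# All variants now collapse to a single level
print(unique(supp_clean))

# If a factor is wanted, rebuild it from the cleaned strings
supp_factor <- factor(supp_clean)
print(levels(supp_factor))
```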
add_special_chars()
Randomly adds special characters to character strings:
set.seed(1234)
add_special_chars(ToothGrowth[1:10,])
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 (VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
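One way students might remove the stray characters again is with a regular expression. This is a base R sketch, assuming the clean values contain only letters and digits. The `supp` vector is typed by hand from the kinds of values shown in this README.

```r
# Hand-typed stand-in for values with inserted special characters
supp <- c("VC", "(VC", "V#C", "VC*", "!VC")

# Drop every character that is not a letter or a digit
supp_clean <- gsub("[^[:alnum:]]", "", supp)

print(supp_clean)
```

If the real values can legitimately contain punctuation (for example hyphens), the pattern would need to allow those characters.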
make_missing()
Randomly make some values missing using NA:
set.seed(1234)
make_missing(ToothGrowth[1:10,])
len supp dose
1 4.2 VC 0.5
2 11.5 VC NA
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 NA VC 0.5
8 11.2 VC NA
9 5.2 VC 0.5
10 7.0 VC 0.5
Add a different missing value representation for some columns:
set.seed(1234)
make_missing(ToothGrowth[1:10,], cols = "supp", missing = "999")
len supp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 999 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
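To recover from a non-standard missing value code such as "999", students can recode it to a real NA. This is a minimal base R sketch, with a hand-typed `df` standing in for the output above.

```r
# Hand-typed stand-in: "999" is used as a missing value code in `supp`
df <- data.frame(
  len  = c(11.2, 11.2, 5.2),
  supp = c("999", "VC", "VC"),
  stringsAsFactors = FALSE
)

# Replace the sentinel with a proper NA
df$supp[df$supp == "999"] <- NA

print(is.na(df$supp))
```

When reading from a file, the `na.strings` argument of `read.csv()` can do the same job at import time.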
messy_colnames()
Create messy column names:
set.seed(1234)
messy_colnames(ToothGrowth[1:10,])
)len s(upp dose
1 4.2 VC 0.5
2 11.5 VC 0.5
3 7.3 VC 0.5
4 5.8 VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 VC 0.5
8 11.2 VC 0.5
9 5.2 VC 0.5
10 7.0 VC 0.5
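Cleaning the column names back up is a useful exercise too. The sketch below is base R only and assumes the original names were lower case letters with no digits or underscores. The messy names are typed by hand from examples in this README.

```r
# Hand-typed stand-in for messy column names
messy_names <- c("!l_e)n", "S^UPP", "d^o)se")

# Remove everything that is not a letter, then convert to lower case
clean_names <- tolower(gsub("[^[:alpha:]]", "", messy_names))

print(clean_names)
```

Names that legitimately contain underscores or digits would need a gentler pattern, since this one strips them.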
You can pipe together multiple functions to create custom messy transformations:
set.seed(1234)
ToothGrowth[1:10,] |>
  make_missing(cols = "supp", missing = " ") |>
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |>
  add_whitespace(cols = "supp", messiness = 0.5) |>
  add_special_chars(cols = "supp")
len supp dose
1 4.2 VC 0.5
2 11.5 VC NA
3 7.3 VC 0.5
4 5.8 *VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 0.5
8 11.2 V#C NA
9 5.2 !VC 0.5
10 7.0 VC* 0.5
If you're adding messy_colnames() to a chain (and you specify only some columns in other functions), make sure messy_colnames() comes at the end:
set.seed(1234)
ToothGrowth[1:10,] |>
  make_missing(cols = "supp", missing = " ") |>
  make_missing(cols = c("len", "dose"), missing = c(NA, 999)) |>
  add_whitespace(cols = "supp", messiness = 0.5) |>
  add_special_chars(cols = "supp") |>
  messy_colnames()
!l_e)n S^UPP d^o)se
1 4.2 VC 0.5
2 11.5 VC NA
3 7.3 VC 0.5
4 5.8 *VC 0.5
5 6.4 VC 0.5
6 10.0 VC 0.5
7 11.2 0.5
8 11.2 V#C NA
9 5.2 !VC 0.5
10 7.0 VC* 0.5
Otherwise, the column names you try to select may no longer exist!
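To see why the order matters, here is a small base R sketch. The messy names are assigned by hand, copied from the output above rather than generated by messy_colnames(), so treat them as an illustration only.

```r
# Start from the same slice of the built-in ToothGrowth dataset
df <- ToothGrowth[1:10, ]

# Hand-assigned stand-in for what messy_colnames() might produce
names(df) <- c("!l_e)n", "S^UPP", "d^o)se")

# The original name is gone, so selecting by "supp" can no longer work
supp_still_present <- "supp" %in% names(df)
print(supp_still_present)
```

Any later step that selects `cols = "supp"` would therefore be looking for a column that is no longer there.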