njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/
Other
649 stars 54 forks source link

Add helper function to add random missingness #298

Closed njtierney closed 1 year ago

njtierney commented 2 years ago

rather than needing to do something like:

x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10
x[sample(x = length(x), size = 5)] <- NA
x
#>  [1] NA NA  3 NA NA  6  7  8  9 NA

add_n_na <- function(x, n_na){
  x[sample(x = vctrs::vec_size(x), size = n_na)] <- NA
  x
}

x <- 1:10
x
#>  [1]  1  2  3  4  5  6  7  8  9 10
add_n_na(x, 3)
#>  [1]  1  2  3  4 NA  6  7 NA NA 10

Created on 2022-04-05 by the reprex package (v2.0.1)

Session info ``` r sessioninfo::session_info() #> ─ Session info πŸ‡»πŸ‡Ί ⏺️ πŸ•°οΈ ─────────────────────────────────────────────────── #> hash: flag: Vanuatu, record button, mantelpiece clock #> #> setting value #> version R version 4.1.3 (2022-03-10) #> os macOS Big Sur 11.2.2 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_AU.UTF-8 #> ctype en_AU.UTF-8 #> tz Australia/Melbourne #> date 2022-04-05 #> pandoc 2.17.1.1 @ /Applications/RStudio.app/Contents/MacOS/quarto/bin/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> backports 1.4.1 2021-12-13 [1] CRAN (R 4.1.1) #> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.1) #> crayon 1.5.1 2022-03-26 [1] CRAN (R 4.1.3) #> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.1) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.0) #> evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.1) #> fansi 1.0.3 2022-03-24 [1] CRAN (R 4.1.1) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.0) #> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.1) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.1) #> highr 0.9 2021-04-16 [1] CRAN (R 4.1.0) #> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.1) #> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.1) #> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.1) #> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.1) #> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.1) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.0) #> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.0) #> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.0) #> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.0) #> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.0) #> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.1) #> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.1) #> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.1) #> rmarkdown 2.11 2021-09-14 [1] CRAN (R 4.1.1) #> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0) #> sessioninfo 1.2.1 2021-11-02 [1] CRAN (R 4.1.1) #> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.1) #> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.1) #> styler 1.6.2 2021-09-23 [1] CRAN (R 4.1.1) #> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.1) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.0) #> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.0) #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.1) #> xfun 0.30 2022-03-02 [1] CRAN (R 4.1.1) #> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.1) #> #> [1] /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```
njtierney commented 1 year ago
set_prop_miss <- function(x, prop = 0.1) {
  x[sample(seq_along(x) <= prop * length(x))] <- NA
  x
}

set_n_miss <- function(x, n = 1) {
  x[sample(seq_along(x) <= n)] <- NA
  x
}

library(tidyverse)
library(naniar)

df <- tibble(
  x = rnorm(100),
  y = rpois(100, lambda = 5)
)

df
#> # A tibble: 100 Γ— 2
#>         x     y
#>     <dbl> <int>
#>  1  1.53      4
#>  2  0.281     7
#>  3  0.609    10
#>  4  2.18      7
#>  5 -0.256     9
#>  6 -0.565     5
#>  7  0.827     4
#>  8 -0.878     4
#>  9 -1.14      8
#> 10  1.17      8
#> # … with 90 more rows

set_prop_miss(df$x, 0.1) %>% prop_miss()
#> [1] 0.1
set_prop_miss(df$x, 0.5) %>% prop_miss()
#> [1] 0.5
set_prop_miss(df$x, 0.75) %>% prop_miss()
#> [1] 0.75

set_n_miss(df$x, 2) %>% n_miss()
#> [1] 2
set_n_miss(df$x, 10) %>% n_miss()
#> [1] 10
set_n_miss(df$x, 50) %>% n_miss()
#> [1] 50

Created on 2023-01-31 with reprex v2.0.2

Session info ``` r sessioninfo::session_info() #> ─ Session info ─────────────────────────────────────────────────────────────── #> setting value #> version R version 4.2.1 (2022-06-23) #> os macOS Monterey 12.3.1 #> system aarch64, darwin20 #> ui X11 #> language (EN) #> collate en_US.UTF-8 #> ctype en_US.UTF-8 #> tz Australia/Hobart #> date 2023-01-31 #> pandoc 2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown) #> #> ─ Packages ─────────────────────────────────────────────────────────────────── #> package * version date (UTC) lib source #> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0) #> backports 1.4.1 2021-12-13 [1] CRAN (R 4.2.0) #> broom 1.0.2 2022-12-15 [1] CRAN (R 4.2.0) #> cellranger 1.1.0 2016-07-27 [1] CRAN (R 4.2.0) #> cli 3.6.0 2023-01-09 [1] CRAN (R 4.2.0) #> colorspace 2.1-0 2023-01-23 [1] CRAN (R 4.2.0) #> crayon 1.5.2 2022-09-29 [1] CRAN (R 4.2.0) #> DBI 1.1.3 2022-06-18 [1] CRAN (R 4.2.0) #> dbplyr 2.3.0 2023-01-16 [1] CRAN (R 4.2.0) #> digest 0.6.31 2022-12-11 [1] CRAN (R 4.2.0) #> dplyr * 1.1.0 2023-01-29 [1] CRAN (R 4.2.1) #> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.2.0) #> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.0) #> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0) #> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.2.0) #> forcats * 0.5.2 2022-08-19 [1] CRAN (R 4.2.0) #> fs 1.6.0 2023-01-23 [1] CRAN (R 4.2.0) #> gargle 1.2.1 2022-09-08 [1] CRAN (R 4.2.0) #> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.0) #> ggplot2 * 3.4.0 2022-11-04 [1] CRAN (R 4.2.0) #> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0) #> googledrive 2.0.0 2021-07-08 [1] CRAN (R 4.2.0) #> googlesheets4 1.0.1 2022-08-13 [1] CRAN (R 4.2.0) #> gtable 0.3.1 2022-09-01 [1] CRAN (R 4.2.0) #> haven 2.5.1 2022-08-22 [1] CRAN (R 4.2.0) #> hms 1.1.2 2022-08-19 [1] CRAN (R 4.2.0) #> htmltools 0.5.4 2022-12-07 [1] CRAN (R 4.2.0) #> httr 1.4.4 2022-08-17 [1] CRAN (R 4.2.0) #> jsonlite 1.8.4 2022-12-06 [1] CRAN (R 4.2.0) #> knitr 1.41.9 2023-01-20 [1] https://yihui.r-universe.dev (R 4.2.2) #> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0) #> lubridate 1.9.1 2023-01-24 [1] CRAN (R 4.2.0) #> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0) #> modelr 0.1.10 2022-11-11 [1] CRAN (R 4.2.0) #> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.2.0) #> naniar * 0.6.1.9001 2023-01-31 [1] local #> pillar 1.8.1 2022-08-19 [1] CRAN (R 4.2.0) #> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0) #> purrr * 1.0.1 2023-01-10 [1] CRAN (R 4.2.0) #> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0) #> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0) #> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0) #> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0) #> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0) #> readr * 2.1.3 2022-10-01 [1] CRAN (R 4.2.0) #> readxl 1.4.1 2022-08-17 [1] CRAN (R 4.2.0) #> reprex 2.0.2 2022-08-17 [1] CRAN (R 4.2.0) #> rlang 1.0.6 2022-09-24 [1] CRAN (R 4.2.0) #> rmarkdown 2.20 2023-01-19 [1] CRAN (R 4.2.0) #> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.0) #> rvest 1.0.3 2022-08-19 [1] CRAN (R 4.2.0) #> scales 1.2.1 2022-08-20 [1] CRAN (R 4.2.0) #> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.2.0) #> stringi 1.7.12 2023-01-11 [1] CRAN (R 4.2.0) #> stringr * 1.5.0 2022-12-02 [1] CRAN (R 4.2.0) #> styler 1.9.0 2023-01-15 [1] CRAN (R 4.2.0) #> tibble * 3.1.8 2022-07-22 [1] CRAN (R 4.2.0) #> tidyr * 1.3.0 2023-01-24 [1] CRAN (R 4.2.0) #> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0) #> tidyverse * 1.3.2 2022-07-18 [1] CRAN (R 4.2.0) #> timechange 0.2.0 2023-01-11 [1] CRAN (R 4.2.0) #> tzdb 0.3.0 2022-03-28 [1] CRAN (R 4.2.0) #> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.2.0) #> vctrs 0.5.2 2023-01-23 [1] CRAN (R 4.2.0) #> visdat 0.6.0.9000 2022-12-13 [1] local #> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0) #> xfun 0.36 2022-12-21 [1] CRAN (R 4.2.0) #> xml2 1.3.3 2021-11-30 [1] CRAN (R 4.2.0) #> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0) #> #> [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library #> #> ────────────────────────────────────────────────────────────────────────────── ```

Note that add_n_miss and add_prop_miss are already functions for adding helper columns on the proportion and number of missing values to a dataset