njtierney / naniar

Tidy data structures, summaries, and visualisations for missing data
http://naniar.njtierney.com/

If you want to look for a "?" as missing character, the argument must be "\\?" #247

Closed edples closed 1 year ago

edples commented 4 years ago

Hello,

If you want to count the "?"s as missing data in your data frame, you need to pass "\\?" as the search argument. With a bare "?", nearly every non-missing value in each column is counted as a match, not just the "?"s. Example:

`data.frame':   2671 obs. of  21 variables:
 $ x0 : chr  "b" "a" "a" "b" ...
 $ x1 : num  30.8 58.7 24.5 27.8 20.2 ...
 $ x2 : num  NA 4.46 0.5 1.54 5.62 ...
 $ x3 : chr  "u" "u" "u" "u" ...
 $ x4 : chr  "g" "g" "g" "g" ...
 $ x5 : chr  "w" "q" "q" "w" ...
 $ x6 : chr  "v" "h" "h" "v" ...
 $ x7 : num  1.25 3.04 1.5 3.75 1.71 ...
 $ x8 : chr  "t" "t" "t" "t" ...
 $ x9 : chr  "t" "t" "f" "t" ...
 $ x10: chr  "t" "6" "f" "5" ...
 $ x11: chr  "f" "f" "f" "t" ...
 $ x12: chr  "g" "g" "g" "g" ...
 $ x13: num  202 43 280 100 120 360 164 80 180 52 ...
 $ x14: num  NA 560 824 3 NA ...
 $ x20: chr  "t" "t" "t" "t" ...
 $ x17: num  116.9 225.6 92.1 104.2 77.9 ...
 $ x18: num  0.579 25.41 2.317 8.045 31.111 ...
 $ x19: num  202000 43000 280000 100000 120000 360000 164000 80000 180000 52000 ...
 $ x16: chr  "f" "f" "f" "f" ...
 $ y  : chr  "good" "good" "good" "good" ...

print(miss_scan_count(data = training, search = list("?")), n=22) will output this:

Variable     n
   <chr>    <int>
 1 x0        2671
 2 x1        2662
 3 x2        2546
 4 x3        2671
 5 x4        2671
 6 x5        2671
 7 x6        2671
 8 x7        2477
 9 x8        2671
10 x9        2671
11 x10       2671
12 x11       2671
13 x12       2671
14 x13       1961
15 x14       1590
16 x20       2671
17 x17       2662
18 x18       2671
19 x19       1961
20 x16       2671
21 y         2671`

The correct call is: print(miss_scan_count(data = training, search = list("\\?")), n=22)
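
A minimal base R sketch of why the escaping matters, assuming the search term ends up being interpreted as a regular expression (the need for "\\?" suggests this, but the thread does not show the implementation). In a regular expression, "?" is a quantifier rather than a literal character, so it has to be escaped, wrapped in a character class, or matched with regex interpretation switched off. The vector `x` is hypothetical sample data, not the `training` data from this issue.

```r
# Hypothetical sample values; not taken from the issue's `training` data.
x <- c("a", "?", "b?", NA, "c")

# Escaped question mark: matches only strings containing a literal "?".
escaped <- grepl("\\?", x)

# Character class: another way to write a literal "?" in a regex.
bracketed <- grepl("[?]", x)

# fixed = TRUE turns off regex interpretation entirely.
literal <- grepl("?", x, fixed = TRUE)

# Number of values containing a literal "?" (NA never matches).
n_escaped <- sum(escaped)
```

All three approaches flag the same two elements ("?" and "b?"). Only the escaped and character-class forms are plain patterns, so those are the ones that could be passed as a search string.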

njtierney commented 1 year ago

Hi there, thanks for noting this, much appreciated!

I've updated the documentation to describe how to search for special characters like "?"; it now looks like this:

library(naniar)

dat_ms <- tibble::tribble(~x,  ~y,    ~z,  ~specials,
                         1,   "A",   -100, "?",
                         3,   "N/A", -99,  "!",
                         NA,  NA,    -98,  ".",
                         -99, "E",   -101, "*",
                         -98, "F",   -1,  "-")

miss_scan_count(dat_ms,-99)
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            1
#> 2 y            0
#> 3 z            1
#> 4 specials     0
miss_scan_count(dat_ms,c(-99,-98))
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            2
#> 2 y            0
#> 3 z            2
#> 4 specials     0
miss_scan_count(dat_ms,c("-99","-98","N/A"))
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            2
#> 2 y            1
#> 3 z            2
#> 4 specials     0
miss_scan_count(dat_ms, "\\?")
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            0
#> 2 y            0
#> 3 z            0
#> 4 specials     1
miss_scan_count(dat_ms, "\\!")
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            0
#> 2 y            0
#> 3 z            0
#> 4 specials     1
miss_scan_count(dat_ms, "\\.")
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            0
#> 2 y            0
#> 3 z            0
#> 4 specials     1
miss_scan_count(dat_ms, "\\*")
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            0
#> 2 y            0
#> 3 z            0
#> 4 specials     1
miss_scan_count(dat_ms, "-")
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            2
#> 2 y            0
#> 3 z            5
#> 4 specials     1
miss_scan_count(dat_ms,common_na_strings)
#> # A tibble: 4 × 2
#>   Variable     n
#>   <chr>    <int>
#> 1 x            4
#> 2 y            4
#> 3 z            5
#> 4 specials     5

Created on 2023-04-10 with reprex v2.0.2
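
The "-" search in the output above counts 2 matches in `x` and 5 in `z`, because the pattern also matches the minus sign of negative numbers. The base R sketch below illustrates that behaviour and one possible way around it, anchoring the pattern with `^` and `$`. It assumes regular-expression substring matching similar to `grepl()`; it uses its own small data frame `dat` (a hypothetical subset of the columns above) and does not call naniar.

```r
# Hypothetical two-column subset of the example data above.
dat <- data.frame(
  z = c(-100, -99, -98, -101, -1),
  specials = c("?", "!", ".", "*", "-"),
  stringsAsFactors = FALSE
)

# Unanchored: "-" matches any value whose text contains a minus sign,
# including every negative number once it is coerced to character.
loose <- vapply(dat, function(col) sum(grepl("-", col)), integer(1))

# Anchored: "^-$" matches only values that are exactly "-".
exact <- vapply(dat, function(col) sum(grepl("^-$", col)), integer(1))
```

Here `loose` counts all five values of `z` plus the single "-" in `specials`, while `exact` counts only the "-" in `specials`.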

Session info

``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.2.3 (2023-03-15)
#>  os       macOS Ventura 13.2
#>  system   aarch64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       Australia/Hobart
#>  date     2023-04-10
#>  pandoc   2.19.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version    date (UTC) lib source
#>  cli           3.6.0      2023-01-09 [1] CRAN (R 4.2.0)
#>  colorspace    2.1-0      2023-01-23 [1] CRAN (R 4.2.0)
#>  digest        0.6.31     2022-12-11 [1] CRAN (R 4.2.0)
#>  dplyr         1.1.1      2023-03-22 [1] CRAN (R 4.2.0)
#>  evaluate      0.20       2023-01-17 [1] CRAN (R 4.2.0)
#>  fansi         1.0.4      2023-01-22 [1] CRAN (R 4.2.0)
#>  fastmap       1.1.0      2021-01-25 [1] CRAN (R 4.2.0)
#>  fs            1.6.1      2023-02-06 [1] CRAN (R 4.2.0)
#>  generics      0.1.3      2022-07-05 [1] CRAN (R 4.2.0)
#>  ggplot2       3.4.1      2023-02-10 [1] CRAN (R 4.2.0)
#>  glue          1.6.2      2022-02-24 [1] CRAN (R 4.2.0)
#>  gtable        0.3.1      2022-09-01 [1] CRAN (R 4.2.0)
#>  htmltools     0.5.4      2022-12-07 [1] CRAN (R 4.2.0)
#>  knitr         1.42       2023-01-25 [1] CRAN (R 4.2.0)
#>  lifecycle     1.0.3      2022-10-07 [1] CRAN (R 4.2.0)
#>  magrittr      2.0.3      2022-03-30 [1] CRAN (R 4.2.0)
#>  munsell       0.5.0      2018-06-12 [1] CRAN (R 4.2.0)
#>  naniar      * 1.0.0.9000 2023-04-10 [1] local
#>  pillar        1.8.1      2022-08-19 [1] CRAN (R 4.2.0)
#>  pkgconfig     2.0.3      2019-09-22 [1] CRAN (R 4.2.0)
#>  purrr         1.0.1      2023-01-10 [1] CRAN (R 4.2.0)
#>  R.cache       0.16.0     2022-07-21 [1] CRAN (R 4.2.0)
#>  R.methodsS3   1.8.2      2022-06-13 [1] CRAN (R 4.2.0)
#>  R.oo          1.25.0     2022-06-12 [1] CRAN (R 4.2.0)
#>  R.utils       2.12.2     2022-11-11 [1] CRAN (R 4.2.0)
#>  R6            2.5.1      2021-08-19 [1] CRAN (R 4.2.0)
#>  reprex        2.0.2      2022-08-17 [1] CRAN (R 4.2.0)
#>  rlang         1.1.0      2023-03-14 [1] CRAN (R 4.2.0)
#>  rmarkdown     2.20       2023-01-19 [1] CRAN (R 4.2.0)
#>  rstudioapi    0.14       2022-08-22 [1] CRAN (R 4.2.0)
#>  scales        1.2.1      2022-08-20 [1] CRAN (R 4.2.0)
#>  sessioninfo   1.2.2      2021-12-06 [1] CRAN (R 4.2.0)
#>  styler        1.9.0      2023-01-15 [1] CRAN (R 4.2.0)
#>  tibble        3.2.1      2023-03-20 [1] CRAN (R 4.2.0)
#>  tidyr         1.3.0      2023-01-24 [1] CRAN (R 4.2.0)
#>  tidyselect    1.2.0      2022-10-10 [1] CRAN (R 4.2.0)
#>  utf8          1.2.3      2023-01-31 [1] CRAN (R 4.2.0)
#>  vctrs         0.6.1      2023-03-22 [1] CRAN (R 4.2.0)
#>  visdat        0.6.0      2023-02-02 [1] local
#>  withr         2.5.0      2022-03-03 [1] CRAN (R 4.2.0)
#>  xfun          0.37       2023-01-31 [1] CRAN (R 4.2.0)
#>  yaml          2.3.7      2023-01-23 [1] CRAN (R 4.2.0)
#>
#>  [1] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```