Open ablack3 opened 6 years ago
Here is an idea. Can I supply visdat with a table containing the missing value pattern in the database table and get a nice plot?
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(wakefield))
df <- r_data_frame(n= 100, id, age, height) %>%
r_na(prob = .4)
con <- DBI::dbConnect(RSQLite::SQLite(), path = ":memory:")
tbl_dbi <- copy_to(con, df, "df")
tbl_dbi
#> # Source: table<df> [?? x 3]
#> # Database: sqlite 3.19.3 []
#> ID Age Height
#> <chr> <int> <dbl>
#> 1 001 NA 71.0
#> 2 002 21 68.0
#> 3 003 NA 62.0
#> 4 004 34 NA
#> 5 005 NA NA
#> 6 006 20 NA
#> 7 007 31 NA
#> 8 008 24 72.0
#> 9 009 20 74.0
#> 10 010 NA 70.0
#> # ... with more rows
# function to pull info from database to feed into visdat
count_missing <- function(tbl_dbi, ...){
dots <- rlang::enquos(...)
tbl_dbi %>%
ungroup() %>%
mutate_at(vars(!!!dots), funs(case_when(
. == "" ~ "blank",
is.na(.) ~ "NA",
TRUE ~ "not blank/NA"
))) %>%
count(!!!dots)
}
# pull info from database needed to construct visdat plot
count_missing(tbl_dbi, Age, Height)
#> # Source: lazy query [?? x 3]
#> # Database: sqlite 3.19.3 []
#> # Groups: Age
#> Age Height n
#> <chr> <chr> <int>
#> 1 NA NA 16
#> 2 NA not blank/NA 24
#> 3 not blank/NA NA 24
#> 4 not blank/NA not blank/NA 36
Created on 2018-04-06 by the reprex package (v0.2.0).
Thanks for filing an issue, and for the nice illustrative example!
I would like to have visdat work with objects "tbl_dbi" "tbl_sql" "tbllazy" "tbl" - the way that I think would be best to address this would be to create separate functions for each of the `visfamily to operate on
tbl_dbi` objects.
So currently there is vis_dat()
which checks if you have a dataframe, but then there would be vis_dat.tbl_dbi
, which would "just work" with tbl_dbi
- so you'd type vis_dat(tbl_dbi)
and you'd get a plot that you would expect.
This might be made easier by some of the ideas in retrieving the data, described in #78.
At the moment I don't think that this will make it into the next release of visdat (submitting today or tomorrow to CRAN), but I would be really interested to discuss this in more detail - would you be interested in discussing this further?
Sure! I think that it would be handy to easily get a missing data plot from a database table. The application I'm thinking of would be for use with large datasets that are too big to get into R. Your idea of making the vis_dat
function generic and creating a tbl_dbi method seems like a good one to me. For column oriented databases creating a viz based on a few columns should be very fast even for a huge number of rows.
Would you be able to create a lower level internal function that would accept something like the table produced by the count_missing
function I defined above and produce a viz?
Is it possible to use visdat with a database table (referenced by a tbl_dbi object in R)?