explore how to make visdat work with facetting #78

Closed njtierney closed 1 year ago

njtierney commented 6 years ago

as per Sam Firke's tweet:

njtierney commented 6 years ago

Some thoughts on this.

I think one good way forward, rather than supplying (perhaps only) a "facet" argument as in the `naniar::gg*` family, would be a "data method" for visdat.

This is already kind of provided, I think, in vis_gather_.

This could instead be exported, and called something (slightly) better like data_vis_dat. These data_* methods would provide the underlying data structure.

These could then have a .grouped_df method. So you would do something like

data %>%
  group_by(grouping) %>%
  # get the data structure
  data_vis_dat() %>%
  # perhaps vis_dat gains some S3 methods, so that it works with a grouped_df, and maybe has a special `.vis_dat` class?
  vis_dat()

This seems like a lot more work than just:

vis_dat(data, facet = grouping)

But it would allow for perhaps more flexible operations.

I don't think I can use facet as in regular ggplot, since that usually requires a change in the data structure first.
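To make the "data method" idea concrete, here is a minimal sketch, not visdat's actual implementation. The generic `data_vis_dat()`, its two methods, and the column names `rows`, `variable`, and `valueType` are all assumptions made for illustration. The point is only that once the long data structure carries the grouping column, ordinary ggplot faceting works on it.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical generic (not visdat's actual API): return the long-format
# data that a vis_dat-style plot would be drawn from.
data_vis_dat <- function(x, ...) UseMethod("data_vis_dat")

# One row per cell: its row number, its column name, and the column's class
# (or "NA" where the cell is missing).
data_vis_dat.data.frame <- function(x, ...) {
  x %>%
    mutate(across(everything(), ~ if_else(is.na(.x), "NA", class(.x)[1]))) %>%
    mutate(rows = row_number()) %>%
    pivot_longer(-rows, names_to = "variable", values_to = "valueType")
}

# Grouped method: build the same structure within each group and keep the
# grouping columns so they are available for faceting.
data_vis_dat.grouped_df <- function(x, ...) {
  group_modify(x, ~ data_vis_dat(.x)) %>% ungroup()
}

long <- airquality %>%
  group_by(Month) %>%
  data_vis_dat()

# Regular faceting now works because Month is a column of the long data.
# Use print(p) to draw the plot.
p <- ggplot(long, aes(x = variable, y = rows, fill = valueType)) +
  geom_raster() +
  scale_y_reverse() +
  facet_wrap(vars(Month))
```

Row numbers restart within each group here, which is what lets each facet panel start at the top.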

njtierney commented 6 years ago

I want to pursue this idea, but at a later date

jzadra commented 4 years ago

Just a note, I repeatedly need this ability and so wrote a little hack using the patchwork package that makes individual vis_dat() plots for each index value and then combines them into a single plot. This was critical in showing me where I had a missing year of data that I had not realized previously. Given that a primary use of visdat is to visualize missing values, I am even more convinced that this feature would be incredibly valuable.

See example below (this is data from the IPEDS data on higher ed institutions):


If anyone wants to take my code and modify it to their own purpose, here you go (don't judge me, it was a rush job). This is custom for a specific purpose (IPEDS data), so it will take a little work to generalize. And I'm not suggesting this as a good method for the actual visdat package, just a hack for anyone to use in the meantime.

library(magrittr) # for %>%

ipeds_visdat <- function(.data, years = "all", .sample_frac = .10) {

  # Check that data is an IPEDS survey
  if (!all(c("unitid", "year") %in% names(.data))) warning(".data does not contain a unitid or year column. Are you sure you passed an IPEDS survey?")

  # Make sure years is set
  if (!(identical(years, "all") || is.numeric(years))) stop("`years` must be \"all\" or a numeric vector of 4-digit years.")

  if (identical(years, "all")) years <- min(.data$year):max(.data$year)

  if (.sample_frac < 1) {
    # Compute the percentage first: cli reads `{.name ...}` as inline style markup
    pct <- .sample_frac * 100
    cli::cli_alert_info("Sampling data at {pct}% per year.")

    .data <- .data %>%
      dplyr::group_by(year) %>%
      dplyr::sample_frac(.sample_frac) %>%
      dplyr::ungroup()
  } else cli::cli_alert_info("Using 100% of data, this may be slow.")

  # The first plot keeps its x-axis labels
  p1 <- .data %>%
    dplyr::filter(year == years[1]) %>%
    visdat::vis_dat(warn_large_data = FALSE, sort_type = FALSE, palette = "qual") +
    ggplot2::labs(y = years[1]) +
    ggplot2::theme(plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))

  plist <- list()
  plist[[1]] <- p1

  # Remaining plots drop the x-axis labels so the stack reads as one figure
  if (length(years) > 1) {
    for (i in 2:length(years)) {
      plist[[i]] <- .data %>%
        dplyr::filter(year == years[i]) %>%
        visdat::vis_dat(warn_large_data = FALSE, sort_type = FALSE, palette = "qual") +
        ggplot2::labs(y = years[i]) +
        ggplot2::theme(axis.text.x = ggplot2::element_blank(), plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))
    }
  }

  patchwork::wrap_plots(plist, ncol = 1, guides = "collect")
}

njtierney commented 1 year ago

@jzadra I've worked on an approach for this in, how does this look to you? Currently I've just got vis_dat and vis_cor:

vis_dat(airquality, facet = Month)


vis_cor(airquality, facet = Month)

airquality %>% data_vis_dat()

jzadra commented 1 year ago

Hi @njtierney,

I think this is a great addition! I think it would be nice if there was an option for how the facets were organized just like in ggplot, as far as number of cols/rows. In many of my use cases, having the data all in one column is much easier to understand at a glance when the grouping variable is continuous or ordinal. The other feature that would help is some sampling options for large data.
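As a rough illustration of the single-column layout described above, here is a sketch that builds one `vis_dat()` plot per group and stacks them with patchwork. It does not use the new `facet` argument; the `split()` approach and the `ncol = 1` choice are assumptions about how a layout option might behave, not a description of the visdat implementation.

```r
library(visdat)
library(ggplot2)
library(patchwork)

# One data frame per month; the grouping column is dropped from each panel
# because it is constant within a group.
by_month <- split(airquality[setdiff(names(airquality), "Month")], airquality$Month)

# One vis_dat() plot per month, labelled on the y-axis with the month.
plots <- lapply(names(by_month), function(month) {
  vis_dat(by_month[[month]], sort_type = FALSE) +
    labs(y = paste("Month", month))
})

# ncol = 1 stacks the panels in a single column; nrow and ncol could be
# exposed as arguments in the same way facet_wrap() exposes them.
stacked <- wrap_plots(plots, ncol = 1, guides = "collect")
```

With an ordinal grouping variable such as month, the stacked layout keeps every column aligned vertically, so a gap in one group is easy to compare against its neighbours.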

Since I last posted, I greatly improved my function to be generalizable to any data (before it was just for IPEDS). In addition, it has the following features:

  1. Handles multiple methods in line with vis_dat functions: vis_dat, vis_miss, vis_value
  2. Handles existing grouping structure (as does yours)
  3. Makes assumptions about taking a sample fraction for large data based on the method and distributes it evenly across groups: for vis_miss and vis_value, it keeps all data. For vis_dat, it takes a fraction based on the number of rows.
  4. Has the option of using parallelization via furrr if a future::plan() is set (if it is not, the plan is sequential by default)
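For point 3, one way to spread a row budget evenly across groups could look like the sketch below. The helper name `sample_evenly()` and its `max_rows` argument are hypothetical, and this is a simplified variant of the idea rather than the code that follows: it gives every group the same number of rows instead of the same fraction.

```r
library(dplyr)

# Hypothetical helper: keep at most `max_rows` rows in total, giving every
# group the same share of the budget.
sample_evenly <- function(data, ..., max_rows = 100000) {
  grouped <- group_by(data, ...)
  per_group <- max(1, floor(max_rows / n_groups(grouped)))

  # slice_sample() works within each group; a group smaller than
  # `per_group` simply keeps all of its rows.
  grouped %>%
    slice_sample(n = per_group) %>%
    ungroup()
}

sampled <- sample_evenly(airquality, Month, max_rows = 50)
```

Sampling a fixed count per group keeps small groups visible, whereas a fixed fraction keeps the groups in proportion to their original sizes; which one is preferable depends on what the plot is meant to show.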


Anyways, I'll share this code in case any of it is useful.

#' vis_dat for grouped data
#' @description Produce a visdat plot for each group of a data frame, stacked in a single column, with optional sampling.
#' `r lifecycle::badge('maturing')`
#' Note that parallel processing is built in if a `future::plan()` is set
#' @importFrom magrittr "%>%"
#' @param .data A data frame, optionally already grouped.
#' @param ... bare, unquoted column(s) to use as the index to group by. Alternatively will accept a grouped df.
#' @param .sample_frac Fraction of observations (between 0 and 1) to sample from each group.  Default "auto" samples down to 100,000 rows, split evenly between groups for vis_dat. For vis_miss and vis_value, "auto" uses all data.
#' @param method Which visdat function to use. One of "vis_dat", "vis_miss", or "vis_value".  Accepts shorthand "dat", "val", and "miss".
#' @return visdat plot separated by grouping variable.
#' @examples
#' \dontrun{
#' ggplot2::diamonds %>% visdat_grouped(cut)
#' }
#' @importFrom rlang .data
#' @export

visdat_grouped <- function(.data, ..., method = "vis_dat", .sample_frac = "auto") {

  is_pregrouped <- dplyr::is_grouped_df(.data) # Does the data already have grouping structure?

  # Set the visdat function to use
  if (stringr::str_detect(method, "dat")) method <- "dat"
  if (stringr::str_detect(method, "val")) method <- "val"
  if (stringr::str_detect(method, "miss")) method <- "miss"
  if (!method %in% c("dat", "val", "miss")) stop("`method` must be one of \"vis_dat\", \"vis_miss\", or \"vis_value\".")

  # For val and miss we want to see all the data, hence auto = 1
  if (method %in% c("val", "miss") && identical(.sample_frac, "auto")) .sample_frac <- 1

  # Otherwise downsample
  if (identical(.sample_frac, "auto")) {
    if (nrow(.data) > 100000) {
      .sample_frac <- 100000 / nrow(.data)
      cli::cli_alert_info("Large data, automatically down-sampling data at {round(.sample_frac * 100)}%. To disable or change, set .sample_frac to a value between 0 and 1.")
    } else .sample_frac <- 1
  }

  # Group the data (unless it already is), then label each group
  if (!is_pregrouped) .data <- dplyr::group_by(.data, ...)
  group_cols <- dplyr::group_vars(.data)

  .data <- .data %>%
    dplyr::mutate(group_index = dplyr::cur_group_id()) %>%
    tidyr::unite("group_name", dplyr::all_of(group_cols), sep = "\n", remove = FALSE)

  # Do any sampling
  if (.sample_frac < 1) {
    # slice_sample() replaces the superseded sample_frac(). On grouped data,
    # `prop` is applied within each group, so the fraction is not divided by
    # the number of groups.
    .data <- dplyr::slice_sample(.data, prop = .sample_frac)
  } else cli::cli_alert_info("Using 100% of data, this may be slow.")

  # Split the data; .keep = FALSE drops the grouping columns from each piece
  .data <- dplyr::group_split(.data, .keep = FALSE)

  # Build the plot for one group; only the first group keeps its x-axis labels
  plot_group <- function(df) {
    group_name <- unique(df$group_name)
    group_index <- unique(df$group_index)
    df <- dplyr::select(df, -dplyr::all_of(c("group_name", "group_index")))

    # Methods for each visdat graph
    p <- switch(method,
      dat = visdat::vis_dat(df, warn_large_data = FALSE, sort_type = FALSE, palette = "qual"),
      val = visdat::vis_value(dplyr::select(df, tidyselect::where(is.numeric))),
      miss = visdat::vis_miss(dplyr::select(df, tidyselect::where(is.numeric)), show_perc = TRUE, warn_large_data = FALSE)
    )

    p <- p +
      ggplot2::labs(y = group_name) +
      ggplot2::theme(plot.margin = ggplot2::margin(0, 5.5, 0, 5.5, "pt"))

    if (group_index > 1) p <- p + ggplot2::theme(axis.text.x = ggplot2::element_blank())

    p
  }

  # Runs in parallel if a future::plan() is set, sequentially otherwise
  plist <- furrr::future_map(.data, plot_group)

  patchwork::wrap_plots(plist, ncol = 1, guides = "collect")
}
