tidymodels / infer

An R package for tidyverse-friendly statistical inference
https://infer.tidymodels.org
Other
726 stars 80 forks source link

Allow visualize() to give density curve on non theoretical distribution #539

Closed nicolasfoss closed 3 months ago

nicolasfoss commented 3 months ago

Intro

Hello! First, thank you for taking time to review this issue. This is my first time posting an issue on Github so I might be making some mistakes. Please let me know how to improve this so I can make the material more helpful or to save you time.

Issue

In cases where you may want to overlay a density curve over the histogram produced by visualize(method = "simulation"), you cannot run visualize(method = "both") or you will get an error. It might be a helpful option to see shading under a density curve instead of thr histogram shading. See below for a reprex and example of what I propose be available for the empirical distribution as well via method = "simulation". This would be quite helpful for cases when you are looking at categorical data, as the error you get (see below) mentions that.

I will admit...

I do realize there is documentation showing how to use visualize() with both the theoretical and empirical distributions, but my use case here did not apply to the documentation / vignette.

# visualize oob
## set random number seed

set.seed(123)

## create data
faux_data <- tibble(dead = sample(
  x = factor(c("yes", "no"), levels = c("yes", "no")), 
  size = 100, replace = T),
  gender = sample(
    x = factor(c("male", "female"), levels = c("male", "female")),
    size = 100,
    replace = T
    ))

## summarize data

faux_summary <- faux_data %>% 
  summarize(N = n(), 
            prop_dead = mean(dead == "yes"), 
            count_dead = sum(dead == "yes"), 
            .by = gender
            ) # females died more

## get difference

faux_diff <- faux_data %>% 
  specify(dead ~ gender, success = "yes") %>% 
  calculate(stat = "diff in props", order = c("male", "female")) %>%  # negative diff shows females died more
  pull()

## prop test
faux_test <- faux_data %>% 
  prop_test(dead ~ gender, order = c("male", "female")) # not significantly different

## model
faux_model <- faux_data %>% 
  specify(dead ~ gender, success = "yes") %>% 
  hypothesise(null = "independence") %>% 
  generate(reps = 1000, type = "permute") %>% 
  calculate(stat = "diff in props", order = c("male", "female"))

## visualize oob performance

faux_model %>% visualize() # this works fine, and you can shade the areas where the obs. stat needs to fall to be significant

## try to get the density curve - error
faux_model %>% visualize(method = "both") 

It would be great for visualize to produce the density curve for non-theoretical distributions like the one produced above here is a home grown example.

# solution
calculate_critical_values <- function(data, stat_col, alpha = 0.05) {
  data %>%
    summarise(
      lower = quantile({{stat_col}}, probs = alpha / 2),
      upper = quantile({{stat_col}}, probs = 1 - (alpha / 2))
    )
}

## get critical values

faux_crit <- faux_model %>% 
  calculate_critical_values(stat_col = stat, alpha = 0.05)

lower_crit <- faux_crit$lower

upper_crit <- faux_crit$upper

## modify df to classify stat by critical values
faux_model_mod <- faux_model %>% 
  mutate(area = if_else(stat <= lower_crit | stat >= upper_crit, TRUE, FALSE))

## Calculate density

density_data <- density(faux_model$stat)
density_df <- data.frame(x = density_data$x, y = density_data$y)

## plot the distribution of differences between proportions

{

  faux_model_diff_plot <- faux_model_mod %>% 
    ggplot(aes(x = stat, label = round(faux_diff, digits = 5))) + 
    geom_histogram(aes(y = after_stat(density)), bins = 15, fill = "lightgray", color = "black", alpha = 0.5) +
    geom_line(data = density_df, aes(x = x, y = y), color = "blue", linewidth = 1.25, alpha = 0.5) +
    geom_area(data = density_df %>% filter(x <= lower_crit), aes(x = x, y = y), fill = "coral1", alpha = 0.5) +
    geom_area(data = density_df %>% filter(x >= upper_crit), aes(x = x, y = y), fill = "coral1", alpha = 0.5) +
    geom_area(data = density_df %>% filter(x <= upper_crit, x >= lower_crit), aes(x = x, y = y), fill = "dodgerblue", alpha = 0.5) +
    ggplot2::annotate(geom = "segment", x = faux_diff, y = 0, xend = faux_diff, yend = Inf,
                      color = "coral1",
                      linewidth = 2.25) + 
    ggplot2::annotate(
      geom = "text",
      x = faux_diff - (faux_diff * 0.1),
      y = 3,
      label = paste0("diff = ", pretty_number(faux_diff, n_decimal = 5)),
      size = 8,
      family = "Work Sans",
      angle = 90
    ) + 
    labs(x = "Differences in Proportions",
         y = "Count / Kernel Density",
         title = "Differences in Proportions of Deceased Males and Females",
         subtitle = "Simulation-Based Null Distribution from 1,000 Permutated Samples",
         caption = paste0("- p = ", round(faux_test$p_value, digits = 3))
    ) + 
    theme_clean()

}

Thanks!

Thanks again for your time and I look forward to interacting with you.

simonpcouch commented 3 months ago

Thanks for the issue! Hope you don't mind that I edited your issue description to adjust code formatting.

I'll first say that stat = "z" is a "theorized" test statistic supported by infer that can apply those distributional assumptions to plot a curve for you:

library(infer)
library(tidyverse)

set.seed(123)

faux_data <- tibble(dead = sample(
   x = factor(c("yes", "no"), levels = c("yes", "no")), 
   size = 100, replace = T),
   gender = sample(
      x = factor(c("male", "female"), levels = c("male", "female")),
      size = 100,
      replace = T
   ))

faux_summary <- faux_data %>% 
   summarize(N = n(), 
             prop_dead = mean(dead == "yes"), 
             count_dead = sum(dead == "yes"), 
             .by = gender
   )

faux_diff <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   calculate(stat = "z", order = c("male", "female")) %>%
   pull()

faux_test <- faux_data %>% 
   prop_test(dead ~ gender, order = c("male", "female")) 

faux_model <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   hypothesise(null = "independence") %>% 
   generate(reps = 1000, type = "permute") %>% 
   calculate(stat = "z", order = c("male", "female"))

faux_model %>% visualize()


faux_model %>% visualize(method = "both")
#> Warning: Check to make sure the conditions have been met for the theoretical method.
#> infer currently does not check these for you.

Created on 2024-08-19 with reprex v2.1.1

In general, I think we feel that histograms best represent the empirical distributions that can be created with infer pipelines, and likely won't extend that functionality in future releases of the package. That said, the outputs of visualize() are just ggplot objects, and you're able to add elements to them as you please! With your example and a call to geom_density():

library(infer)
library(tidyverse)

set.seed(123)

faux_data <- tibble(dead = sample(
   x = factor(c("yes", "no"), levels = c("yes", "no")), 
   size = 100, replace = T),
   gender = sample(
      x = factor(c("male", "female"), levels = c("male", "female")),
      size = 100,
      replace = T
   ))

faux_summary <- faux_data %>% 
   summarize(N = n(), 
             prop_dead = mean(dead == "yes"), 
             count_dead = sum(dead == "yes"), 
             .by = gender
   )

faux_diff <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   calculate(stat = "diff in props", order = c("male", "female")) %>%
   pull()

faux_test <- faux_data %>% 
   prop_test(dead ~ gender, order = c("male", "female")) 

faux_model <- faux_data %>% 
   specify(dead ~ gender, success = "yes") %>% 
   hypothesise(null = "independence") %>% 
   generate(reps = 1000, type = "permute") %>% 
   calculate(stat = "diff in props", order = c("male", "female"))

faux_model %>% 
   visualize() +
   geom_density(aes(x = stat))

Created on 2024-08-19 with reprex v2.1.1

nicolasfoss commented 3 months ago

@simonpcouch Thanks so much for the guidance.

It looks like I can get what I was looking for already! I had not seen stat = "z" and I will make sure to dig deeper next time.

Have a great week, it was a pleasure learning with you at posit::conf.

simonpcouch commented 3 months ago

You're very welcome! You as well.

github-actions[bot] commented 2 months ago

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.