jthomasmock / gtExtras

A Collection of Helper Functions for the gt Package.
https://jthomasmock.github.io/gtExtras/

gt_plt_summary(): can fail to plot overview for numeric data due to binwidth computation failure #104

Closed tjkelman closed 1 year ago

tjkelman commented 1 year ago

Prework

Didn't see anything related in any of the open or closed issues.

Description

When using gt_plt_summary() on numeric fields, the plot overview (histogram) can fail to render, with the error "'binwidth' must be positive". What's happening is that for data with only a few possible values, most of which are a single value, the IQR is zero, so the automatic binwidth algorithm suggested by Rob Hyndman (the Freedman-Diaconis rule) returns a binwidth of zero.

In this specific case I have a numeric variable with about 800 rows, each representing a rating (of quality of barn construction) from 1-5 inclusive, but most of the values are 3 with just a smattering of 1s, 2s, 4s and 5s. So the 1/4 and 3/4 quantiles both end up being 3, and thus the IQR is zero. Then line 221 in gt_summary_table.R:

bw <- 2 * IQR(col, na.rm = TRUE) / length(col)^(1 / 3)

results in bw being zero because IQR is zero. As a quick fix, perhaps if bw ends up zero just set it to 1? I'm trying to think of a use case where that fails and I'm too exhausted to come up with one right now...
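To make the failure concrete, here is a minimal base R sketch of the computation described above. The data vector is illustrative (mostly 3s), and the `safe_bw` guard is only the hypothetical "set it to 1" quick fix, not the package's actual code.

```r
# Illustrative ratings: mostly 3s, so the 25% and 75% quantiles coincide.
ratings <- c(1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5)

quartiles <- quantile(ratings, c(0.25, 0.75))
iqr <- IQR(ratings, na.rm = TRUE)

# Same formula as the line quoted above.
bw <- 2 * iqr / length(ratings)^(1 / 3)

# Hypothetical guard (an assumption, not gtExtras code):
# fall back to a width of 1 when bw is zero or otherwise unusable.
safe_bw <- if (is.finite(bw) && bw > 0) bw else 1
```

One case where a fixed fallback of 1 would likely be a poor choice is data on a much smaller scale, for example values between 0 and 0.1 that are mostly identical: a binwidth of 1 would put every observation in the same bin.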

tjkelman commented 1 year ago

Here's a reprex that demonstrates the issue. This is my first reprex (and my first time filing a GitHub issue for anything), so please don't hesitate to let me know if there's a better/preferred way to do any of this.


library(gt)
library(gtExtras)

input_data <- data.frame("ratings" = c(1,1,2,3,3,3,3,3,3,3,3,4,4,5))

gt_plt_summary(input_data)
#> Warning: Computation failed in `stat_bin()`
#> Caused by error in `bin_breaks_width()`:
#> ! `binwidth` must be positive
[Image: plt_summary]

jthomasmock commented 1 year ago

I have a potential fix locally and will push later today. In short, it still uses the Freedman-Diaconis rule, but relies on breaks instead of binwidth when bw <= 0.

library(gtExtras)
input_data <- data.frame("ratings" = c(1,1,2,3,3,3,3,3,3,3,3,4,4,5))

gt_plt_summary(input_data)

[Image: gt_plt_summary() output]
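The idea described in this comment could be sketched roughly as follows. This is not the code that was pushed to gtExtras: `hist_bin_args()` is a hypothetical helper, and using Sturges' rule with `pretty()` for the fallback breaks is an assumption made for illustration.

```r
# Hypothetical helper: choose histogram binning arguments for a numeric vector.
hist_bin_args <- function(col) {
  col <- col[!is.na(col)]

  # Freedman-Diaconis bin width, as in the formula quoted earlier in the thread.
  bw <- 2 * IQR(col) / length(col)^(1 / 3)

  if (is.finite(bw) && bw > 0) {
    list(binwidth = bw)
  } else {
    # Fallback (an assumption): evenly spaced breaks over the data range,
    # with the target bin count taken from Sturges' rule.
    list(breaks = pretty(range(col), n = nclass.Sturges(col)))
  }
}

input_data <- data.frame("ratings" = c(1,1,2,3,3,3,3,3,3,3,3,4,4,5))
bin_args <- hist_bin_args(input_data$ratings)
names(bin_args)
```

The returned list holds either `binwidth` or `breaks`, both of which `ggplot2::geom_histogram()` accepts. A column with a single distinct value is not handled here and would need its own check.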

tjkelman commented 1 year ago

So unfortunately I believe that solution is generating separate histogram plots with the hist() call for any variables whose binwidth does turn out to be 0.
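For context, base R's `hist()` draws a plot as a side effect unless it is called with `plot = FALSE`. Assuming the extra plots come from a plain `hist()` call used only to obtain breaks, one possible way to get the breaks without drawing anything is sketched below; whether this matches the eventual change in gtExtras is not stated in the thread.

```r
ratings <- c(1, 1, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 5)

# plot = FALSE returns the "histogram" object without drawing it.
h <- hist(ratings, plot = FALSE)

# The computed break points can then be reused elsewhere,
# for example as the breaks of a ggplot2 histogram.
brks <- h$breaks
```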

jthomasmock commented 9 months ago

Thanks for the followup! I'm going to make that change locally and push out.